Cult-like behaviour [was Re: Kindness]

Marko Rauhamaa marko at pacujo.net
Sun Jul 15 04:39:40 EDT 2018


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> Of course we have no idea what Marko's software is, or what it is doing, 

Correct, you don't, but the link Paul Rubin posted gives you an idea:

   Python 3 says: everything is Unicode (by default, except in certain
   situations, and except if we send you crazy reencoded data, and even
   then it's sometimes still unicode, albeit wrong unicode). Filenames
   are Unicode, Terminals are Unicode, stdin and out are Unicode, there
   is so much Unicode! And because UNIX is not Unicode, Python 3 now has
   the stance that it's right and UNIX is wrong

   <URL: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/>

> [Marko]
>>> No, as a large number of Python3 facilities require str objects as
>>> arguments. Consider urllib.request.urlopen(), for example, which
>>> requires a URL to be an str object.
>
> That's because URLs are fundamentally text strings.

<URL: https://tools.ietf.org/html/rfc1738>:

   In most URL schemes, the sequences of characters in different parts
   of a URL are used to represent sequences of octets used in Internet
   protocols. For example, in the ftp scheme, the host name, directory
   name and file names are such sequences of octets, represented by
   parts of the URL.

(RFC 3986 says the same thing in a more roundabout way.)

A URL consists of ASCII-only characters that represent an octet string.

Of course, ASCII characters *are* Unicode characters.
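
To make the octet-string view concrete, here is a minimal sketch using the
standard urllib.parse functions. The sample path is only an illustration;
the point is that the ASCII text of a URL depends on which octets you
meant, not on which characters you typed:

```python
from urllib.parse import quote, unquote_to_bytes

path = "/ä"  # illustrative path, chosen for this sketch

# The same character becomes different octets under different encodings,
# so the ASCII-only URL text differs as well.
utf8_url = quote(path)                        # UTF-8 is the default
latin1_url = quote(path, encoding="latin-1")

print(utf8_url)     # /%C3%A4
print(latin1_url)   # /%E4

# Undoing the percent-escapes yields octets, not text.
print(unquote_to_bytes(utf8_url))    # b'/\xc3\xa4'
print(unquote_to_bytes(latin1_url))  # b'/\xe4'
```

Both results are plain ASCII strings, yet they name different octet
sequences, and nothing in the URL itself says which encoding was intended.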

> Quick quiz: which of the following are real URLs?
> (a)  http://правительство.рф

On the face of it, that is not a valid URL because it contains non-ASCII
characters. However, hostnames can be mapped to ASCII and back, somewhat
bijectively, using punycode.
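
A minimal sketch of that mapping, assuming Python's built-in "idna" codec
(which implements the older IDNA 2003 rules) is an acceptable stand-in for
whatever a given client actually does:

```python
host = "правительство.рф"  # the hostname from the quiz above

# Each non-ASCII label becomes an ASCII-only "xn--" label.
ascii_host = host.encode("idna")
print(ascii_host)

# Decoding gives back the original text ...
assert ascii_host.decode("idna") == host

# ... but the mapping is only "somewhat" bijective: the codec case-folds
# non-ASCII labels, so two different inputs give the same ASCII form.
assert host.upper() != host
assert host.upper().encode("idna") == ascii_host
```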

But try this:

   >>> import http.client
   >>> conn = http.client.HTTPConnection("example.com")
   >>> conn.request("GET", "/ä")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python3.5/http/client.py", line 1107, in request
       self._send_request(method, url, body, headers)
     File "/usr/lib64/python3.5/http/client.py", line 1142, in _send_request
       self.putrequest(method, url, **skips)
     File "/usr/lib64/python3.5/http/client.py", line 984, in putrequest
       self._output(request.encode('ascii'))
   UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 5: ordinal not in range(128)
   >>> conn = http.client.HTTPConnection("example.com")
   >>> conn.request("GÄT", "/")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python3.5/http/client.py", line 1107, in request
       self._send_request(method, url, body, headers)
     File "/usr/lib64/python3.5/http/client.py", line 1142, in _send_request
       self.putrequest(method, url, **skips)
     File "/usr/lib64/python3.5/http/client.py", line 984, in putrequest
       self._output(request.encode('ascii'))
   UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: ordinal not in range(128)

IOW, the method and URL path given to conn.request are str objects but
they are really just thinly veiled containers for ASCII bytes objects.
That approach is very similar to mine.
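
In practice the caller has to do the byte-level work before handing the
str over. A rough sketch, where to_request_target is a hypothetical helper
name and UTF-8 is an assumed choice of encoding (the server may expect
something else), and where the input is assumed to be a bare path that has
not been percent-encoded already:

```python
from urllib.parse import quote

def to_request_target(path):
    """Hypothetical helper: percent-encode a raw path as UTF-8 octets so
    the result survives the encode('ascii') step seen in the traceback."""
    target = quote(path, safe="/")  # a literal '%' would be escaped again
    target.encode("ascii")          # raises UnicodeEncodeError if not ASCII
    return target

# The raw str fails the same check http.client applies ...
try:
    "/ä".encode("ascii")
except UnicodeEncodeError as exc:
    print("raw path rejected:", exc.reason)

# ... while the percent-encoded form is an ordinary ASCII-only str.
print(to_request_target("/ä"))  # /%C3%A4
```

No connection is made here; the sketch only reproduces the ASCII check
from the traceback, and other Python versions may validate the request
line differently.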


Marko


