[Python-Dev] bytes / unicode

Robert Collins robertc at robertcollins.net
Mon Jun 21 20:59:26 CEST 2010


2010/6/21 Stephen J. Turnbull <stephen at xemacs.org>:
> Robert Collins writes:
>
>  > Also, url's are bytestrings - by definition;
>
> Eh?  RFC 3896 explicitly says

"Definitions of Managed Objects for the DS3/E3 Interface Type"

Perhaps you mean 3986 ? :)

>    A URI is an identifier consisting of a sequence of characters
>    matching the syntax rule named <URI> in Section 3.
>
> (where the phrase "sequence of characters" appears in all ancestors I
> found back to RFC 1738), and

Sure, ok, let me unpack what I meant just a little. An abstract URI is
neither unicode nor bytes per se - see section 1.2.1: "A URI is a
sequence of characters from a very limited set: the letters of the
basic Latin alphabet, digits, and a few special characters."

URI interpretation is fairly strictly separated between producers and
consumers. A consumer can manipulate a URL with other URL fragments -
e.g. doing urljoin - but it needs to keep the URL as a URL and not try
to decode it to a unicode representation.
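To illustrate (a sketch, using made-up URLs): current urllib.parse accepts bytes as well as str, so a consumer can do exactly this kind of manipulation without ever choosing an encoding:

```python
from urllib.parse import urljoin

# Combine URL fragments while keeping everything as bytes - no decode step:
base = b"http://server/a/b"
joined = urljoin(base, b"../c")   # relative reference resolved per the URL rules
other = urljoin(base, b"c")
```

Mixing bytes and str arguments raises TypeError, which is the library nudging you to stay on one side of the producer/consumer line.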

The producer of the url however, can decode via whatever heuristics it
wants - because it defines the encoding used to go from unicode to URL
encoding.

As an example, if I give the uri "http://server/%c3%83", rendering
that as http://server/Ã can lead to transcription errors and
reinterpretation problems unless you know - out of band - that the
server is using utf8 to encode. Conversely, if someone enters
http://server/Ã into their browser window, choosing utf8 or their
local encoding is quite arbitrary and may not match how the server
would represent that resource.
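The ambiguity is easy to demonstrate with urllib.parse (a sketch; the encodings are just two plausible guesses a consumer might make):

```python
from urllib.parse import quote, unquote

# The same percent-encoded octets decode differently per charset guess:
as_utf8 = unquote("%C3%83", encoding="utf-8")     # one character: 'Ã'
as_latin1 = unquote("%C3%83", encoding="latin-1")  # two characters

# And going the other way, the producer's encoding choice changes the URL:
from_utf8 = quote("\u00c3", encoding="utf-8")     # '%C3%83'
from_latin1 = quote("\u00c3", encoding="latin-1")  # '%C3'
```

Round-tripping through the wrong guess silently names a different resource, which is exactly the transcription problem above.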

Beyond that, producers can do odd things - for example, when a series
of servers is stacked and forwarding requests amongst themselves, they
can generate different parts of the same URL using different
encodings.

>    2.  Characters
>
>    The URI syntax provides a method of encoding data, presumably for
>    the sake of identifying a resource, as a sequence of characters.
>    The URI characters are, in turn, frequently encoded as octets for
>    transport or presentation.  This specification does not mandate any
>    particular character encoding for mapping between URI characters
>    and the octets used to store or transmit those characters.  When a
>    URI appears in a protocol element, the character encoding is
>    defined by that protocol; without such a definition, a URI is
>    assumed to be in the same character encoding as the surrounding
>    text.

That's true, but it's been taken out of context; the set of characters
permitted in a URL is a strict subset of ASCII; there is a BNF that
defines it and it is quite precise. While it doesn't define a set of
octets, it also doesn't define support for unicode characters -
individual schemes need to define the mapping used between characters
defined as safe and those that get percent encoded. E.g. unicode
(abstract) -> utf8 -> percent encoded.
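A minimal sketch of that character restriction (the regex approximates the RFC 3986 character classes - unreserved, reserved, and '%' - and checks only the alphabet, not the full grammar):

```python
import re

# The entire alphabet a URI may be written in: a strict subset of ASCII.
URI_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*$")

ok = URI_CHARS.match("http://server/%C3%83")       # percent-encoded: inside the set
bad = URI_CHARS.match("http://server/\u00c3")      # raw 'Ã': outside the set
```

Anything outside the set - including every non-ASCII character - has to arrive via the scheme's percent-encoding step.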

See also the section on comparing URLs - Unicode isn't at all relevant.

>  > if the standard library has made them unicode objects in 3, I
>  > expect a lot of pain in the webserver space.
>
> Yup.  But pain is inevitable if people are treating URIs (whether URLs
> or otherwise) as octet sequences.  Then your base URL is gonna be
> b'mailto:stephen at xemacs.org', but the natural thing the UI will want
> to do is
>
>    formurl = baseurl + '?subject=うるさいやつだなぁ…'
>
> IMO, the UI is right.  "Something" like the above "ought" to work.

I wish it would. The problem is not in Python here though - and
casually handwaving will exacerbate it, not fix it. Modelling URLs as
string-like things is great from a convenience perspective, but, like
file paths, they are much more complex and difficult.

For your particular case, subject contains characters outside the URL
specification, so someone needs to choose an encoding to get them into
a sequence-of-bytes-that-can-be-percent-escaped.
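So before that subject line can travel in a URL, somebody picks the charset. A sketch with urllib.parse.urlencode, which exposes this choice as its encoding parameter (Shift_JIS here is just an arbitrary alternative a legacy server might expect):

```python
from urllib.parse import urlencode

subject = "うるさいやつだなぁ"
# urlencode defaults to UTF-8; a server expecting another charset sees
# completely different octets for the same abstract text:
as_utf8 = urlencode({"subject": subject})
as_sjis = urlencode({"subject": subject}, encoding="shift_jis")
```

Both results are valid percent-encoded query strings; only out-of-band knowledge says which one the server will understand.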

Section 2.5, Identifying Data, goes into this to some degree. Note a
trap - the last paragraph says 'when a *NEW* URI scheme...' (emphasis
mine). Existing schemes do not mandate UTF8, which is why the
producer/consumer split matters. I spent a few minutes looking, but
it's lost in the minutiae somewhere - HTTP does not specify UTF8
(though I wish it would) for its URIs, and std66 is the generic
definition and rules for new URI schemes, preserving intact the
mistake of HTTP.

> So the function that actually handles composing the URL should take a
> string (ie, unicode), and do all escaping.  The UI code should not
> need to know about escaping.  If nothing escapes except the function
> that puts the URL in composed form, and that function always escapes,
> life is easy.

Arg. The problem is very similar to the file system problem:
 - We get given a sequence of bytes
 - we have some rules that will let us manipulate the sequence to get
hostnames, query parameters and so forth
 - and others to let us walk a directory structure
 - and no guarantee that any of the data is in any particular encoding
other than 'URL'.
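That is essentially what the stdlib's urlsplit offers: it applies the URL grammar to a byte sequence and hands back byte components, deferring any charset decision entirely (a sketch with a hypothetical URL):

```python
from urllib.parse import urlsplit

# Structural manipulation only; no encoding assumed beyond 'URL' itself.
parts = urlsplit(b"http://server/path?q=%C3%83")
```

The hostname, path, and query come back as bytes, still percent-encoded, ready for further URL-level manipulation.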

In terms of sequence datatypes then, we can consider a few:
 - bytes
 - unicode
 - list-of-numbers
 - ...

unicode is a problem because the system we're talking to is defined
to be a superset of unicode. People can shove stuff into the unused
unicode plane, and it's OK by the URL standard (for all that it would
be ugly). Having a part-unicode, part-bytes representation would be
pretty ugly IMO; certainly decoding only part of the URL would be
prone to the sorts of issues Python 2 had with str/unicode.

lists of numbers are really awkward to manipulate.

bytes doesn't suffer the unicode problem; it can represent everything
we receive, but it doesn't offer any particular support for getting a
unicode string *when one is available*.
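A sketch of that gap: from bytes we can always recover the underlying octets, but a text view exists only if the producer's (unknown) encoding happens to decode - here UTF-8 is used purely as a guess:

```python
from urllib.parse import unquote_to_bytes

raw = b"http://server/%C3%83"
octets = unquote_to_bytes(raw)        # always works: just octets
try:
    text = octets.decode("utf-8")     # only valid if the producer used UTF-8
except UnicodeDecodeError:
    text = None                       # no unicode string available; stay in bytes
```

The try/except is the whole point: the bytes representation never lies, and the unicode view is explicitly best-effort.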

> Of course, in real life it's not that easy.  But it's possible to make
> things unnecessarily hard for the users of your URI API(s), and one
> way to do that is to make URIs into "just bytes" (and "just unicode"
> is probably nearly as bad, except that at least you know it's not
> ready for the wire).

If Unicode were relevant to HTTP, I'd agree, but it's not; we should
put fragile heuristics at the outer layer of the API and work as
robustly and mechanically as possible at the core. Where we need to
guess, we need worker functions that won't guess at all - for the
sanity of folk writing servers and protocol implementations.

-Rob
