[Python-Dev] bytes / unicode

Mon Jun 21 18:56:11 CEST 2010

On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote:
> Lennart Regebro writes:
> 
>  > 2010/6/21 Stephen J. Turnbull <stephen at xemacs.org>:
>  > > IMO, the UI is right.  "Something" like the above "ought" to work.
>  > 
>  > Right. That said, many times when you want to do urlparse etc they
>  > might be binary, and you might want binary. So maybe the methods
>  > should work with both?
> 
> First, a caveat: I'm a Unicode/encodings person, not an experienced
> web programmer.  My opinions on whether this would work well in
> practice should be taken with a grain of salt.
> 
> Speaking for myself, I live in a country where the natives have
> saddled themselves with no less than 4 encodings in common use, and I
> would never want "binary" since none of them would display as anything
> useful in a traceback.  Wherever possible, I decode "blobs" into
> structured objects, I do it as soon as possible, and if for efficiency
> reasons I want to do this lazily, I store the blob in a separate
> .raw_object attribute.  If they're textual, I decode them to text.  I
> can't see an efficiency argument for decoding URIs lazily in most
> applications.
> 
> In the case of structured text like URIs, I would create a separate
> class for handling them with string-like operations.  Internally, all
> text would be raw Unicode (ie, not url-encoded); repr(uri) would use
> some kind of readable quoting convention (not url-encoding) to
> disambiguate random reserved characters from separators, while
> str(uri) would produce a url-encoded string.  Converting to and from
> wire format is just .encode and .decode, then, and in this country you
> need to be flexible about which encoding you use.
> 
> Agreed, this stuff is really annoying.  But I think that just comes
> with the territory.  PJE reports that folks don't like doing encoding
> and decoding all over the place.  I understand that, but if they're
> doing a lot of that, I have to wonder why.  Why not define the one
> line function and get on with life?
> 
> The thing is, where I live, it's not going to be a one line function.
> I'm going to be dealing with URLs that are url-encoded representations
> of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047!  So I need an
> API that explicitly encodes and decodes.  And I need an API that
> presents Japanese as Japanese rather than as line noise.
> 
> Eg, PJE writes
> 
>     Ugh.  I meant: 
> 
>     newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
> 
>     Which just goes to the point of how ridiculous it is to have to  
>     convert things to strings and back again to use APIs that ought to  
>     just handle bytes properly in the first place. 
> 
> But if you need that "everywhere", what's so hard about
> 
> def urljoin_wrapper(base, subdir):
>     return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')
> 
> Now, note how that pattern fails as soon as you want to use
> non-ISO-8859-1 languages for subdir names.  In Python 3, the code
> above is just plain buggy, IMHO.  The original author probably will
> never need the generalization.  But her name will be cursed unto the
> nth generation by people who use her code on a different continent.
> 
> The net result is that bytes are *not* a programmer- or user-friendly
> way to do this, except for the minority of the world for whom Latin-1
> is a good approximation to their daily-use unibyte encoding (eg, it's
> probably usable for debugging in Dansk, but you won't win any
> popularity contests in Tel Aviv or Shanghai).
> 
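To make the failure mode of the quoted wrapper concrete, here is a minimal sketch for Python 3. The base URL and the subdirectory names are made up for illustration; only urljoin itself comes from the standard library (urllib.parse).

```python
from urllib.parse import urljoin


def urljoin_wrapper(base: bytes, subdir: str) -> bytes:
    # Same pattern as the quoted wrapper: decode the bytes as Latin-1,
    # join as text, then re-encode the result as Latin-1.
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')


# Hypothetical base URL, used only for this example.
base = b'http://example.com/docs/'

# A name that Latin-1 can represent survives the round trip.
ok = urljoin_wrapper(base, 'caf\u00e9')

# A Japanese name cannot be encoded as Latin-1, so the wrapper raises.
try:
    urljoin_wrapper(base, '\u65e5\u672c\u8a9e')
    failed = False
except UnicodeEncodeError:
    failed = True
```

The wrapper only works while every subdir name happens to fit in Latin-1, which is the limitation the quoted text points out.
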
One comment here: you can also have URIs that can't be decoded into their
true textual meaning using a single encoding.

Apache will happily serve URIs that have UTF-8, Shift-JIS, and EUC-JP
components inside a single path, but if you decode the whole path with one
encoding, the intended text comes out garbled (or as escaped byte sequences).
For that matter, Apache will serve requests that have no true textual
representation at all, since it works at the byte level rather than the
character level.

So a complete solution really should allow the programmer to pass in URIs as
bytes when the programmer knows that they need to.
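
As a rough illustration of the mixed-encoding case, here is a minimal Python 3 sketch. The path is hypothetical, built for this example from one UTF-8 component and one Shift-JIS component; decodes_with is a helper name invented here. quote and unquote_to_bytes are the standard urllib.parse functions.

```python
from urllib.parse import quote, unquote_to_bytes

name = '\u65e5\u672c\u8a9e'  # the word "Japanese" written in Japanese

# Hypothetical path: the same name percent-encoded from two different
# encodings, one per path component.
path = '/' + quote(name.encode('utf-8')) + '/' + quote(name.encode('shift_jis'))

# Undo the percent-encoding but stay at the byte level.
raw = unquote_to_bytes(path)


def decodes_with(data: bytes, encoding: str) -> bool:
    # Report whether the whole byte string is valid in one encoding.
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Neither encoding can decode the whole path on its own.
whole_utf8 = decodes_with(raw, 'utf-8')
whole_sjis = decodes_with(raw, 'shift_jis')

# Splitting at the byte level first lets each component be decoded
# with the encoding it was actually written in.
first, second = raw.split(b'/')[1:]
components = (first.decode('utf-8'), second.decode('shift_jis'))
```

The point of the sketch is only that the caller has to keep the path as bytes long enough to split it; an API that insists on a single decode up front gives the caller no way to do that.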

-Toshio