[Python-Dev] bytes / unicode

Stephen J. Turnbull stephen at xemacs.org
Tue Jun 22 09:33:53 CEST 2010


Glyph Lefkowitz writes:
 > On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote:

 > > Note also that the "complete solution" argument cuts both ways.  Eg, a
 > > "complete" solution should implement UTS 39 "confusables detection"[1]
 > > and IDNA[2].  Good luck doing that with bytes!
 > 
 > And good luck doing that with just characters, too.

I agree with you, sorry.  I meant to cast doubt on the idea of
complete solutions, or at least on claims that completeness is an
excuse for putting a solution in the stdlib.

 > This is the limitation that everyone seems to keep dancing around.
 > If you are using the stdlib, with functions that operate on
 > sequences like 'str' or 'bytes', you need to choose from one of
 > three options: 

There's a *fourth* way: specially designed codecs to preserve as much
metainformation as you need, while always using the str format
internally.  This can be done for at least 100,000 separate
(character, encoding) pairs by multiplexing into private-use space with an
auxiliary table of encodings and equivalences.  That's probably
overkill.  In many cases, adding a simple PEP 383 mechanism (to preserve
uninterpreted bytes) might be enough though, and that's pretty
plausible IMO.
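
To make the PEP 383 point concrete, here is a minimal sketch of how
uninterpreted bytes can ride along inside a str and come back out
unchanged.  It assumes Python 3.1 or later, where the "surrogateescape"
error handler is available; the file name and the rename step are made
up for illustration only.

```python
# Sketch of the PEP 383 "surrogateescape" error handler (Python 3.1+).
# The byte string below is a hypothetical example, not from any real system.
raw = b"caf\xe9.txt"  # Latin-1 encoded name; 0xE9 is not valid UTF-8 here

# Decoding as UTF-8 would normally raise UnicodeDecodeError.  With
# surrogateescape, the undecodable byte 0xE9 is mapped to the lone
# surrogate U+DCE9, so the result is an ordinary str.
text = raw.decode("utf-8", errors="surrogateescape")

# Normal str operations work on the result.
renamed = text.replace(".txt", ".bak")

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9,
# so the uninterpreted byte survives the round trip.
roundtrip = renamed.encode("utf-8", errors="surrogateescape")

print(repr(text))
print(roundtrip)
```

This only preserves the bytes themselves, not which encoding they were
originally in; keeping that extra metainformation is what the
multiplexing approach would be for.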



