[Python-Dev] bytes / unicode

Glyph Lefkowitz glyph at twistedmatrix.com
Wed Jun 23 02:25:31 CEST 2010


On Jun 22, 2010, at 2:07 PM, James Y Knight wrote:

> Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really want to do is pass it from one API to another, with some well-defined transformations that don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.)

But you _do_ need to decode it in this case.  If you got your URL from some funky UTF-32 data source, b"\x00\x00\x00/" is not a path separator; "/" is.  Plus, you should really be splitting the path into segments and looking at them individually, so that you don't fall victim to "%2F" bugs.  And if you want your code to be portable, you need a Unicode representation of your pathname anyway for Windows, where you also need to care about "\" as well as "/".
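A minimal sketch of that segment-splitting, assuming the wire bytes are UTF-8 (the function name and sample path are invented for illustration):

    from urllib.parse import unquote

    def path_segments(raw_path):
        # Decode the wire bytes explicitly; UTF-8 is an *assumption*
        # here -- real code should know where its bytes came from.
        text = raw_path.decode('utf-8')
        # Split *before* unquoting, so an encoded "%2F" inside a
        # segment can't masquerade as a path separator.
        return [unquote(seg) for seg in text.split('/') if seg]

    print(path_segments(b'/static/a%2Fb.txt'))  # ['static', 'a/b.txt']

The encoded slash survives as part of a single segment, which is exactly what lets you refuse it before handing anything to the filesystem.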

The fact that your wire bytes were probably ASCII(-ish), and your filesystem probably encodes pathnames as UTF-8, so that everything looks like it lines up, is no excuse for not being explicit about your expectations there.
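One way to spell out that explicitness (these are real stdlib calls in current Python; the values are illustrative):

    import os, sys

    # URLs on the wire are ASCII by spec -- say so:
    path_text = b'/static/caf%C3%A9.txt'.decode('ascii')

    # The filesystem encoding is a queryable property, not a coincidence:
    print(sys.getfilesystemencoding())   # e.g. 'utf-8'

    # os.fsencode()/os.fsdecode() apply it consistently when you must
    # cross the str/bytes boundary for pathnames:
    fs_bytes = os.fsencode('café.txt')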

You may want to transcode your characters into some other characters later, but that shouldn't stop you from treating them as characters of some variety in the meantime.

> The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :)
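(For reference, the round-trip that surrogateescape gives you -- this is standard PEP 383 behavior, shown here just for concreteness:

    raw = b'caf\xc3\xa9 plus junk \xff'
    text = raw.decode('utf-8', errors='surrogateescape')
    # The undecodable 0xFF byte is smuggled through as a lone surrogate:
    assert text == 'café plus junk \udcff'
    # ...and it encodes back to exactly the original bytes:
    assert text.encode('utf-8', errors='surrogateescape') == raw
)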

I can think of lots of optimizations that might be interesting for Python (or perhaps some other runtime less concerned with cleverness overload, like PyPy) to implement, like a lazily populated, combining-characters-aware overlay index for UTF-8 that would allow fast random access, built only when indexing actually happens.  But this could all be implemented as smartness inside .encode() and .decode() and the str and bytes types without changing the way the API works.
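A toy sketch of the shape I have in mind -- a str-like object that stores UTF-8 bytes and builds its offset index only on first random access (all names invented; a real version would live inside the C implementation and also handle slicing, combining characters, etc.):

    class LazyUTF8String:
        def __init__(self, data):
            self._data = data      # raw UTF-8 bytes
            self._index = None     # built lazily on first __getitem__

        def _build_index(self):
            # Byte offsets where code points start: a UTF-8
            # continuation byte always looks like 0b10xxxxxx.
            self._index = [i for i, b in enumerate(self._data)
                           if b & 0xC0 != 0x80]

        def __getitem__(self, i):
            if self._index is None:
                self._build_index()            # pay the cost only once
            start = self._index[i]
            end = (self._index[i + 1] if i + 1 < len(self._index)
                   else len(self._data))
            return self._data[start:end].decode('utf-8')

    s = LazyUTF8String('naïve café'.encode('utf-8'))
    print(s[3])   # 'v' -- the index is built here and reused afterwards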

I realize that there are implications at the C level, but as long as you can squeeze a function call in to certain places, it could still work.

I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is a significant overhead.  I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented only once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly.  Is memory a big deal?  Is CPU?  Is it both?  Do you want to tune for the tradeoff?  Etc.  Clever data structures seem premature until someone has a good idea of all of those things.
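As a starting point for that sort of measurement (a micro-benchmark only -- the real numbers have to come from profiling an actual application, as above):

    import timeit

    payload = 'hello, wörld! ' * 4096          # ~57 KB, mostly ASCII
    encoded = payload.encode('utf-8')

    for label, fn in [('encode', payload.encode),
                      ('decode', encoded.decode)]:
        secs = timeit.timeit(fn, number=1000)  # both default to UTF-8
        print('%s: %.1f us per call' % (label, secs / 1000 * 1e6))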


