[Python-ideas] Python 3000 TIOBE -3%

Mon Feb 13 06:16:09 CET 2012

On Mon, Feb 13, 2012 at 2:50 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Masklinn writes:
>
>  > Why not open the file in binary mode instead? (and replace `'*'`
>  > by `b'*'` in the startswith call)
>
> This will often work, but it's task-dependent.  In particular, I
> believe not just `.startswith()`, but general regexps work with either
> bytes or str in Python 3.  But other APIs may not, and you're going to
> need to prefix *all* literals (including those in modules your code
> imports!) with `b`.  So you import a module that does exactly what you
> want, and are stymied by a TypeError because the module wants Unicode.
>
> This would not happen with Python 2, and there's the rub.
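To make the quoted point concrete, here is a minimal sketch (the sample
line and the helper name `raises_type_error` are made up for
illustration): matching bytes against bytes works, as does str against
str, but mixing the two fails with a TypeError.

```python
import re

line_bytes = b"* item"   # as read from a file opened in binary mode
line_text = "* item"     # as read from a file opened in text mode

# bytes against bytes and str against str both work.
bytes_prefix_ok = line_bytes.startswith(b"*")
bytes_regex_ok = re.match(rb"\*\s+(\w+)", line_bytes) is not None
text_regex_ok = re.match(r"\*\s+(\w+)", line_text) is not None


def raises_type_error(func):
    """Hypothetical helper: report whether calling func raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# Mixing the two types is rejected rather than guessed at.
mixed_prefix_fails = raises_type_error(lambda: line_bytes.startswith("*"))
mixed_regex_fails = raises_type_error(lambda: re.match(r"\*", line_bytes))
```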

The other trap is APIs like urllib.parse, which explicitly refuse the
temptation to guess when it comes to bytes data and decode it as
"ascii+strict". If you want it to do something else that's more
permissive (e.g. "latin-1" or "ascii+surrogateescape") then you *have*
to decode it to Unicode yourself before handing it over.
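A rough sketch of what that looks like in practice, assuming Python 3.2
or later. The URL and the helper name `split_path` are made up for
illustration; only `urlsplit` itself is the real API.

```python
from urllib.parse import urlsplit

# Hypothetical input: a URL whose path contains UTF-8 encoded bytes.
raw = b"http://example.com/caf\xc3\xa9"

# Handing the bytes over directly means "ascii+strict" decoding,
# so the non-ASCII bytes are rejected.
try:
    urlsplit(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True


def split_path(url_bytes, errors="surrogateescape"):
    """Decode permissively ourselves, parse as str, re-encode the path."""
    text = url_bytes.decode("ascii", errors)
    path = urlsplit(text).path
    return path.encode("ascii", errors)


path_bytes = split_path(raw)
```

Pure ASCII bytes go through `urlsplit` unchanged and come back as
bytes, so the explicit decode step is only needed when the input may
contain something else.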

Really, Python 3 forces programmers to learn enough about Unicode to
be able to make the choice between the 4 possible options for
processing ASCII-compatible encodings:

1. Process them as binary data. This is often *not* going to be what
you want, since many text processing APIs will either only accept
Unicode, or only pure ASCII, or require you to supply encoding+errors
if you want them to process binary data.

2. Process them as "latin-1". This is the answer that completely
bypasses all Unicode integrity checks. If you get fed non-ASCII data,
you *will* silently produce gibberish as output.

3. Process them as "ascii+surrogateescape". This is the *right* answer
if you plan solely to manipulate the text and then write it back out
in the same encoding as was originally received. You will get errors
if you try to write a string with escaped characters out to a
non-ASCII channel or an ASCII channel without surrogateescape enabled.
To write such strings to non-ASCII channels (e.g. sys.stdout), you
need to remember to use something like "ascii+replace" to mask out the
values with unknown encoding first. You may still get hard-to-debug
UnicodeEncodeError exceptions when handed data in an
ASCII-incompatible encoding (like UTF-16 or UTF-32), but your odds of
silently corrupting data are fairly low.

4. Get a third party encoding guessing library and use that instead of
waving away the problem of ASCII-incompatible encodings.
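As a small sketch of how options 1 to 3 behave on the same input (the
sample bytes are made up for illustration, and option 4 is left out
because it depends on whichever third party library you pick):

```python
# Hypothetical input line: ASCII markup plus UTF-8 encoded text.
data = b"* caf\xc3\xa9\n"

# 1. Binary: works as long as every API and literal involved is bytes.
is_bullet = data.startswith(b"*")

# 2. latin-1: decoding never fails, but the non-ASCII bytes are
#    misread, and re-encoding in another encoding yields gibberish.
as_latin1 = data.decode("latin-1")
mojibake = as_latin1.encode("utf-8")

# 3. ascii+surrogateescape: unknown bytes are smuggled through as lone
#    surrogates and restored when encoding with the same settings.
escaped = data.decode("ascii", "surrogateescape")
round_trip = escaped.upper().encode("ascii", "surrogateescape")

# Writing the escaped string anywhere else fails loudly...
try:
    escaped.encode("utf-8")
    strict_write_failed = False
except UnicodeEncodeError:
    strict_write_failed = True

# ...unless the unknown values are masked out first ("ascii+replace").
masked = escaped.encode("ascii", "replace")
```

With this input, `round_trip` keeps the original two non-ASCII bytes
intact, `mojibake` has silently turned them into four, and `masked` has
replaced each of them with `?`.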

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia