[Python-ideas] Processing surrogates in

Wed May 6 09:56:36 CEST 2015

Nick Coghlan writes:

 > If a developer only cares about Windows, Mac OS X, or modern systemd
 > based *nix systems that use UTF-8 as the system locale, and they never
 > set "LANG=C" before running a Python program, then these new functions
 > will be completely irrelevant to them.

"Irrelevant" is wildly optimistic.  They are a gift from heaven for
programmers who are avoiding developing Unicode skills.  Don't tell me
those skills are expensive -- I know, I sweat blood and spilt milk to
acquire them.  Nevertheless, without acquiring a modicum of those
skills, use of these proposed APIs is just what Ezio described:
applying any random thing that might work, to shut up those annoying
Unicode errors.  But these *will* *appear* to work, because they are
*designed* to smuggle the unprintable all the way to the output medium
by giving it a printable encoding.  You'll only find out that it was
done incorrectly when the user goes "achtung! mojibake!", and that
will be way too late.

 > If, however, a developer wants to handle "LANG=C", or other non-UTF-8
 > locales reliably across the full spectrum of *nix systems in Python 3,
 > they need a way to cope with system data that they *know* has been
 > decoded incorrectly by the interpreter,

But if so, why is this being discussed as a visible addition to the
Python API?  AFAICS, .decode('ascii', errors='surrogateescape') plus
some variant on

for encoding in plausible_encoding_by_likelihood_list:
    try:
        s = input.encode('ascii', errors='surrogateescape')
        s = s.decode(encoding, errors='strict')
        break
    except UnicodeError:
        continue

is all you really need inside of the Python init sequence.  That is
how I read your opinion, too.
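
For readers who want to run it, here is a minimal sketch of the loop
above.  The function name redecode, the parameter candidates, and the
sample data are hypothetical and chosen only for illustration; the
only library calls used are str.encode and bytes.decode with the
standard 'surrogateescape' error handler.

```python
def redecode(text, candidates):
    """Re-decode text that was decoded as ASCII with surrogateescape.

    `redecode` and `candidates` are hypothetical names, not an existing API.
    Returns (decoded_text, encoding_used), or (text, None) if nothing fits.
    """
    # Recover the original bytes: escaped surrogates become raw bytes again.
    raw = text.encode('ascii', errors='surrogateescape')
    for encoding in candidates:
        try:
            return raw.decode(encoding, errors='strict'), encoding
        except UnicodeError:
            continue
    # No candidate decoded cleanly, so keep the escaped form unchanged.
    return text, None


# Simulate UTF-8 bytes that were decoded as ASCII (as happens under LANG=C).
original = 'caf\u00e9'.encode('utf-8')
smuggled = original.decode('ascii', errors='surrogateescape')
fixed, used = redecode(smuggled, ['utf-8', 'latin-1'])

# A byte sequence that is not valid UTF-8 is left in its escaped form.
bad = b'\xff'.decode('ascii', errors='surrogateescape')
unchanged, none_used = redecode(bad, ['utf-8'])
```

The order of the candidate list matters: 'latin-1' accepts any byte
sequence, so it should come last if it is listed at all.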

 > The other suggested functions are then more about providing a "peek
 > behind the curtain" API for folks that want to *use Python* to explore
 > some of the ins and outs of Unicode surrogate handling.

I just don't see a need.  .encode and .decode already give you all the
tools you need for exploring, and they do so in a way that tells you
via the type whether you're looking at abstract text or at the
representation.  It doesn't get better than this!
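
As an illustration of that point, the following sketch uses only
.encode and .decode to look at surrogate pairs and at surrogateescape.
The variable names are hypothetical; the str/bytes types show at each
step whether you hold abstract text or a representation.

```python
# Abstract text: one astral code point, held in a str.
s = '\U0001F600'

# A representation: the UTF-16 (big-endian) encoding, held in bytes.
utf16 = s.encode('utf-16-be')

# Split the bytes into 16-bit code units to see the surrogate pair.
units = [int.from_bytes(utf16[i:i + 2], 'big') for i in range(0, len(utf16), 2)]

# surrogateescape smuggles an undecodable byte into a str as a lone surrogate.
escaped = b'\xff'.decode('ascii', errors='surrogateescape')

# A strict encoder refuses the lone surrogate instead of writing it out.
try:
    escaped.encode('utf-8')
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# The same error handler restores the original byte on the way out.
restored = escaped.encode('ascii', errors='surrogateescape')
```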

And if the APIs merely exposed the internal representation, that would
be one thing.  But they don't, and the people who are saying, "I'm not
an expert on Unicode but this looks great!" are clearly interested in
transforming str instances into something more palatable to the
modules and I/O systems they need to use, which aren't prepared for
astral characters or proper handling of surrogateescapes.

 > I can't actually think of a practical purpose for them other than
 > teaching people the basics of how Unicode representations work,

I agree, but it seems to me that a lot of people are already scheming
to use them for practical purposes.  Serhiy mentions tkinter, email,
and wsgiref, and David lusts after them for email.