[Python-ideas] Processing surrogates in

Tue May 5 12:00:53 CEST 2015

On May 5, 2015, at 01:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> 
> Serhiy Storchaka writes:
> 
>> Use cases include programs that use tkinter (common builds of Tcl/Tk 
>> don't accept non-BMP characters), email or wsgiref.
> 
> So, consider Tcl/Tk.  If you use it for input, no problem, it *can't*
> produce non-BMP characters.  So you're using it for output.  If,
> knowing that your design involves tkinter, you deduce you must not
> accept non-BMP characters on input, where's your problem?

The real issue with tkinter (and similar cases that can't handle non-BMP characters) is that they're actually UCS-2, and we paper over that by pretending the interface is Unicode. Maybe it would be better to wrap the low-level interfaces in `bytes` rather than `str` and put an explicit `.encode('UCS-2')` in the higher-level interfaces (or even in user code?), to make the problem obvious and debuggable rather than just pretending it doesn't exist?

(I'm not sure if we actually have a UCS-2 codec, but if not, it's trivial to write--it's just UTF-16 without surrogates.)
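To make the "UTF-16 without surrogates" idea concrete, here is a minimal sketch of such an encoder. The function name `encode_ucs2` and its `byteorder` parameter are hypothetical, made up for illustration; the sketch leans on the existing `utf-16-le`/`utf-16-be` codecs and simply refuses anything outside the BMP instead of emitting a surrogate pair.

```python
def encode_ucs2(text, byteorder="little"):
    """Encode text as UCS-2: UTF-16 code units with no surrogate pairs.

    Hypothetical helper for illustration only; not an existing stdlib API.
    """
    for index, char in enumerate(text):
        if ord(char) > 0xFFFF:
            # UTF-16 would need a surrogate pair here; UCS-2 has no such thing.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1,
                "non-BMP character not representable in UCS-2",
            )
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    # The strict UTF-16 codecs already reject lone surrogates in the input.
    return text.encode(codec)
```

With something like this at the boundary, passing an astral character to a UCS-2-only interface fails with a UnicodeEncodeError that points at the offending position, rather than failing somewhere inside the wrapped library.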

> And ... you looked twice at your proposal?  You have basically
> reproduced the codec error handling API for .decode and .encode in a
> bunch of str2str "rehandle" functions.  In other words, you need to
> know as much to use "rehandle_*" properly as you do to use .decode and
> .encode.  I do not see a win for the programmer who is mostly innocent
> of encoding knowledge.  What you're going to see is what Ezio points
> out in issue18814:
> 
>    With Python 2 I've seen a lot of people blindly trying .decode
>    when .encode failed (and the other way around) whenever they were
>    getting a UnicodeError[...].
> 
>    I'm afraid that confused developers will try to (mis)use redecode
>    as a workaround to attempt to fix something that shouldn't be
>    broken in the first place, without actually understanding what the
>    real problem is.
> 
> If we apply these rehandle_* thumbs to the holes in the I18N dike,
> it's just going to spring more leaks elsewhere.
> 
>> See issue18814. It is not so easy to get desirable result.
> 
> That's because it is damn hard to get desirable results, end of story,
> nothing to see here, move along, people, move along!  The only way
> available to consistently get desirable results is a Swiftian "Modest
> Proposal": euthanize all those miserable folks using non-UTF-8
> encodings, and start the world over again.
> 
> Seriously, I see nothing in issue18814 except frustration.  There's no
> plausible account of how these new functions are going to enable naive
> programmers to get better results, just complaints that the current
> situation is unbearable.  I can't speak to wsgiref, but in email I
> think David is overly worried about efficiency: in most mail flows,
> the occasional need to mess with surrogates is going to be far
> overshadowed by spam/virus filtering and authentication (DKIM
> signature verification and DMARC/DKIM/SPF DNS lookups) on pretty much
> all real mailflows.
> 
> So this proposal merely amounts to reintroduction of the Python 2 str
> confusion into Python 3.  It is dangerous *precisely because* the
> current situation is so frustrating.  These functions will not be used
> by "consenting adults", in most cases.  Those with sufficient
> knowledge for "informed consent" also know enough to decode encoded
> text ASAP, and encode internal text ALAP, with appropriate handlers,
> in the first place.
> 
> Rather, these str2str functions will be used by programmers at the
> ends of their ropes desperate to suppress "those damned Unicode
> errors" by any means available.  In fact, they are most likely to be
> used and recommended by *library* writers, because they're the ones
> who are least likely to have control over input, or to know their
> clients' requirements for output.  "Just use rehandle_* to ameliorate
> the errors" is going to be far too tempting for them to resist.
> 
> That Nick, of all people, supports this proposal is to me just
> confirmation that it's frustration, and only frustration, speaking
> here.  He used to be one of the strongest supporters of keeping
> "native text" (Unicode) and "encoded text" separate by keeping the
> latter in bytes.