[Python-ideas] Processing surrogates in

Thu May 7 08:47:19 CEST 2015

On 6 May 2015 at 17:56, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>  > The other suggested functions are then more about providing a "peek
>  > behind the curtain" API for folks that want to *use Python* to explore
>  > some of the ins and outs of Unicode surrogate handling.
>
> I just don't see a need.  .encode and .decode already give you all the
> tools you need for exploring, and they do so in a way that tells you
> via the type whether you're looking at abstract text or at the
> representation.  It doesn't get better than this!
>
> And if the APIs merely exposed the internal representation that would
> be one thing.  But they don't, and the people who are saying, "I'm not
> an expert on Unicode but this looks great!" are clearly interested in
> mutating str instances to be something more palatable to the requisite
> modules and I/O systems they need to use, but which aren't prepared for
> astral characters or proper handling of surrogateescapes.
>
>  > I can't actually think of a practical purpose for them other than
>  > teaching people the basics of how Unicode representations work,
>
> I agree, but it seems to me that a lot of people are already scheming
> to use them for practical purposes.  Serhiy mentions tkinter, email,
> and wsgiref, and David lusts after them for email.

While I personally care about the OS boundary case, that's not the
only "the metadata cannot be fully trusted" case that comes up (and
yes, I know I'm contradicting what I posted yesterday - I hadn't
reread the issue tracker thread at that point, so I'd forgotten the
cases the others had mentioned, and hadn't even fully reloaded my own
rationale for wanting the feature back into my brain).

The key operation to be supported by the proposed APIs is to allow a
piece of code to interrogate a string object to ask: "Was this string
permissively decoded *and* did that process leave some invalid code
points in the string?".
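
A minimal sketch of that question using only what exists today. The
helper names contains_surrogates() and contains_escaped_bytes() are
made up for illustration and are not part of the proposed API:

```python
import re

# Code points U+D800..U+DFFF are surrogates; on their own they are not valid text.
_ANY_SURROGATE = re.compile('[\ud800-\udfff]')
# surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')


def contains_surrogates(text):
    """Return True if text holds any lone surrogate code point."""
    return _ANY_SURROGATE.search(text) is not None


def contains_escaped_bytes(text):
    """Return True if text holds code points that surrogateescape would produce."""
    return _ESCAPED_BYTE.search(text) is not None
```

An astral character such as '\U0001F600' is a single code point in a
Python 3 str, so neither helper reports it.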

Essentially, it's designed to cover the cases where the interpreter
(or someone else) is using the "surrogateescape" or "surrogatepass"
error handler when decoding some input data to text (I don't believe
the interpreter defaults to using surrogatepass anywhere, but we do
use surrogateescape in several places).

If your code has direct control over the decoding step, you don't need
anything new to deal with this appropriately, as you can just change
the error handling mode to "strict" and be done with it.
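
For example, when the decoding happens in your own code, the difference
is just the errors argument. The byte string and the decode_strictly()
helper below are made up for illustration:

```python
def decode_strictly(raw, encoding='utf-8'):
    """Decode bytes, reporting bad input instead of smuggling it through."""
    try:
        return raw.decode(encoding, errors='strict')
    except UnicodeDecodeError as exc:
        raise ValueError('input is not valid %s: %s' % (encoding, exc)) from exc


raw = b'report-\xff.txt'  # hypothetical input; 0xFF is never valid in UTF-8

# Permissive decoding hides the bad byte inside the string as U+DCFF.
permissive = raw.decode('utf-8', errors='surrogateescape')
```

Calling decode_strictly(raw) raises immediately, at the point where the
bytes enter the program.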

However, if you *don't* have control over the decoding step, then a)
you can't switch it to a different error handler (as the decoding
isn't happening in your code); and b) you don't necessarily know what
the assumed encoding was, so your best guess is going to be "hopefully
something ASCII compatible". That guess introduces all kinds of other
complexity, as you have to start considering what happens to code
points outside the surrogate area if you do an encode()/decode() dance
in order to apply a different error handler to the smuggled
surrogates.
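
A rough illustration of why that dance is fragile. The sample string
and the redecode() helper are made up for this sketch; the point is
that the outcome depends entirely on the encoding you guess:

```python
# One genuine non-ASCII character (U+00E9) plus one escaped byte (0xFF).
text = 'caf\xe9-\udcff'


def redecode(text, guessed_encoding, errors='replace'):
    """Round-trip through bytes to apply a different input error handler."""
    raw = text.encode(guessed_encoding, 'surrogateescape')
    return raw.decode(guessed_encoding, errors)


# Guessing UTF-8 works here: the escaped byte becomes U+FFFD.
as_utf8 = redecode(text, 'utf-8')

# Guessing Latin-1 "succeeds" too, but the bad byte silently turns into
# the valid character U+00FF, so the error is hidden rather than handled.
as_latin1 = redecode(text, 'latin-1')

# Guessing ASCII fails on the genuine character before the surrogate
# is ever reached.
try:
    redecode(text, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```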

Hence the rehandle_surrogatepass() and rehandle_surrogateescape()
methods: by default, they will both *throw an exception* if there is
improperly decoded data in the input, as they apply the "strict" input
error handler instead of whichever one was actually used. This lets
you control where such errors are detected (e.g. at the point where
the string is first given to your code), rather than having it happen
implicitly later when you attempt to encode those strings to bytes.

rehandle_surrogateescape() also has the virtue of scanning the
supplied string for *other* lone surrogates (created via
surrogatepass) and *always* complaining about them (again, at a point
you choose, rather than happening unexpectedly elsewhere in the code,
often as part of an IO operation).

The "errors" argument is then designed to let you apply an arbitrary
*input* error handler to surrogates that were originally let through
by "surrogatepass" or "surrogateescape" (again, the assumption here is
that you don't control the code that did the original decoding). If
you decide to throw that improperly decoded data away entirely, you
may use "replace" or "ignore" to clean it out. Alternatively, you may
use "backslashreplace" (which is now usable on decoding as well as on
encoding) to replace the unknown bytes with their hexadecimal
representation.
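
A pure-Python sketch of the behaviour described above, to make the
semantics concrete. This is an approximation written for illustration,
not the implementation from the issue tracker; the choice of ValueError
for other lone surrogates is an assumption, and "backslashreplace" on
decoding needs Python 3.5 or later:

```python
import re

# Code points produced by surrogateescape for bytes 0x80..0xFF.
_ESCAPED = re.compile('[\udc80-\udcff]+')
# Every other surrogate code point (e.g. let through by surrogatepass).
_OTHER_SURROGATES = re.compile('[\ud800-\udc7f\udd00-\udfff]')


def rehandle_surrogateescape(text, errors='strict'):
    """Re-apply an input error handler to bytes smuggled by surrogateescape."""
    if _OTHER_SURROGATES.search(text):
        # Assumption: always complain about surrogates that cannot have
        # come from surrogateescape, whatever handler was requested.
        raise ValueError('string contains lone surrogates')

    def _rehandle(match):
        # Recover the original bytes, then decode them as ASCII so that
        # every one of them (all >= 0x80) goes to the chosen error handler.
        raw = match.group().encode('ascii', 'surrogateescape')
        return raw.decode('ascii', errors)

    return _ESCAPED.sub(_rehandle, text)
```

With the default errors='strict' this raises UnicodeDecodeError as soon
as any escaped byte is present; clean strings pass through unchanged.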

Regardless of which specific approach you take, handling surrogates
explicitly when a string is passed to you from an API that uses
permissive decoding lets you avoid both unexpected UnicodeEncodeError
exceptions (if the surrogates end up being encoded with an error
handler other than surrogatepass or surrogateescape) and propagating
mojibake (if the surrogates are encoded with a suitable error handler,
but an encoding that differs from the original).
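
Both failure modes can be reproduced with the existing codecs. The
sample data is made up: UTF-8 bytes that some other code decoded as
ASCII with surrogateescape:

```python
original = 'caf\xe9'.encode('utf-8')                    # b'caf\xc3\xa9'
smuggled = original.decode('ascii', 'surrogateescape')  # 'caf\udcc3\udca9'

# Failure 1: a strict encode later on raises, far from where the bad
# data entered the program.
try:
    smuggled.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Failure 2: surrogateescape with a *different* encoding writes the raw
# bytes back out unchanged, and a Latin-1 reader then sees mojibake.
written = smuggled.encode('latin-1', 'surrogateescape')
seen_by_reader = written.decode('latin-1')
```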

As far as "handle_astrals()" and friends go, I previously suggested on
the issue that they could potentially be considered as a separate RFE,
as their practical applicability is likely to be limited to cases
where you need to deal with a UCS-2 (note: *not* UTF-16) API for some
reason. I think they highlight an interesting aspect of what
surrogate and astral code points *are*, but they don't have the same
input validation use case that rehandle_surrogatepass and
rehandle_surrogateescape do.
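
To show what an astral code point looks like to a UCS-2 consumer, here
is a small sketch using only existing codecs. The helper names
utf16_code_units() and has_astrals() are made up for illustration and
are not the proposed handle_astrals() API:

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units for text as a list of integers."""
    data = text.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def has_astrals(text):
    """Return True if text has code points outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


# U+1F600 is one code point in a str, but two code units in UTF-16:
# a high surrogate followed by a low surrogate.
units = utf16_code_units('\U0001F600')
```

Here units is [0xD83D, 0xDE00]. A UCS-2 API treats every 16-bit unit as
a complete character, so it has no way to represent that pair as one
character.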

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia