[Python-ideas] Processing surrogates in

Sat May 16 09:50:41 CEST 2015

On 15 May 2015 at 22:21, Paul Moore <p.f.moore at gmail.com> wrote:
> On 15 May 2015 at 02:02, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> (3) Problem: Code you can't or won't fix buggily passes you Unicode
>>     that might have surrogates in it.
>>     Solution: text-to-text codecs (but I don't see why they can't be
>>     written as encode-decode chains).
>>
>> As I've written before, I think text-to-text codecs are an attractive
>> nuisance.  The temptation to use them in most cases should be refused,
>> because it's a better solution to deal with the problem at the
>> incoming boundary or the outgoing boundary (using str<->bytes codecs).
>
> One case where I'd found a need for text->text handling (although not
> related to surrogates) was taking arbitrary Unicode and applying an
> error handler to it before writing it to a stream with "strict"
> encoding. (So something like "arbitrary text".encode('latin1',
> errors='backslashreplace').decode('latin1')).
>
> The encode/decode pair seemed ugly, although it was the only way I
> could find. I could easily imagine using a "rehandle" type of function
> for this (although I wouldn't use the actual proposed functions here,
> as the use of "surrogate" and "astral" in the names would lead me to
> assume they were inappropriate).

That's a different case, as you need to know the encoding of the
target stream in order to know which code points that codec can't
handle. Even when you do know the target encoding, Python itself has
no idea which code points a given text encoding can and can't handle,
so the only way to find out is to try it and see what happens.
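
A minimal sketch of that "try it and see" approach, wrapped around the
encode/decode pair Paul describes. The names rehandle and can_encode
are hypothetical, chosen only for illustration; they are not the
functions proposed in this thread, and they only work because the
caller supplies the target encoding:

```python
def can_encode(text, encoding):
    """Probe whether `encoding` can represent `text` by attempting it."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def rehandle(text, encoding, errors="backslashreplace"):
    """Hypothetical helper: apply an encode error handler, return text.

    The round trip through bytes is the only way to learn which code
    points the target codec rejects.
    """
    return text.encode(encoding, errors).decode(encoding)


sample = "caf\u00e9 \u20ac"            # U+20AC is not in latin-1
cleaned = rehandle(sample, "latin-1")  # 'caf\xe9 \\u20ac'
print(can_encode(sample, "latin-1"), can_encode(cleaned, "latin-1"))
```

The cleaned string can then be written to a stream that uses strict
latin-1 encoding without raising.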

The unique thing about the surrogate case is that *no* codec is
supposed to encode them, not even the universal ones:

>>> '\ud834\udd1e'.encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud834' in position 0: surrogates not allowed

>>> '\ud834\udd1e'.encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud834' in position 0: surrogates not allowed

>>> '\ud834\udd1e'.encode("utf-32")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-32' codec can't encode character '\ud834' in position 0: surrogates not allowed
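
For comparison, a short sketch of the same check as a script, together
with the two standard error handlers that opt in to surrogate
processing. The variable names are illustrative only; the
"surrogatepass" and "surrogateescape" handlers are the existing codec
error handlers, not the text->text functions under discussion:

```python
pair = "\ud834\udd1e"  # two surrogate code points, not one astral character

# Strict encoding fails for every UTF codec.
failures = {}
for codec in ("utf-8", "utf-16-le", "utf-32"):
    try:
        pair.encode(codec)
    except UnicodeEncodeError as exc:
        failures[codec] = exc.reason
print(failures)

# "surrogatepass" lets the surrogates through; decoding the resulting
# UTF-16 bytes then combines the pair into a single astral code point.
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(pair), len(combined))  # 2 1

# "surrogateescape" smuggles undecodable bytes through str as lone
# low surrogates and restores them when encoding.
escaped = b"\xff".decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
print(ascii(escaped), restored)  # '\udcff' b'\xff'
```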

Surrogate handling is purely a code point level manipulation, either
of the entire surrogate range (rehandle_surrogatepass) or of one
particular usage pattern of that range (rehandle_surrogateescape).
That is the difference that makes it possible to define text->text
APIs for surrogate manipulation without caring about the eventual
text encoding used (if any).
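
To illustrate why no encoding is needed, here is a rough sketch of a
codec-free check that looks only at code points. The helper names
has_surrogates and replace_surrogates are hypothetical and are not the
proposed rehandle_* functions; the sketch assumes that any code point
in U+D800..U+DFFF should be treated the same way:

```python
import re

# The whole surrogate range, matched directly on code points.
_SURROGATES = re.compile("[\ud800-\udfff]")


def has_surrogates(text):
    """Return True if `text` contains any surrogate code point."""
    return _SURROGATES.search(text) is not None


def replace_surrogates(text, replacement="\ufffd"):
    """Hypothetical text->text cleanup: swap each surrogate for U+FFFD."""
    return _SURROGATES.sub(replacement, text)


dirty = "a\udcffb"        # lone low surrogate, as left by surrogateescape
clean = "a\U0001d11eb"    # a real astral character is left untouched
print(has_surrogates(dirty), has_surrogates(clean))
print(ascii(replace_surrogates(dirty)))
```

Nothing here depends on the target encoding, which is exactly what
separates this case from the latin-1 one above.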

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia