Unicode surrogate pairs (Python 3.4)

Jon Ribbens jon+usenet at unequivocal.co.uk
Sun May 3 11:32:14 EDT 2015


On 2015-05-03, Chris Angelico <rosuav at gmail.com> wrote:
> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
><jon+usenet at unequivocal.co.uk> wrote:
>> If I have a string containing surrogate pairs like this in Python 3.4:
>>
>>   "\udb40\udd9d"
>>
>> How do I convert it into the proper form:
>>
>>   "\U000E019D"
>>
>> ? The answer appears not to be "unicodedata.normalize".
>
> No, it's not, because Unicode normalization is a very specific thing.
> You're looking for a fix for some kind of encoding issue; Unicode
> normalization translates between combining characters and combined
> characters.
>
> You shouldn't even actually _have_ those in your string in the first
> place. How did you construct/receive that data? Ideally, catch it at
> that point, and deal with it there.

That would, unfortunately, be "tell the Unicode Consortium to format
their documents differently", which seems unlikely to happen. I'm
trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

> But if you absolutely have to convert the surrogates, it ought to be
> possible to do a sloppy UCS-2 conversion to bytes, then a proper
> UTF-16 decode on the result.

Python doesn't appear to have UCS-2 support, so I guess what you're
saying is that I have to write my own surrogate-decoder? This seems
a little surprising.



More information about the Python-list mailing list