Unicode surrogate pairs (Python 3.4)

Jon Ribbens jon+usenet at unequivocal.co.uk
Sun May 3 12:26:27 EDT 2015


On 2015-05-03, MRAB <python at mrabarnett.plus.com> wrote:
> On 2015-05-03 16:32, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>
> That document looks like it's encoded in UTF-8.

It is. But it also, for reasons best known to the Unicode Consortium,
contains strings of the form \uXXXX which need to be parsed into the
appropriate character, and some of *those* are then surrogate pairs,
which need to be further converted.
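
For illustration, here is a minimal sketch of that two-step conversion,
assuming Python 3.4 or later. The name decode_escapes and the sample
string are made up for this example and are not taken from IdnaTest.txt;
it is one possible approach, not necessarily the best one.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')


def decode_escapes(text):
    """Turn literal \\uXXXX escapes into characters, joining surrogate pairs.

    Hypothetical helper written for this example only.
    """
    # Step 1: replace each escape with the single code unit it names.
    # A surrogate pair is still two separate code points at this stage.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so that an adjacent high/low
    # surrogate pair is merged into one astral character. 'surrogatepass'
    # lets the surrogates be encoded at all; it is also passed on decode
    # with the intention that unpaired surrogates are kept, not rejected.
    raw = units.encode('utf-16-le', 'surrogatepass')
    return raw.decode('utf-16-le', 'surrogatepass')


sample = 'a\\uD83D\\uDE00b'          # the text as it would appear in a file
result = decode_escapes(sample)
print(len(sample), len(result))
```

Going via UTF-16 avoids doing the pair arithmetic by hand; the cost is
an extra encode/decode pass over every string.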


