Unicode surrogate pairs (Python 3.4)

Chris Angelico rosuav at gmail.com
Sun May 3 11:48:47 EDT 2015


On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet at unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are literally in your input. I'm not sure what the best way to deal
with that is... it's a bit of a mess. You may find yourself needing to
combine the surrogates manually, unless there's a way to ask Python to
encode to a pseudo-UCS-2 that allows surrogates. Some languages may have
sloppy conversions available, but Python's codecs seem to be quite
strict (which is correct). Is there an error handler that can do this?
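
One possible approach, as a rough sketch rather than a definitive answer:
Python 3 has a "surrogatepass" error handler, and from 3.4 on it works
with the UTF-16 codecs as well as UTF-8. Assuming the input contains
literal \uXXXX escapes (as IdnaTest.txt appears to), you could turn each
escape into a code unit and then round-trip through UTF-16 so that
adjacent surrogates get joined. The function name and the regex below
are made up for illustration.

```python
import re

# Hypothetical helper: matches a literal backslash-u followed by 4 hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def decode_u_escapes(text):
    # Step 1: replace each \uXXXX escape with the corresponding code unit.
    # This can leave lone surrogates in the str, which Python permits.
    units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: encode to UTF-16 while letting surrogates through, then
    # decode again; the decoder combines each high/low pair into one
    # code point.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")

result = decode_u_escapes(r"\uDB40\uDD9D")
print(len(result), hex(ord(result)))
```

The regex is used instead of the unicode_escape codec so that any
non-ASCII text already in the line is left alone.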

ChrisA


