Unicode surrogate pairs (Python 3.4)

Sun May 3 11:48:47 EDT 2015

On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet at unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are literally in your input, so decoding the escapes gives you two
separate surrogates rather than one astral character. I'm not sure
what the best way to deal with that is... it's a bit of a mess. You
may find yourself needing to combine them manually, unless there's a
way to ask Python to encode to pseudo-UCS-2 that allows surrogates.
Some languages may have sloppy conversions available, but Python's
seems to be quite strict (which is correct). Is there an errors
handler that can do this?
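
One possibility, as a rough sketch rather than a tested recipe: the
'surrogatepass' error handler, which as far as I know works with the
UTF-16 codec as of Python 3.4. The idea is to decode the backslash
escapes first, then round-trip through UTF-16 so that the two lone
surrogates get read back as a single pair. The function name
decode_escaped is made up for illustration, and I'm assuming the
escaped field is pure ASCII; non-ASCII input would need different
handling before the unicode_escape step.

```python
def decode_escaped(field):
    """Turn literal \\uXXXX escapes into characters, merging surrogate pairs.

    Hypothetical helper; assumes `field` contains only ASCII characters.
    """
    # Step 1: interpret the backslash escapes. For "\udb40\udd9d" this
    # yields a 2-character str holding two lone surrogates.
    with_surrogates = field.encode("ascii").decode("unicode_escape")

    # Step 2: 'surrogatepass' lets the lone surrogates be written out as
    # UTF-16 code units; decoding those bytes then sees a valid pair.
    as_utf16 = with_surrogates.encode("utf-16", "surrogatepass")
    return as_utf16.decode("utf-16")


sample = r"\udb40\udd9d"          # 12 ASCII characters, as found in the file
result = decode_escaped(sample)
print(len(result), hex(ord(result)))   # 1 0xe019d
```

The second decode is still strict, so a surrogate that has no partner
would raise UnicodeDecodeError there instead of slipping through.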

ChrisA