Unicode surrogate pairs (Python 3.4)

Jon Ribbens jon+usenet at unequivocal.co.uk
Sun May 3 12:30:40 EDT 2015


On 2015-05-03, Chris Angelico <rosuav at gmail.com> wrote:
> On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
><jon+usenet at unequivocal.co.uk> wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
> Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
> are in your input.

Well, they were, but I already wrote code to convert them into the
strings I showed in my original post.
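
For anyone following along, the conversion code itself was not posted, so the following is only a minimal sketch of how such a step might look. The helper name `unescape_u` and the regular expression are hypothetical, and it assumes the input only contains four-digit `\uXXXX` escapes:

```python
import re

# Hypothetical helper, not the code from the original post.
# Assumption: escapes in the input are always of the form \uXXXX (4 hex digits).
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def unescape_u(text):
    """Replace each literal \\uXXXX escape with the code point it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = r"\udb40\udd9d"      # 12 characters: backslashes are literally in the input
converted = unescape_u(raw)
print(len(raw), len(converted))
```

Note that the result still holds two separate surrogate code points, not one astral character, which is the problem discussed in the rest of this message.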

> I'm not sure what the best way to deal with that is... it's a bit of
> a mess. You may find yourself needing to do something manually,
> unless there's a way to ask Python to encode to pseudo-UCS-2 that
> allows surrogates. Some languages may have sloppy conversions
> available, but Python's seems to be quite strict (which is correct).
> Is there an errors handler that can do this?

I did some experimentation, and it looks like the answer is:

  "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")

Thanks for your help!
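
P.S. A short sketch of what the round trip above appears to do, in case it helps someone searching the archives. The function name `combine_surrogates` is made up for illustration, and it assumes every surrogate in the string belongs to a well-formed high/low pair:

```python
def combine_surrogates(text):
    """Merge surrogate pairs in a str into the code points they represent.

    "surrogatepass" lets the encoder write surrogates as plain 16-bit code
    units; the ordinary UTF-16 decoder then reads each pair as one character.
    Assumption: all surrogates in `text` are correctly paired.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

paired = combine_surrogates("\udb40\udd9d")

# Cross-check against the UTF-16 surrogate arithmetic done by hand.
expected = 0x10000 + ((0xDB40 - 0xD800) << 10) + (0xDD9D - 0xDC00)

print(len(paired), hex(ord(paired)), hex(expected))
```

The decode step uses the default strict error handling, so as far as I can tell an unpaired surrogate would make it raise UnicodeDecodeError instead of passing through silently.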


