unicode surrogates in py2.2/win

Tue Mar 8 03:18:58 EST 2005

Mike Brown wrote:
> Very strange how it only shows up after the 1st import attempt seems to 
> succeed, and it doesn't ever show up if I run the code directly or run the 
> code in the command-line interpreter.

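The reason is that Python byte code stores Unicode literals in UTF-8.
On the first import, the byte code is generated from the source, and the
unpaired surrogate is written to disk in the .pyc file. On the next
import, the compiled byte code is read back in, and the UTF-8 codec
complains about the unpaired surrogate. Running the code directly or in
the interactive interpreter never reads a .pyc file back, so the error
does not show up there.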

> Can anyone tell me what's causing this, or point me to a reference to show 
> when it was fixed? 

In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
   correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.
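
To make the difference concrete, here is a minimal sketch. It assumes a
current Python 3 interpreter, not the 2.2/2.3 releases discussed here, so
the details differ: in Python 3 the strict UTF-8 codec rejects a lone
surrogate, and the "surrogatepass" error handler gives roughly the lenient
behaviour described in the NEWS entry. The helper names are made up for
the example.

```python
# Python 3 sketch (assumption: not the 2.2 interpreter from this thread).
# These three bytes are what a lenient UTF-8 encoder produces for U+D800.
LONE_SURROGATE_BYTES = b"\xed\xa0\x80"


def decode_strict(data):
    """Return the decoded text, or None if the strict codec rejects it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data):
    """Decode while letting surrogate code points through."""
    return data.decode("utf-8", "surrogatepass")


strict_result = decode_strict(LONE_SURROGATE_BYTES)
lenient_result = decode_lenient(LONE_SURROGATE_BYTES)
```

The strict call stands in for the codec complaining when the byte code is
read back in; the lenient call stands in for a codec that accepts unpaired
surrogates.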

> I'm using 2.2.1 and I couldn't find mention of it in any
> release notes up through 2.3. Any other comments/suggestions (besides "stop 
> supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks 
> :)

I see two options. One is to compile the code with exec at run time, so
that the Unicode literal is never written to a byte code file. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    ...
    if surr1 <= i < surr2:
        ...
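
As a self-contained illustration of the second option, here is a minimal
sketch in Python 3 syntax (an assumption; on Python 2 you would call
unichr() instead of chr()). The function name classify and its return
values are made up for the example; this is not the body of your chars()
function, which I have left elided above.

```python
# Build the boundary characters at run time, so that no surrogate
# literal ever appears in the source or in the compiled byte code.
SURR1 = chr(0xD800)  # first high (lead) surrogate
SURR2 = chr(0xDC00)  # first low (trail) surrogate
SURR3 = chr(0xE000)  # first code point after the surrogate range


def classify(ch, surr1=SURR1, surr2=SURR2, surr3=SURR3):
    """Hypothetical helper: report which surrogate range ch falls in."""
    if surr1 <= ch < surr2:
        return "high"
    if surr2 <= ch < surr3:
        return "low"
    return "other"
```

Binding the boundaries as default arguments, as in the snippet above,
turns them into local names inside the function.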

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin