unicode surrogates in py2.2/win

Tue Mar 8 02:02:54 EST 2005

In mid-October 2004, Jeff Epler helped me here with this string iterator:

from __future__ import generators  # needed for 'yield' on Python 2.2

def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for
    its internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i

I have since discovered that I can't use it on Python 2.2 on Windows because
of a module import bug triggered by the surrogate code values written in the
source as u'\ud800' and u'\udc00'. Apparently the string literals are being
coerced to UTF-8 internally, which results in an invalid byte sequence when
the module containing this function is imported.
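
To illustrate why a lone surrogate is a problem for UTF-8 at all, here is a
small sketch. It assumes a current Python 3 interpreter and does not try to
reproduce what 2.2 does internally (that part is still my guess); the helper
name decodes_as_utf8 is made up for the example.

```python
# Illustration only, run on Python 3: a strict UTF-8 decoder rejects the
# 3-byte form of a lone surrogate, but accepts neighbouring valid sequences.

LONE_HIGH = b'\xed\xa0\x80'       # U+D800 squeezed into the 3-byte UTF-8 pattern
PRIVATE_USE = b'\xee\x80\x80'     # U+E000, the first code point after the surrogates
ASTRAL = b'\xf0\x90\x80\x80'      # U+10000, which UTF-16 writes as D800 DC00


def decodes_as_utf8(data):
    """Return True if the bytes are accepted by a strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


for name, data in [('lone d800', LONE_HIGH),
                   ('e000', PRIVATE_USE),
                   ('d800 dc00 pair', ASTRAL)]:
    print(name, decodes_as_utf8(data))
```

Only the lone d800 bytes are refused, which matches the pattern I see: the
valid sequences import fine and the lone surrogate does not.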

A simpler test case demonstrates the symptom:

C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py

Strangely, the error only shows up after the first import attempt appears to
succeed, and it never shows up if I run the file directly or enter the code
in the interactive interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid 
sequence.

In my function I can use "if u'\ud7ff' < i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case, because the code value just
below it, u'\udbff', is itself a lone surrogate. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.
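
For reference, the ord()-based fallback would look roughly like the sketch
below. The name chars_by_ord is made up for this example, it only handles
unicode input (no byte-string shortcut), and it walks the string by index
rather than with an iterator. The point is only that the module then contains
no lone-surrogate string literals.

```python
def chars_by_ord(s):
    """
    Sketch: yield the characters of a unicode string, keeping surrogate
    pairs together, by comparing ord() values against integers.
    On Python 2.2 the module would also need
    'from __future__ import generators' as its first statement.
    """
    n = len(s)
    pos = 0
    while pos < n:
        first = s[pos]
        code = ord(first)
        pos += 1
        if 0xD800 <= code < 0xDC00:
            # First half of a pair: a second half must follow.
            if pos >= n:
                raise ValueError("Bad pair: string ends after %r" % first)
            second = s[pos]
            pos += 1
            if 0xDC00 <= ord(second) < 0xE000:
                yield first + second
            else:
                raise ValueError("Bad pair: %r (bad second half)"
                                 % (first + second))
        elif 0xDC00 <= code < 0xE000:
            # Second half with no first half in front of it.
            raise ValueError("Bad pair: %r (no first half)" % first)
        else:
            yield first
```

Another possibility, which I have not tried on 2.2, might be to build the
boundary characters once at import time with unichr(0xd800) and
unichr(0xdc00) and keep the string comparisons, so that ord() is not called
per character.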

Can anyone tell me what's causing this, or point me to a reference to show 
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any 
release notes up through 2.3. Any other comments/suggestions (besides "stop 
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks 
:)

-Mike