just a bug

Carsten Haese carsten at uniqsys.com
Fri May 25 08:42:47 EDT 2007


On Fri, 2007-05-25 at 04:03 -0700, sim.sim wrote: 
> my CDATA-section contains only symbols in the range specified for
> Char:
> Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
> 
> 
> filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

That test is meaningless. The specified range is for unicode characters,
and your iMessage is a byte string, presumably utf-8 encoded unicode, so
ord(x) only ever sees individual byte values between 0 and 255.
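
For comparison, here is a rough sketch of what a check against the Char
production might look like once the bytes have been decoded. It assumes
Python 3 (where bytes and text are separate types) and utf-8 input; the
names is_xml_char and invalid_xml_chars are made up for illustration.

```python
def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_xml_chars(data, encoding="utf-8"):
    """Decode a byte string and list the characters outside the Char ranges.

    Assumption: data is meant to be text in the given encoding. Malformed
    input raises UnicodeDecodeError before any range check happens.
    """
    text = data.decode(encoding)
    return [ch for ch in text if not is_xml_char(ch)]
```

The point is only that the range test applies to decoded characters, so
the decode step has to succeed first.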

Let's try decoding it:

>>> iMessage = "3c3f786d6c2076657273696f6e3d22312e30223f3e0a3c6d657373616\
... 7653e0a202020203c446174613e3c215b43444154415bd094d0b0d0bdd0bdd18bd0b5\
... 20d0bfd0bed0bfd183d0bbd18fd180d0bdd18bd18520d0b7d0b0d0bfd180d0bed181d\
... 0bed0b220d0bcd0bed0b6d0bdd0be20d183d187d0b8d182d18bd0b2d0b0d182d18c20\
... d0bfd180d0b820d181d0bed0b1d181d182d0b2d0b5d0bdd0bdd18bd18520d180d0b5d\
... 0bad0bbd0b0d0bcd0bdd15d5d3e3c2f446174613e0a3c2f6d6573736167653e0a0a".\
... decode('hex')
>>> iMessage.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 176-177: invalid data
>>> iMessage[176:178]
'\xd1]'

And that's your problem. In general you can't just truncate a utf-8
encoded string anywhere and expect the result to be valid utf-8. The
\xd1 at the very end of your CDATA section is the first byte of a
two-byte sequence that represents some unicode code point between \u0440
and \u047f, but it's missing the second byte that says which one.
Instead, it is followed directly by the ']' that begins the ]]>
terminator.
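
To see where that range comes from, here is a small sketch of the bit
arithmetic, assuming Python 3. A two-byte utf-8 sequence has the form
110xxxxx 10yyyyyy, so the lead byte fixes the top five bits of the code
point. The helper name two_byte_range is made up for illustration.

```python
def two_byte_range(lead):
    """Return (lowest, highest) code point a two-byte utf-8 lead byte can start.

    Valid two-byte lead bytes are 0xC2..0xDF; 0xC0 and 0xC1 would only
    produce overlong encodings and are rejected here.
    """
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError("not a two-byte utf-8 lead byte: %#x" % lead)
    high_bits = lead & 0x1F          # the xxxxx part of 110xxxxx
    lowest = high_bits << 6          # continuation bits all zero
    highest = lowest | 0x3F          # continuation bits all one
    return lowest, highest


low, high = two_byte_range(0xD1)
print(hex(low), hex(high))
```

A continuation byte must look like 10yyyyyy (0x80 through 0xBF), and ']'
is 0x5d, which is why the decoder gives up at that position.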

Whatever you're using to generate this data needs to be smarter about
splitting the unicode string. Rather than encoding and then splitting,
it needs to split first and then encode, or take some other measures to
make sure that it doesn't leave incomplete multibyte sequences at the
end.
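
Both options could be sketched roughly as follows. This assumes Python 3
and utf-8, and the function names split_then_encode and truncate_utf8 are
made up for illustration; I don't know what your generating code actually
looks like.

```python
def split_then_encode(text, max_chars, encoding="utf-8"):
    """Split a unicode string into pieces of at most max_chars characters,
    then encode each piece separately."""
    return [
        text[i:i + max_chars].encode(encoding)
        for i in range(0, len(text), max_chars)
    ]


def truncate_utf8(data, max_bytes):
    """Cut already-encoded utf-8 to at most max_bytes without leaving an
    incomplete multibyte sequence at the end.

    Assumption: data is valid utf-8 to begin with.
    """
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character,
    # so back up until it lands on the start of one.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]
```

With the first function every piece decodes on its own, since each one
was encoded from whole characters. The second is for the case where only
the encoded bytes are available and a byte limit has to be respected.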

HTH,

-- 
Carsten Haese
http://informixdb.sourceforge.net




