just a bug

Maksim Kasimov maksim.kasimov at gmail.com
Fri May 25 10:11:02 EDT 2007


Carsten Haese:

> On Fri, 2007-05-25 at 04:03 -0700, sim.sim wrote: 
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 176-177: invalid data
> >>> iMessage[176:178]
> '\xd1]'
> 
> And that's your problem. In general you can't just truncate a utf-8
> encoded string anywhere and expect the result to be valid utf-8. The
> \xd1 at the very end of your CDATA section is the first byte of a
> two-byte sequence that represents some unicode code-point between \u0440
> and \u047f, but it's missing the second byte that says which one.
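
The failure described in the quoted text can be reproduced in a few lines. This is only a sketch under stated assumptions: it uses Python 3 bytes/str (the thread itself dates from Python 2), and the sample text and the cut position are invented for illustration; they are not the original iMessage data.

```python
# Sketch only: sample text and cut position are made up for illustration.
text = "привет"                  # six Cyrillic letters, two bytes each in UTF-8
data = text.encode("utf-8")      # 12 bytes
truncated = data[:3]             # ends with 0xD1, the lead byte of the second letter

try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc                  # the lone lead byte cannot be decoded

print(repr(truncated[-1:]), error)

# The lead byte 0xD1 followed by any continuation byte 0x80..0xBF
# covers exactly the code points U+0440 through U+047F.
first = bytes([0xD1, 0x80]).decode("utf-8")
last = bytes([0xD1, 0xBF]).decode("utf-8")
print(hex(ord(first)), hex(ord(last)))
```

The second half of the sketch checks the code-point range quoted above: the trailing byte alone does not say which character in that range was meant.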


In my previous message I already explained that this situation is common with
memory-limited devices, such as mobile terminals from Nokia, Sony Ericsson, Siemens and so on.

I also pointed out that it is one part of a split string.

Split content is a _standard_ in the mobile world. It is well described at http://www.openmobilealliance.org
and does _not_ contradict the XML spec.


The problem is that pyexpat works _improperly_.

> 
> Whatever you're using to generate this data needs to be smarter about
> splitting the unicode string. Rather than encoding and then splitting,
> it needs to split first and then encode, or take some other measures to
> make sure that it doesn't leave incomplete multibyte sequences at the
> end.
> 
> HTH,
> 
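
The two remedies suggested in the quoted text (never cut inside a multibyte sequence on the sending side, or take other measures on the receiving side) could look roughly like this. It is a sketch, not anyone's actual implementation: split_utf8 and join_chunks are hypothetical helper names, Python 3 is assumed, and the 5-byte limit is only there to make the effect visible.

```python
import codecs

def split_utf8(text, max_bytes):
    """Encode text and yield chunks of at most max_bytes bytes,
    never cutting inside a multibyte UTF-8 sequence (hypothetical helper)."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4")
    data = text.encode("utf-8")
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # If the byte right after the cut is a continuation byte (0b10xxxxxx),
        # the cut is inside a character, so move it back to a boundary.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        yield data[start:end]
        start = end

def join_chunks(chunks):
    """Decode byte chunks that may be cut anywhere, buffering incomplete
    sequences between chunks (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

sample = "привет мир"
safe_chunks = list(split_utf8(sample, 5))
# Every safely split chunk decodes on its own.
decoded_parts = [chunk.decode("utf-8") for chunk in safe_chunks]

raw = sample.encode("utf-8")
# A naive split every 5 bytes cuts characters, but the incremental
# decoder still reassembles the text on the receiving side.
naive_chunks = [raw[i:i + 5] for i in range(0, len(raw), 5)]
rejoined = join_chunks(naive_chunks)
```

Which side should carry the burden is exactly what is disputed in this thread; the sketch only shows that both approaches are short to write.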


-- 
Maksim Kasimov


