SAXParseException: not well-formed (invalid token)

Carsten Haese carsten at uniqsys.com
Thu Aug 30 09:47:21 EDT 2007


On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:
> 	Hi Stefan,
> 
> 	The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8" 
> ?>).

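It's possible that the encoding declaration is incorrect, i.e. the file
claims UTF-8 but was actually saved in another encoding. Compare how the
same character comes out in each: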

>>> u = u"\N{LATIN SMALL LETTER E WITH ACUTE}"
>>> print repr(u.encode("latin-1"))
'\xe9'
>>> print repr(u.encode("utf-8"))
'\xc3\xa9'

If your input string contains the single byte 0xe9 where your accented e
is, the file is most likely latin-1 encoded, not UTF-8. If it contains
the two-byte sequence 0xc3 0xa9, it is UTF-8 encoded.
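
A rough way to automate that check is to try decoding the raw bytes as
UTF-8 and see whether it fails. The sketch below is written for Python 3
(unlike the Python 2 session above), and the name guess_encoding is made
up for illustration. Note that decoding as latin-1 never fails, so
"latin-1" here is only a fallback guess, not a proof.

```python
def guess_encoding(data):
    """Return "utf-8" if the bytes decode cleanly as UTF-8.

    Otherwise fall back to "latin-1". This is a guess: any byte string
    decodes as latin-1, so other single-byte encodings look the same.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


# The lone byte 0xe9 is not valid UTF-8; the pair 0xc3 0xa9 is.
latin1_guess = guess_encoding(b"caf\xe9")
utf8_guess = guess_encoding(b"caf\xc3\xa9")
```

Pure ASCII input also decodes cleanly as UTF-8, so it is reported as
"utf-8", which is harmless since ASCII is a subset of both encodings.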

If the string is encoded in latin-1, you can transcode it to utf-8 like
this:

contents = contents.decode("latin-1").encode("utf-8")
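
To see the effect end to end, here is a small sketch in Python 3 syntax
(the one-liner above is Python 2). It assumes the document is parsed with
xml.sax.parseString; the TextCollector handler, the parse_text helper,
and the sample document are all invented for the example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them.
        self.chunks.append(content)


def parse_text(data):
    """Parse XML bytes and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


declaration = b'<?xml version="1.0" encoding="UTF-8"?>'
# Declared as UTF-8, but the accented e is the single latin-1 byte 0xe9.
latin1_doc = declaration + b"<word>caf\xe9</word>"

try:
    parse_text(latin1_doc)
    failed = False
except xml.sax.SAXParseException:
    failed = True

# Transcode so the bytes match the declaration, then parse again.
fixed_doc = latin1_doc.decode("latin-1").encode("utf-8")
text = parse_text(fixed_doc)
```

The other way to fix the mismatch is to leave the bytes alone and change
the declaration to say encoding="ISO-8859-1" instead.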

HTH,

-- 
Carsten Haese
http://informixdb.sourceforge.net




