Mysterious xml.sax Encoding Exception

John Machin sjmachin at lexicon.net
Sat Feb 2 04:19:59 EST 2008


On Feb 2, 8:12 am, JKPeck <JKP... at gmail.com> wrote:
> On Feb 1, 1:51 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>
> > > They sent me the actual file, which was created on Windows,  as an
> > > email attachment.  They had also sent the actual dataset from which
> > > the XML was generated so that I could generate it myself using the
> > > same version of our app as the user has.  I did that but did not get
> > > an exception.
>
> > So are you sure you open the file in binary mode on Windows?
>
> > Regards,
> > Martin
>
> In the real case, the xml never goes through a file but is handed
> directly to the parser.  The API returns a Python Unicode string
> (utf-16).

A Python unicode object is *NOT* the UTF-16 that the SAX parser is
expecting. It is expecting a str object which is Unicode text encoded
as UTF-16.

>>> unicode_obj = u'abcde'
>>> str_obj = unicode_obj.encode('UTF-16')
>>> print repr(unicode_obj)
u'abcde'
>>> print repr(str_obj)
'\xff\xfea\x00b\x00c\x00d\x00e\x00'
>>>

At the end of this post is code using a str object (works) then
attempting to use a unicode object (reproduces your error message).

> For the file the user sent, if I open it in binary mode, it
> still has a BOM; otherwise the BOM is removed.  But either version
> works on my system.
>
> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

If the real case doesn't involve a file, I can't imagine what you can
infer from a file that isn't used [strike 1] sent to you by a user
[strike 2].

Consider trapping the exception, writing
repr(the_xml_document_string[:80]) to the log file, and re-raising the
exception. Get the user to run the app, then inspect the log file.
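
A minimal sketch of that suggestion, in case it helps. The function name
parse_logged and the logger name "xml_debug" are made up for illustration;
only xml.sax.parseString and the standard logging module are real. It
avoids Python 2-only syntax, so it should behave the same on later
versions.

```python
import logging
import xml.sax

# Hypothetical logger name; use whatever logger the app already has.
log = logging.getLogger("xml_debug")

def parse_logged(the_xml_document_string, handler):
    """Parse the document; on a SAX error, log how it starts, then re-raise."""
    try:
        xml.sax.parseString(the_xml_document_string, handler)
    except xml.sax.SAXParseException:
        # repr() makes the object's type prefix and any BOM bytes visible.
        log.error("SAX parse failed; document starts with %s",
                  repr(the_xml_document_string[:80]))
        raise  # the caller still sees the original exception
```

The repr of the first 80 characters is enough to tell whether the parser
was handed a unicode object or an encoded string, and whether a BOM is
present.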

Here's the promised code and results.

C:\junk>type utf16sax.py
import xml.sax, xml.sax.saxutils
import cStringIO
asciistr = 'qwertyuiop'
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
unicode_doc = (xml_template % ('UTF-16', asciistr)).decode('ascii')
utf16_doc = unicode_doc.encode('UTF-16')
for doc in (utf16_doc, unicode_doc):
    print
    print 'doc = ', repr(doc)
    print
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    start = result.find('<data>') + 6
    end = result.find('</data>')
    mydata = result[start:end]
    print "SAX output (UTF-8): %r" % mydata


C:\junk>utf16sax.py

doc =  '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x001\x006\x00"\x00?\x00>\x00<\x00d\x00a\x00t\x00a\x00>\x00q\x00w\x00e\x00r\x00t\x00y\x00u\x00i\x00o\x00p\x00<\x00/\x00d\x00a\x00t\x00a\x00>\x00'

SAX output (UTF-8): 'qwertyuiop'

doc =  u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'

Traceback (most recent call last):
  File "C:\junk\utf16sax.py", line 13, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding specified in XML declaration is incorrect

My guess at what is happening: the unicode object is coerced to str
using the default encoding (ascii). The parser then reads the result,
finds "UTF-16" in the declaration, tries to decode the ASCII bytes as
UTF-16, fails, and complains.
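
If that guess is right, the workaround is to encode the unicode object
yourself before handing it to the parser, so the bytes match the
declaration. A small sketch under that assumption follows. TextCollector
and parse_unicode_as_utf16 are names invented here, not part of xml.sax.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Illustrative handler that just gathers character data."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def parse_unicode_as_utf16(unicode_doc, handler):
    # Encode first: the parser then receives UTF-16 bytes (with a BOM),
    # which is what the encoding="UTF-16" declaration promises.
    xml.sax.parseString(unicode_doc.encode('UTF-16'), handler)

doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'
collector = TextCollector()
parse_unicode_as_utf16(doc, collector)
text = u''.join(collector.chunks)
```

This only helps when the declaration really says UTF-16. If the document
declares some other encoding, encode to that one instead.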

HTH,
John


