SAX unicode and ascii parsing problem

Adam Tauno Williams awilliam at whitemice.org
Wed Dec 1 08:33:28 EST 2010


On Tue, 2010-11-30 at 12:28 -0800, goldtech wrote: 
> I'm trying to parse an xml file using SAX. About half-way through a
> file I get this error:
> Traceback (most recent call last):
>   File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework
> \scriptutils.py", line 325, in RunScript
>     exec codeObject in __main__.__dict__
>   File "E:\sc\b2.py", line 58, in <module>
>     parser.parse(open(r'ppb5.xml'))
>   File "C:\Python26\Lib\xml\sax\expatreader.py", line 107, in parse
>     xmlreader.IncrementalParser.parse(self, source)
>   File "C:\Python26\Lib\xml\sax\xmlreader.py", line 123, in parse
>     self.feed(buffer)
>   File "C:\Python26\Lib\xml\sax\expatreader.py", line 207, in feed
>     self._parser.Parse(data, isFinal)
>   File "C:\Python26\Lib\xml\sax\expatreader.py", line 304, in
> end_element
>     self._cont_handler.endElement(name)
>   File "E:\sc\b2.py", line 51, in endElement
>     d.write(csv+"\n")
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 146-147: ordinal not in range(128)

Catch the UnicodeEncodeError exception and display the value of csv.
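
Something along these lines would do it. This is only a sketch: write_row
is a made-up helper name, and because I don't know what "d" really is, an
ASCII-only codecs writer over an in-memory buffer stands in for it here.

```python
import codecs
import io

def write_row(d, csv):
    """Write csv plus a newline to d; report the value if it cannot be encoded."""
    line = csv + u"\n"
    try:
        d.write(line)
    except UnicodeEncodeError as exc:
        # unicode_escape yields pure-ASCII bytes, so printing the report
        # cannot itself trigger another encoding error.
        print("could not encode: %r" % (line.encode("unicode_escape"),))
        bad = exc.object[exc.start:exc.end]
        print("offending characters: %r" % (bad.encode("unicode_escape"),))
        return False
    return True

# Hypothetical stand-in for d: an ASCII-only writer over an in-memory buffer.
d = codecs.getwriter("ascii")(io.BytesIO())
ok_plain = write_row(d, u"a,b,c")          # plain ASCII, written normally
ok_accent = write_row(d, u"caf\xe9,b,c")   # contains U+00E9, cannot be encoded
```

The escaped output shows exactly which characters in csv fall outside ASCII,
which in turn says something about the encoding of the source data.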

Are you certain the error isn't actually in your data?  What encoding is
the source data?

What is "d"?  A file object?  Is it in binary mode, or is it StringIO,
or a codec?

> I'm using ActivePython 2.6. I'm trying to figure out the simplest fix.
> If there's a Python way to just take the source XML file and convert/
> process it so this will not happen - that would be best. Or should I
> just update to Python 3?
> I tried this but nothing changed. I thought this might convert it and
> then I'd parse the new file - didn't work:
> u = open(r'E:\sc\ppb4.xml').read().decode('utf8')
> ascii = uc.decode('ascii')
> mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
> mex9.write(ascii)
> Again, I'm looking for something simple even if it's a few more lines of
> code...or upgrade(?)

If the input data contains characters that cannot be represented in
ASCII, simply decoding the stream as ASCII (a) won't fix it and (b)
should raise an exception.
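
To illustrate both points, here is a small sketch. The sample bytes are
invented, and writing the output back out as UTF-8 is my assumption about
what you would want, not something your data confirms.

```python
import io

# Invented sample: UTF-8 bytes for "cafe" with an accented e (U+00E9).
raw = u"caf\xe9".encode("utf-8")

try:
    raw.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    # Bytes >= 0x80 are not valid ASCII, so this branch is taken.
    decoded_ok = False

# Decode with the encoding the data is actually in (assumed UTF-8 here) ...
text = raw.decode("utf-8")

# ... and encode explicitly on the way out instead of relying on the
# implicit ASCII default.
out = io.BytesIO()
out.write((text + u"\n").encode("utf-8"))
```

With a real file, opening it with codecs.open(path, 'w', encoding='utf-8')
gives you an object whose write() does that encoding step for you.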




