UnicodeEncodeError while reading xml file (newbie question)

John Machin sjmachin at lexicon.net
Sat Jun 7 20:50:57 EDT 2008


On Jun 8, 10:12 am, nikosk <nikos.nikos.nikos.ni... at gmail.com> wrote:
> I just spent a whole day trying to read an xml file and I got stuck
> with the following error:
>
> Exception Type:         UnicodeEncodeError
> Exception Value:        'charmap' codec can't encode characters in position
> 164-167: character maps to <undefined>
> Exception Location:     C:\Python25\lib\encodings\cp1252.py in encode,
> line 12
>
> The string that could not be encoded/decoded was: H_C="����" A_C
>
> After some tests I can say with confidence that the error comes up
> when Python finds those Greek characters after H_C="
>
> The code that reads the file goes like this :
>
> from xml.etree import ElementTree as ET
>
> def read_xml(request):
>     data = open('live.xml', 'r').read()
>     data = data.decode('utf-8', 'replace')
>     data = ET.XML(data)
>
> I've tried all the combinations of str.decode and str.encode I could
> think of, but nothing worked.
>
> Can someone please help ?

Perhaps, with some more information:
(1) the *full* traceback
(2) what encoding is declared at the top of the XML file
(3) why you think you need to have "data.decode(.....)" at all
(4) why you think your input file is encoded in UTF-8 [which may be
answered by (2)]
(5) why you are using 'replace' (which would cover up, for a while,
any non-UTF-8 bytes in your file)
(6) what "those greek characters" *really* are -- after fiddling with
encodings in my browser the best I can make of that is four capital
gamma characters, each followed by a garbage byte or a '?'. Do
something like this, where before_pos and after_pos are byte offsets
that bracket the suspect text:

print repr(open('yourfile.xml', 'rb').read()[before_pos:after_pos])

(7) are you expecting non-ASCII characters after H_C= ? If so, which
characters? When you open your XML file in a browser, what do you see
there?
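
A minimal sketch of points (3) and (6), assuming a current Python 3
interpreter rather than the Python 2.5 shown in the traceback. The helper
names show_bytes and parse_raw, and the sample document, are hypothetical
and made up for illustration; they are not taken from the original poster's
code. The idea is to look at the raw bytes with repr() and to hand the
parser undecoded bytes so that it can apply the encoding named in the XML
declaration itself:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def show_bytes(path, start, end):
    """Return repr() of the raw bytes in [start, end) of the file."""
    with open(path, 'rb') as f:
        return repr(f.read()[start:end])


def parse_raw(path):
    """Parse the file from undecoded bytes, so the parser applies the
    encoding named in the XML declaration."""
    with open(path, 'rb') as f:
        return ET.fromstring(f.read())


# Hypothetical sample: H_C holds two capital gammas (U+0393), stored as UTF-8.
sample = ('<?xml version="1.0" encoding="utf-8"?>'
          '<row H_C="\u0393\u0393" A_C="x"/>').encode('utf-8')

# Write the sample to a temporary file standing in for the real input.
fd, demo_path = tempfile.mkstemp(suffix='.xml')
with os.fdopen(fd, 'wb') as f:
    f.write(sample)

# Byte offset of the attribute value, then a 4-byte window over it.
start = sample.index(b'H_C="') + len(b'H_C="')
raw_view = show_bytes(demo_path, start, start + 4)
root = parse_raw(demo_path)
os.remove(demo_path)

print(raw_view)                 # raw bytes, shown as escapes
print(ascii(root.get('H_C')))   # decoded text, printed ASCII-safe
```

Printing through repr() or ascii() keeps the output pure ASCII, so the
console's own codec (cp1252 in the traceback) never has to encode the
non-ASCII characters while you are diagnosing the file.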


