minidom and unicode errors

Tue Mar 7 01:24:39 EST 2006

On 3/7/06, Abhimanyu Seth <abhimanyu.seth at gmail.com> wrote:
>
> On 3/7/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
>
> > Abhimanyu Seth wrote:
>
> > I'm trying to parse and modify an XML document using the xml.dom.minidom
> > module and Python 2.4.2
> >
> > >> from xml.dom import minidom
> > >> dom = minidom.parse ("c:/test.txt")
> >
> > If the XML file contains a non-ASCII character, then I get a parse error.
> > I have the following line in my XML file:
> > <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack lautet: %1.</target>
> > ExpatError: not well-formed (invalid token): line 8, column 27
> >
> > If I remove the ö character, then it works fine. I'm guessing this has to
> > do with the default encoding, which is ascii. I guess I can change the
> > encoding by modifying a file on my machine that the interpreter reads
> > while loading, but then how do I get my program to work on different
> > machines?
>
> the default encoding for XML is UTF-8.  If you're using any other encoding
> in your XML file, you have to specify that in the file itself, by putting
> an <?xml?> construct at the top of the file.  e.g.
>
>     <?xml version="1.0" encoding="ISO-8859-1"?>
>     ... rest of XML file follows ...
>
> > Also, while writing such a special character to the file, I get an error.
> > >> document.writexml (file (myFile, "w"), encoding='utf-8')
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 16: ordinal not in range(128)
>
> not sure; maybe you've added byte strings (encoded strings instead of
> Unicode strings) to the document, or maybe there's a bug in minidom.  What
> happens if you remove the encoding argument?  If you still get the same
> error after doing that, make sure you use only Unicode strings when you
> add stuff to the document.
>
> hope this helps!
>
> </F>
>
> I've specified utf-8 in the xml header
> <?xml version="1.0" encoding="utf-8"?>
>
> In writexml (), even without specifying the encoding, I get the same
> error. That's why I tried manually specifying the encoding.
>
> But I managed to find a workaround.
> I got some clues from http://evanjones.ca/python-utf8.html
>
> According to the site,
>
> import codecs
> fileObj = codecs.open( "someFile", "r", "utf-8" )
> u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
>
> should return me a unicode string. But I still get an error.
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410:
> invalid data
>
> I can't figure out why! Why can't it decode the ö character as UTF-8?
>
> Anyway,
> >> f = codecs.open ("c:/test.txt", "r", "latin-1")
> >> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))
>
> works. But then I dunno if this will work for Chinese or other non-Latin-1
> characters.
> How do I make my code read Unicode files?
>
> Also, while writing the xml file, I now use codecs.open ()
> >> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")
>
> IMHO, writexml should be taking care of this, instead of me having to use
> codecs. I guess this is a bug.
>
> --
> Regards,
> Abhimanyu
>

Actually, it doesn't work. I don't get any errors, but it doesn't write the
special characters. It's converted them to some gibberish.
ö has become Ã¶.
Now I'm stumped!
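
One possible reading of the Ã¶ result, offered as a guess rather than a diagnosis: Ã¶ is exactly what the two UTF-8 bytes for ö look like when something displays or decodes them as Latin-1. The sketch below is Python 3, not the Python 2.4 used in this thread, and the variable names are made up for illustration. It only demonstrates the byte-level effect; it does not establish where in the program or editor the misreading happens.

```python
# Python 3 sketch: how UTF-8 bytes look when they are misread as Latin-1.
original = "L\u00f6schen"                # "Löschen"

utf8_bytes = original.encode("utf-8")    # ö becomes the two bytes C3 B6
misread = utf8_bytes.decode("latin-1")   # each byte shown as one Latin-1 character

# Undo the misreading: turn the characters back into bytes, then decode as UTF-8.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)
print(repaired)
```

If the output file shows Ã¶ in an editor but the bytes on disk are C3 B6, the file may already be valid UTF-8 and only the viewer's encoding setting is off.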

--
Regards,
Abhimanyu
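
For reference, a minimal sketch of the read/write round trip discussed in this thread, assuming the input really is Latin-1 and says so in its XML declaration. It is written for Python 3, where minidom behaves differently from 2.4 (toxml with an encoding argument returns bytes). The sample document and the helper name parses_ok are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample data: the same element stored as Latin-1 bytes,
# once with a matching XML declaration and once without any declaration.
body = "<target>L\u00f6schen</target>".encode("latin-1")
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
undeclared = body


def parses_ok(data):
    """Return True if minidom can parse the given bytes."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# Without a declaration the parser assumes UTF-8, and a lone F6 byte is invalid.
print(parses_ok(undeclared))
print(parses_ok(declared))

dom = minidom.parseString(declared)
text = dom.documentElement.firstChild.data   # decoded text of the element

# Serialize to UTF-8 bytes; these would be written to a file opened in binary mode.
utf8_out = dom.toxml(encoding="utf-8")
```

Passing bytes to the parser and letting the declaration name the encoding avoids guessing a codec in the program, so the same code would also handle a file declared as UTF-8.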