minidom and unicode errors

Tue Mar 7 01:21:16 EST 2006

On 3/7/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
>
> Abhimanyu Seth wrote:
>
> > I'm trying to parse and modify an XML document using the xml.dom.minidom
> > module and Python 2.4.2
> >
> > >> from xml.dom import minidom
> > >> dom = minidom.parse ("c:/test.txt")
> >
> > If the XML file contains a non-ASCII character, then I get a parse error.
> > I have the following line in my XML file:
> > <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack
> > lautet: %1.</target>
> > ExpatError: not well-formed (invalid token): line 8, column 27
> >
> > If I remove the ö character, then it works fine. I'm guessing this has to do
> > with the default encoding, which is ASCII. I guess I can change the encoding
> > by modifying a file on my machine that the interpreter reads while loading,
> > but then how do I get my program to work on different machines?
>
> the default encoding for XML is UTF-8.  If you're using any other encoding
> in your XML file, you have to specify that in the file itself, by putting an
> <?xml?> construct at the top of the file.  e.g.
>
>     <?xml version="1.0" encoding="ISO-8859-1"?>
>     ... rest of XML file follows ...
>
> > Also, while writing such a special character to the file, I get an error.
> > >> document.writexml (file (myFile, "w"), encoding='utf-8')
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
> > position 16: ordinal not in range(128)
>
> not sure; maybe you've added byte strings (encoded strings instead of Unicode
> strings) to the document, or maybe there's a bug in minidom.  What happens if
> you remove the encoding argument?  If you still get the same error after doing
> that, make sure you use only Unicode strings when you add stuff to the
> document.
>
> hope this helps!
>
> </F>
>
I've specified utf-8 in the xml header
<?xml version="1.0" encoding="utf-8"?>
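
One thing worth checking here: the declaration only labels the bytes, it does not convert them. Below is a minimal sketch of what happens when the label and the actual bytes disagree. It is written for a current Python 3 rather than 2.4, and the sample element is made up for illustration; it assumes the file was really saved as Latin-1.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample content, saved as Latin-1 (so the umlaut is the single byte 0xF6).
latin1_body = u'<target>L\xf6schen</target>'.encode('latin-1')

# Case 1: the header claims utf-8, but the bytes are Latin-1.
mislabelled = b'<?xml version="1.0" encoding="utf-8"?>' + latin1_body
try:
    minidom.parseString(mislabelled)
    mismatch_failed = False
except ExpatError:
    # The parser rejects the lone 0xF6 byte as not well-formed.
    mismatch_failed = True

# Case 2: the header matches the bytes, so the parser decodes them itself.
labelled = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
dom = minidom.parseString(labelled)
parsed_text = dom.documentElement.firstChild.data
```

If this is what is going on, either saving the file as real UTF-8 or declaring the encoding the editor actually used should be enough for the parse step.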

In writexml (), even without specifying the encoding, I get the same error.
That's why I tried manually specifying the encoding.

But I managed to find a workaround.
I got some clues from http://evanjones.ca/python-utf8.html

According to the site,

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file

should return me a unicode string. But I still get an error.
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410:
invalid data

I can't figure out why! Why can't it decode the ö character as UTF-8?
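
A possible explanation, assuming the file on disk is actually Latin-1 (the fact that reading it as latin-1 works suggests so): in Latin-1 the ö is the single byte 0xF6, and that byte is not valid UTF-8. A small sketch of this, in current Python 3 syntax and with a made-up sample word:

```python
sample = u'L\xf6schen'  # hypothetical sample text containing the umlaut

latin1_bytes = sample.encode('latin-1')  # umlaut stored as one byte, 0xF6
utf8_bytes = sample.encode('utf-8')      # umlaut stored as two bytes, 0xC3 0xB6

# Decoding the Latin-1 bytes as UTF-8 fails, like codecs.open(..., "utf-8") did.
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the encoding that was really used gives the text back.
round_trip = latin1_bytes.decode('latin-1')
```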

Anyway,
>> f = codecs.open ("c:/test.txt", "r", "latin-1")
>> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))

works. But I don't know whether this will work for Chinese or other Unicode
characters.
How do I make my code read Unicode files?
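
One approach that should not depend on the language of the text is to hand the parser the raw bytes and let it follow the encoding named in the XML declaration, instead of decoding the file by hand. A rough sketch in current Python 3, using a temporary file and invented sample content as stand-ins for the real document:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical document: Chinese and German text, genuinely saved as UTF-8.
content = u'<?xml version="1.0" encoding="utf-8"?><target>\u4e2d\u6587 L\xf6schen</target>'

fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(content.encode('utf-8'))

    # Open in binary mode: the parser reads the declaration and decodes the bytes.
    with open(path, 'rb') as f:
        dom = minidom.parse(f)
finally:
    os.remove(path)

text = dom.documentElement.firstChild.data
```

This only works when the declaration tells the truth about the bytes; a file saved as Latin-1 but labelled utf-8 would still fail.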

Also, while writing the xml file, I now use codecs.open ()
>> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")

IMHO, writexml should be taking care of this, instead of me having to use
codecs. I guess this is a bug.
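
For comparison, here is a sketch of the same write in a current Python 3, where (as far as I can tell) writexml hands text to the writer and the encoding argument only fills in the XML declaration, so the writer still has to do the encoding. The in-memory buffer is a stand-in for a real file opened with an explicit encoding, and the sample document is made up.

```python
import io
from xml.dom import minidom

# Hypothetical document containing the umlaut.
doc = minidom.parseString(u'<target>L\xf6schen</target>'.encode('utf-8'))

# A text writer that encodes to UTF-8; stands in for open(path, "w", encoding="utf-8").
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding='utf-8')

# encoding= puts encoding="utf-8" into the declaration; the writer encodes the text.
doc.writexml(writer, encoding='utf-8')
writer.flush()
data = buf.getvalue()
```

When the result is wanted as a byte string rather than in a file, doc.toxml(encoding="utf-8") is another option.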

--
Regards,
Abhimanyu