minidom and unicode errors
Abhimanyu Seth
abhimanyu.seth at gmail.com
Tue Mar 7 01:21:16 EST 2006
On 3/7/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
>
> Abhimanyu Seth wrote:
>
> > I'm trying to parse and modify an XML document using the xml.dom.minidom
> > module and Python 2.4.2
> >
> > >> from xml.dom import minidom
> > >> dom = minidom.parse ("c:/test.txt")
> >
> > If the xml file contains a non-ASCII character, then I get a parse error.
> > I have the following line in my xml file:
> > <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack
> > lautet: %1.</target>
> > ExpatError: not well-formed (invalid token): line 8, column 27
> >
> > If I remove the ö character, then it works fine. I'm guessing this has to do
> > with the default encoding, which is ascii. I guess I can change the encoding
> > by modifying a file on my machine that the interpreter reads while loading,
> > but then how do I get my program to work on different machines?
>
> the default encoding for XML is UTF-8. If you're using any other encoding
> in your XML file, you have to specify that in the file itself, by putting an
> <?xml?> construct at the top of the file. e.g.
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
> ... rest of XML file follows ...
>
> > Also, while writing such a special character to the file, I get an error.
> > >> document.writexml (file (myFile, "w"), encoding='utf-8')
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position
> > 16: ordinal not in range(128)
>
> not sure; maybe you've added byte strings (encoded strings instead of Unicode
> strings) to the document, or maybe there's a bug in minidom. What happens if
> you remove the encoding argument? If you still get the same error after doing
> that, make sure you use only Unicode strings when you add stuff to the
> document.
>
> hope this helps!
>
> </F>
>
I've specified utf-8 in the xml header
<?xml version="1.0" encoding="utf-8"?>
In writexml (), even without specifying the encoding, I get the same error.
That's why I tried manually specifying the encoding.
But I managed to find a workaround.
I got some clues from http://evanjones.ca/python-utf8.html
According to the site,
import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
should return a unicode string. But I still get an error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410:
invalid data
I can't figure out why! Why can't it decode the ö character as Unicode?
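
One possible explanation, offered as a guess from the symptoms rather than something confirmed for this file: the file may be saved as Latin-1 (or a similar 8-bit encoding) even though its header says utf-8. In that case the ö is stored as a single byte that is not valid UTF-8. The minimal sketch below assumes a current Python 3 interpreter rather than 2.4.2, and the sample word is taken from the <target> line quoted earlier.

```python
# Assumption: Python 3. The sample text is one word from the <target> element.
text = "Löschen"

latin1_bytes = text.encode("latin-1")  # ö becomes the single byte 0xF6
utf8_bytes = text.encode("utf-8")      # ö becomes the two bytes 0xC3 0xB6

try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # A lone 0xF6 byte is not a valid UTF-8 sequence, so decoding fails.
    decoded_ok = False

print(repr(latin1_bytes), repr(utf8_bytes), decoded_ok)
```

If this is what is happening, re-saving the file as real UTF-8, or changing the declared encoding to match the bytes on disk, should remove the need for any manual re-encoding.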
Anyway,
>> f = codecs.open ("c:/test.txt", "r", "latin-1")
>> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))
works. But I don't know whether this will work for Chinese or other Unicode
characters.
How do I make my code read Unicode files?
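
A hedged sketch of one way to approach this, again assuming Python 3: hand the parser the raw bytes and let it apply whatever encoding the XML declaration names, instead of decoding the file by hand first. The helper name first_text and both sample documents are hypothetical, made up for illustration.

```python
from xml.dom import minidom

def first_text(xml_bytes):
    """Parse raw XML bytes and return the text inside the root element."""
    dom = minidom.parseString(xml_bytes)
    return dom.documentElement.firstChild.data

# Hypothetical sample documents; each declaration matches the actual bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><target>Löschen</target>'
).encode("latin-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><target>删除</target>'
).encode("utf-8")

latin1_text = first_text(latin1_doc)
utf8_text = first_text(utf8_doc)
print(latin1_text, utf8_text)
```

The same idea applies to files: minidom.parse accepts a filename or a file object, so a file opened in binary mode can be passed straight through as long as its declaration is truthful about the encoding.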
Also, while writing the xml file, I now use codecs.open ()
>> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")
IMHO, writexml should be taking care of this, instead of me having to use
codecs. I guess this is a bug.
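
For comparison, a minimal sketch of the write side under Python 3 (an assumption; the behaviour in 2.4.2 differs). As far as the minidom documentation describes it, the encoding argument of writexml only sets the encoding named in the XML header, and the file object is what actually encodes the characters. The temporary output path is hypothetical.

```python
import os
import tempfile
from xml.dom import minidom

# Build a small document containing a non-ASCII character.
doc = minidom.parseString("<target>Löschen</target>".encode("utf-8"))

# Hypothetical output location for this sketch.
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# The file object encodes the text; encoding= fills in the XML declaration.
with open(path, "w", encoding="utf-8") as out:
    doc.writexml(out, encoding="utf-8")

# Read the bytes back and parse them again to check the round trip.
with open(path, "rb") as src:
    written = src.read()
round_trip = minidom.parseString(written).documentElement.firstChild.data
print(round_trip)
```

Keeping the file's encoding and the declared encoding identical is the important part; if they disagree, the result is the same kind of mismatch that caused the parse error when reading.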
--
Regards,
Abhimanyu