minidom and unicode errors

Tue Mar 7 01:04:53 EST 2006

Abhimanyu Seth wrote:

> I'm trying to parse and modify an XML document using xml.dom.minidom module
> and Python 2.4.2
>
> >>> from xml.dom import minidom
> >>> dom = minidom.parse("c:/test.txt")
>
> If the XML file contains a non-ASCII character, then I get a parse error.
> I have the following line in my xml file:
> <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack
> lautet: %1.</target>
> ExpatError: not well-formed (invalid token): line 8, column 27
>
> If I remove the ö character, then it works fine. I'm guessing this has to do
> with the default encoding, which is ASCII. I guess I can change the encoding
> by modifying a file on my machine that the interpreter reads while loading,
> but then how do I get my program to work on different machines?

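the default encoding for XML is UTF-8 (or UTF-16, if the file starts with a
byte order mark); Python's default encoding doesn't matter here.  your file is
probably stored in a single-byte encoding such as ISO-8859-1, in which case
the byte used for ö isn't valid UTF-8 and the parser rejects it.  If you're
using any other encoding in your XML file, you have to specify that in the
file itself, by putting an <?xml?> declaration at the very top of the file.
e.g.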

    <?xml version="1.0" encoding="ISO-8859-1"?>
    ... rest of XML file follows ...
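
to see the effect of the declaration, here's a minimal sketch.  It assumes a
current Python 3 interpreter rather than the 2.4.2 mentioned above, and it
builds the document in memory instead of reading c:/test.txt; the sample text
and variable names are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Illustrative content, stored as ISO-8859-1 bytes (the "ö" becomes byte 0xF6).
text = u"Exception beim L\u00f6schen"
latin1_body = (u"<target>" + text + u"</target>").encode("iso-8859-1")

# Without a declaration the parser assumes UTF-8, and 0xF6 followed by
# plain ASCII is not a valid UTF-8 sequence, so parsing fails.
try:
    minidom.parseString(latin1_body)
    undeclared_ok = True
except ExpatError:
    undeclared_ok = False

# With the declaration the same bytes are decoded as ISO-8859-1.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + latin1_body
dom = minidom.parseString(declared)
parsed_text = dom.documentElement.firstChild.data
```

the declared encoding must match the encoding the file was really saved in;
if the editor saved the file as UTF-8, no declaration is needed at all.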

> Also, while writing such a special character to the file, I get an error.
> >>> document.writexml(file(myFile, "w"), encoding='utf-8')
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position
> 16: ordinal not in range(128)

not sure; maybe you've added byte strings (encoded strings instead of Unicode
strings) to the document, or maybe there's a bug in minidom.  What happens if
you remove the encoding argument?  If you still get the same error after doing
that, make sure you use only Unicode strings when you add stuff to the document.
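
for what it's worth, as far as I can tell the encoding argument to writexml
only controls what goes into the <?xml?> declaration; the characters still
have to be encoded by whatever object you write to.  A rough sketch of two
ways to do that follows.  It assumes a current Python 3 interpreter, writes
to an in-memory buffer instead of a real file, and uses a made-up element
name and text; under Python 2.4, passing codecs.open(myFile, "w", "utf-8")
as the writer should play the same role as the stream writer used here.

```python
import codecs
import io
from xml.dom import minidom

# Build a small document using Unicode strings only (illustrative content).
doc = minidom.Document()
root = doc.createElement("target")
root.appendChild(doc.createTextNode(u"L\u00f6schen"))
doc.appendChild(root)

# Option 1: let minidom do the encoding and hand back bytes.
xml_bytes = doc.toxml(encoding="utf-8")

# Option 2: give writexml a writer that encodes to UTF-8 itself.
# The encoding argument here only fills in the XML declaration.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
doc.writexml(writer, encoding="utf-8")
written = buffer.getvalue()

# Round trip: the bytes written by option 2 parse back to the same text.
round_trip = minidom.parseString(written).documentElement.firstChild.data
```

either way the output starts with a declaration naming utf-8, so the file can
be parsed again without touching any interpreter-wide default encoding.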

hope this helps!

</F>