xml.dom.minidom character encoding

Stefan Behnel stefan_ml at behnel.de
Thu Apr 22 01:48:00 EDT 2010


C. Benson Manica, 21.04.2010 19:19:
> I have the following simple script running on 2.5.2 on a machine where
> the default character encoding is "ascii":
>
> #!/usr/bin/env python
> #coding: utf-8
>
> import xml.dom.minidom
> import codecs
>
> str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
> doc=xml.dom.minidom.parseString( str )
> xml=doc.toxml( encoding="utf-8" )
> file=codecs.open( "foo.xml", "w", "utf-8" )
> file.write( xml )
> file.close()

You are trying to re-encode an already encoded string here.
toxml(encoding="utf-8") returns a byte string. The file object returned by
codecs.open() expects unicode input and does the encoding itself, so when
you pass it a byte string, it fails to re-encode the already encoded data.
In Python 2.x, this shows up as a bizarre error (a UnicodeDecodeError from
the implicit ASCII decoding of the byte string, raised while you are
writing). Python 3 gives an understandable one.
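To make the failure mode concrete, here is a minimal sketch (not taken from
the original script) that checks what toxml() returns and then performs by
hand the ASCII decoding that Python 2 attempts implicitly before it
re-encodes. The variable names are illustrative only, and the input is
passed to the parser as UTF-8 bytes, which is an assumption of this sketch.

```python
import xml.dom.minidom

# Illustrative input: the same element as in the question, as UTF-8 bytes.
source = u'<elements><elem attrib="\xf3"/></elements>'.encode("utf-8")
doc = xml.dom.minidom.parseString(source)

# With an explicit encoding, toxml() returns an encoded byte string.
encoded = doc.toxml(encoding="utf-8")
is_byte_string = isinstance(encoded, bytes)

# Python 2 would implicitly decode this byte string with the default
# "ascii" codec before encoding it again. Doing the same step explicitly
# shows why that cannot work once a non-ASCII character is present.
try:
    encoded.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True
```

Both flags end up True: the data is already bytes, and the non-ASCII
character in the attribute value cannot pass through an ASCII decode.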

So the right solution is to let toxml() do the encoding and drop the use of 
codecs.open() in favour of

     f = open("foo.xml", "wb")

(mind the 'b' in the file mode, which stands for 'bytes' or 'binary')
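Put together, the whole thing could look roughly like the sketch below. It
assumes the input is given to the parser as UTF-8 bytes and writes to a
temporary directory rather than the current one. The `path` variable and the
use of tempfile are my additions for illustration, and the `with` statement
needs Python 2.6 or later (use try/finally on 2.5).

```python
import os
import tempfile
import xml.dom.minidom

# Illustrative input document, encoded to UTF-8 bytes before parsing.
source = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<elements><elem attrib="\xf3"/></elements>').encode("utf-8")
doc = xml.dom.minidom.parseString(source)

# Let toxml() do the encoding; the result is a byte string.
data = doc.toxml(encoding="utf-8")

# Hypothetical output location; the original script wrote to "foo.xml".
path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# Binary mode: the bytes are written as-is, with no second encoding step.
with open(path, "wb") as f:
    f.write(data)
```

The file then contains exactly the bytes that toxml() produced.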

Stefan



