codecs, Swedish characters, and XML...don't mix? (repost)
Michael Hammill
mike at pdc.kth.se
Fri May 11 11:12:20 EDT 2001
Dear Andrew,
You were right. By breaking the line up as you suggested, I found that the
error was in the writing part, not explicitly in the dom.toxml(). I now
have a UTF-8 output XML file, but I have run into a new problem, which you
foresaw: namely, .toxml() gives me an XML header of
<?xml version="1.0" ?>, whereas it should be <?xml version="1.0"
encoding="UTF-8" ?>. Additionally, I would like to put back the
<!DOCTYPE> line that minidom stripped out. It contains a DTD I'm
validating against.
I thought I could do this by opening the UTF-8 file containing the .toxml()
output and replacing the <?xml?> line with the proper one and then adding
the <!DOCTYPE>. This is proving problematic: I get no errors or tracebacks,
but no replacement either. I'm using Python 2.1's re module, which I read is
Unicode-aware, but I'm obviously doing something wrong. Here's what I'm
doing:
import codecs
import re

f = codecs.open('outfromtoxml', 'rb', 'UTF-8')
g = codecs.open('new', 'wb', 'UTF-8')
file_string = f.read()
f.close()
bad_xml_pi = u'<?xml version="1.0" ?>'
good_xml_pi = u'<?xml version="1.0" encoding="UTF-8" ?>'
good_doctype = u'<!DOCTYPE ...... I'll spare you ...>'
(new_result, n) = re.subn(bad_xml_pi, good_xml_pi + good_doctype, file_string)
g.write(file_string)
g.close()
I get the output without the hoped-for change. I tried using .readlines
instead of .read, but oddly got only an empty list. From what I read, it
appears .readlines probably can't interpret line breaks in Unicode
files. Shouldn't .read and re work?
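
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the snippet above passes bad_xml_pi to re.subn as a regular expression, so "?" and "." act as metacharacters and the pattern never matches the literal header. It also writes file_string rather than new_result. The version below assumes a current Python 3 interpreter (not 2.1) and works on an in-memory string. The DOCTYPE value and the sample document are hypothetical stand-ins, since the real ones are elided in the post.

```python
import re

bad_xml_pi = '<?xml version="1.0" ?>'
good_xml_pi = '<?xml version="1.0" encoding="UTF-8" ?>'
# Hypothetical stand-in for the real DOCTYPE line, which the post elides.
good_doctype = '<!DOCTYPE doc SYSTEM "doc.dtd">'

# Hypothetical sample of what the file read from disk might contain.
file_string = bad_xml_pi + '\n<doc>r\u00e4ksm\u00f6rg\u00e5s</doc>'

# Unescaped, "<?" means "an optional <" and " ?" means "an optional space",
# so the pattern does not match the literal header text.
_, n_raw = re.subn(bad_xml_pi, good_xml_pi, file_string)

# re.escape() turns the header into a literal pattern; count=1 limits the
# substitution to the first occurrence.
replacement = good_xml_pi + '\n' + good_doctype
new_result, n_fixed = re.subn(re.escape(bad_xml_pi), replacement,
                              file_string, count=1)

# For a fixed string, str.replace avoids regular expressions altogether.
same_result = file_string.replace(bad_xml_pi, replacement, 1)

# Whichever route is taken, new_result (not file_string) is what has to be
# written to the output file.
```

Here n_raw counts the substitutions made by the unescaped pattern and n_fixed those made by the escaped one, which makes the difference easy to check before touching any files.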
Alternatively, I can see how to add comments and processing instructions
using dom.minidom, but I see no way to do the two replacements above in
dom.minidom. Any ideas?
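
For comparison, here is a minimal sketch of how both pieces can be produced from within minidom itself, assuming a current Python 3 standard library. Whether the minidom shipped with Python 2.1 offers the same calls is not confirmed here. The element name "doc" and the system identifier "doc.dtd" are hypothetical placeholders for the real document and DTD.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# Hypothetical names: "doc" as the root element, "doc.dtd" as the DTD.
doctype = impl.createDocumentType("doc", None, "doc.dtd")

# Passing the doctype to createDocument attaches it to the new document.
dom = impl.createDocument(None, "doc", doctype)
dom.documentElement.appendChild(
    dom.createTextNode("r\u00e4ksm\u00f6rg\u00e5s"))

# With an encoding argument, toxml() returns bytes and includes the
# encoding in the XML declaration.
xml_bytes = dom.toxml(encoding="UTF-8")
```

The resulting bytes can then be written to a file opened in binary mode, with no further text substitution needed.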
Thank you for your kind help!
Best regards,
Mike