[XML-SIG] Parsing a unicode string

Mike Brown mike at skew.org
Tue Oct 5 05:57:15 CEST 2004


konrad.hinsen at laposte.net wrote:
> Is there any straightforward way to parse XML data from an existing
> Unicode string? I have found a procedure that works, but seems
> unnecessarily complicated:
> 
>       xml_data = xml_data.encode('utf-8')
>       xml_data = '\n'.join(xml_data.split('\n')[1:])
>       xml_data = '<?xml version="1.0" encoding="utf-8"?>\n' + xml_data
>       dom_tree = xml.dom.minidom.parseString(xml_data)
        
You've got the general idea:
1. Encode it.
2. Notify the parser of the encoding.
  
Parsers should have a way to be told the encoding externally;
by the rules of XML parsing as defined in the XML spec, such an
external declaration takes precedence over the encoding
declaration inside the document itself.
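
To illustrate what "notified externally" means, here is a minimal
sketch, assuming Python 3 and the standard library's expat wrapper,
whose ParserCreate() accepts an encoding argument that overrides the
document's own declaration. The helper name text_of() and the sample
document are made up for the example.

```python
import xml.parsers.expat


def text_of(data, encoding):
    """Parse `data` (bytes) and return its character data, telling
    the parser the encoding externally. text_of is a hypothetical
    helper written for this example."""
    chunks = []
    # The encoding passed here overrides the in-document declaration.
    parser = xml.parsers.expat.ParserCreate(encoding=encoding)
    # The handler may be called several times, so collect the pieces.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return ''.join(chunks)


# The declaration claims iso-8859-1, but the bytes are really UTF-8.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'
text = text_of(doc.encode('utf-8'), 'utf-8')
```

Without the external override, the parser would trust the
declaration and decode the two UTF-8 bytes of the accented letter
as two separate Latin-1 characters.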
  
Unfortunately, even though parsers may have the necessary APIs,
their Python wrappers may not. We didn't even add it to 4Suite
until a few months ago.
 
So you're pretty much left with the option you've gone with:
rewrite the encoding declaration in the XML itself.

The method you are using in the code above assumes a bit more
than it should: it drops the first line unconditionally, but there
might not be any line breaks, or any XML declaration, at all.
A regular expression would be better.
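
A rough sketch of the regular-expression approach, assuming
Python 3 (where the input is a str and the parser is handed bytes).
The names XML_DECL and parse_unicode are made up for the example,
and like your code it forces version 1.0 in the new declaration:

```python
import re
import xml.dom.minidom

# An XML declaration, if present, must be the very first thing in
# the document. Requiring whitespace after "<?xml" keeps this from
# matching processing instructions such as <?xml-stylesheet ...?>.
XML_DECL = re.compile(r'\A<\?xml\s.*?\?>\s*', re.DOTALL)


def parse_unicode(text):
    """Parse XML held in a str (hypothetical helper)."""
    # Drop the existing declaration, if there is one.
    body = XML_DECL.sub('', text, count=1)
    # Prepend a declaration that matches the encoding used below.
    data = '<?xml version="1.0" encoding="utf-8"?>\n' + body
    return xml.dom.minidom.parseString(data.encode('utf-8'))


# No line breaks, and a declaration naming a different encoding.
doc = parse_unicode(
    '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>')
# No declaration at all.
bare = parse_unicode('<a>x</a>')
```

Unlike splitting on '\n', this leaves the document untouched when
there is no declaration, and it does not care whether the
declaration is followed by a line break.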

-Mike

