FromXMLString wanted.

Joonas Paalasmaa joonas at olen.to
Fri May 3 15:58:05 EDT 2002


Doru-Catalin Togea <doru-cat at ifi.uio.no> wrote in message news:<mailman.1020247682.3758.python-list at python.org>...
> Hi!
> 
> I am doing some pretty basic XML parsing using pyxml. My xml data
> (not the tags) contains non-English characters. pyxml for ActiveState
> Python 2.0 did not complain about that even when I did not provide an
> opening line in the xml file stating the encoding used, like:
> 
> <?xml version = '1.0' encoding = 'iso-8859-1'?>
> 
> Strange, but true, and I could live with that.
> 
> I have now upgraded to ActiveState Python 2.2, pyxml 0.7, and it complains
> about the presence of non-English characters, EVEN WHEN SPECIFYING THE
> ENCODING, as above! Strange again, and unfortunately I cannot live with
> that. :-)
> 
> I thought of a hack around it, which would consist of reading my
> xml file into a string, replacing non-English characters with their
> UNICODE encodings and parsing the (xml) string. How do I do that?
> 
> I used to get a DOM by means of:
> 
> -------------
> #from xml.dom.ext.reader.Sax import FromXmlStream
> from xml.dom.ext.reader.Sax import FromXmlFile
> from xml.dom.ext import PrettyPrint
> 
> doc = FromXmlFile(xmlFN)
> -------------
> 
> Now I need the following, or the equivalent from another package:
> 
> from xml.dom.ext.reader.Sax import FromXmlString
> 
> Or maybe there is another better way of achieving the same goal?

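Why don't you just convert the string to Unicode and re-encode it as
UTF-8 before feeding it to the parser?
That way you don't even have to declare the encoding in the file,
because an XML parser assumes UTF-8 when no encoding is declared.
The correct default encoding has to be set in sitecustomize.py in order
to make the example below work, since calling encode() on a plain byte
string first decodes it using the default encoding. See the Python FAQ at
http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.102.htp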

>>> from xml.dom import minidom
>>> non_ascii = """<tag foo="Ä">ÄÖÅ</tag>"""
>>> print minidom.parseString(non_ascii.encode("UTF-8")).toxml()
<?xml version="1.0" ?>
<tag foo="Ä">ÄÖÅ</tag>
>>> print minidom.parseString(non_ascii)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in ?
    print minidom.parseString(non_ascii)
  File "C:\PYTHON\lib\xml\dom\minidom.py", line 977, in parseString
    return _doparse(pulldom.parseString, args, kwargs)
  File "C:\PYTHON\lib\xml\dom\minidom.py", line 964, in _doparse
    toktype, rootNode = events.getEvent()
  File "C:\PYTHON\lib\xml\dom\pulldom.py", line 253, in getEvent
    self.parser.close()
  File "C:\PYTHON\lib\xml\sax\expatreader.py", line 117, in close
    self.feed("", isFinal = 1)
  File "C:\PYTHON\lib\xml\sax\expatreader.py", line 111, in feed
    self._err_handler.fatalError(exc)
  File "C:\PYTHON\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
SAXParseException: <unknown>:1:0: unclosed token
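
For anyone trying this on a current Python 3 interpreter, here is a
minimal sketch of the same idea that avoids relying on sitecustomize.py
by decoding explicitly. The helper name parse_legacy_xml and its
source_encoding parameter are made up for illustration, and the sketch
assumes the data is ISO-8859-1 and carries no XML declaration naming a
different encoding.

```python
from xml.dom import minidom


def parse_legacy_xml(raw_bytes, source_encoding="iso-8859-1"):
    """Hypothetical helper: decode bytes with a known encoding,
    re-encode them as UTF-8, and hand the result to minidom.

    Assumes raw_bytes has no XML declaration that names another
    encoding, so the parser falls back to UTF-8.
    """
    text = raw_bytes.decode(source_encoding)         # bytes -> Unicode
    return minidom.parseString(text.encode("utf-8"))  # Unicode -> UTF-8


# Simulate file contents stored as ISO-8859-1 without a declaration.
raw = '<tag foo="Ä">ÄÖÅ</tag>'.encode("iso-8859-1")

doc = parse_legacy_xml(raw)
root = doc.documentElement
print(root.getAttribute("foo"), root.firstChild.data)
```

For a file on disk, the bytes could be read with
open(xmlFN, "rb").read() and passed to the same helper.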


