Which XML parser under Python can handle Unicode?

Martin v. Löwis loewis at informatik.hu-berlin.de
Tue Sep 3 06:56:29 EDT 2002


jhorneman at pobox.com (Jurie Horneman) writes:

> Could someone please recommend an XML parser under Python which can
> handle Unicode? 

The "standard" parser of Python 2.x, i.e. xml.parsers.expat, supports
Unicode quite well, as do all libraries on top of this parser
(xml.sax, xml.dom.minidom). Since PyXML 0.7, xmlproc supports Unicode
as well.
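
As a rough illustration, here is a minimal sketch of feeding a UTF-8
document to xml.parsers.expat and collecting the character data. It is
written for a current Python 3 interpreter (an assumption; the post
itself is about Python 2.x), and the helper name collect_text and the
sample document are made up for the example.

```python
import xml.parsers.expat

def collect_text(xml_bytes):
    """Parse XML bytes with Expat and return all character data as one str."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so gather and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)  # True marks the final chunk of input
    return "".join(chunks)

# Hypothetical sample covering European, Chinese, Japanese, Korean and Thai text.
doc = ('<?xml version="1.0" encoding="utf-8"?>'
       '<greeting>Grüße, 你好, こんにちは, 안녕하세요, สวัสดี</greeting>').encode("utf-8")

text = collect_text(doc)
print(text)
```

The handler receives Unicode strings whatever the encoding of the
input bytes, so the calling code never has to decode anything itself.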

> Specifically, most European languages and Asian
> languages such as Chinese (traditional, simplified), Japanese, Korean
> and Thai? (Please don't make me look up the actual codetables.)

All of those languages are supported by all Python XML parsers, in
various encodings - but best in UTF-8.

> Ideally the parser would use Python's codec system.

pyexpat/Expat has built-in codecs for UTF-8 and Latin-1, and falls back
to Python codecs, but only for byte-oriented (single-byte) encodings.
xmlproc uses the Python codec system.
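
The difference can be sketched roughly as follows, again assuming a
current Python 3 interpreter. ISO-8859-1 is handled by Expat itself,
while ISO-8859-15 is a single-byte encoding that Expat does not know
and that pyexpat is expected to resolve through the Python codec
registry. The helper name parse_text and both sample documents are
hypothetical.

```python
import xml.parsers.expat

def parse_text(xml_bytes):
    """Return the character data of an XML document given as bytes."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)

# Latin-1: decoded by Expat's built-in codec.
latin1_doc = ('<?xml version="1.0" encoding="iso-8859-1"?>'
              '<p>café</p>').encode("iso-8859-1")

# ISO-8859-15: single-byte, not built into Expat, so the decoding
# table is expected to come from Python's codec for that encoding.
latin9_doc = ('<?xml version="1.0" encoding="iso-8859-15"?>'
              '<p>5 €</p>').encode("iso-8859-15")

print(parse_text(latin1_doc))
print(parse_text(latin9_doc))
```

In both cases the encoding is taken from the XML declaration; the
caller only passes bytes.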

> I'd prefer an event-based parser, although tree-based would not be a
> huge problem.

Both pyexpat and xmlproc are event-based. I'd recommend using SAX on
top of either, which is also event-based.
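
A small SAX sketch of what that could look like, under the same
assumption of a current Python 3 interpreter. The class name
TitleCollector, the element name "title" and the sample document are
invented for the example; xml.sax picks its default parser, which is
normally the Expat driver.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element as a Unicode string."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text pieces while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per text node.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None

# Hypothetical input document, encoded as UTF-8 bytes.
doc = ('<?xml version="1.0" encoding="utf-8"?>'
       '<books><title>東京物語</title><title>Ünïcödé</title></books>').encode("utf-8")

handler = TitleCollector()
xml.sax.parseString(doc, handler)
print(handler.titles)
```

The handler only sees events and Unicode strings, so the same class
should work unchanged with any SAX driver.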

Regards,
Martin



