[Python-Dev] XML codec?

Thu Nov 8 22:01:24 CET 2007

Martin v. Löwis wrote:

>> Then how about the suggested "xml-auto-detect"?
> 
> That is better.

OK.

>>> Then, I'd claim that the problem that the codec solves doesn't really
>>> exist. IOW, most XML parsers implement the auto-detection of encodings,
>>> anyway, and this is where architecturally this functionality belongs.
>> But not all XML parsers support all encodings. The XML codec makes it
>> trivial to add this support to an existing parser.
> 
> I would like to question this claim. Can you give an example of a parser
> that doesn't support a specific encoding

It seems that e.g. expat doesn't support UTF-32:

from xml.parsers import expat

p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)

This fails with:

Traceback (most recent call last):
   File "gurk.py", line 6, in <module>
     p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Replace "utf-32" with "utf-16" and the problem goes away.

> and where adding such a codec
> solves that problem?
> 
> In particular, why would that parser know how to process Python Unicode
> strings?

It doesn't have to. You can use the XML codec to decode the bytes and
re-encode the resulting unicode string (forcing an encoding that the
parser knows):

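import codecs
from xml.parsers import expat

# "xml-auto-detect" is the suggested name for the codec from the patch
ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
# Decode with auto-detection, then re-encode as UTF-8, which expat accepts
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)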
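
Without the codec, roughly the same effect can be sketched with the
standard library alone. This is only an illustration and assumes the
source encoding is already known; reencode_xml(), element_names() and
the regular expression are made up for this example and are not part of
the patch:

```python
import re
from xml.parsers import expat

# Matches the encoding pseudo-attribute of an XML declaration at the
# start of an already decoded document (illustrative, not a full parser).
_ENCODING_DECL = re.compile(
    r"""^(<\?xml\s[^>]*?\bencoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)

def reencode_xml(data, source_encoding, target_encoding="utf-8"):
    """Decode data, rewrite the declared encoding, and encode again."""
    text = data.decode(source_encoding)
    text = _ENCODING_DECL.sub(
        lambda m: m.group(1) + m.group(2) + target_encoding + m.group(2),
        text,
        count=1,
    )
    return text.encode(target_encoding)

def element_names(data):
    """Parse data with expat and collect the names of all start tags."""
    names = []
    p = expat.ParserCreate()
    p.StartElementHandler = lambda name, attrs: names.append(name)
    p.Parse(data, True)
    return names

e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
names = element_names(reencode_xml(s, e))
```

The declaration has to be rewritten along with the bytes; otherwise the
parser would be told "utf-32" while reading UTF-8.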

>> Furthermore encoding-detection might be part of the responsibility of
>> the XML parser, but this decoding phase is totally distinct from the
>> parsing phase, so why not put the decoding into a common library?
> 
> I would not object to that - just to exposing it as a codec. Adding it
> to the XML library is fine, IMO.

But it does make sense as a codec. The decoding phase of an XML parser 
has to turn a byte stream into a unicode stream. That's the job of a codec.
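
To illustrate what that decoding phase involves, here is a rough sketch
of the detection step. It is not the detect_xml_encoding() from the
patch; the function names are invented for this example, and it only
handles a byte order mark or an ASCII-compatible XML declaration (UTF-16
or UTF-32 input without a BOM is not covered):

```python
import codecs
import re

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Encoding pseudo-attribute of an XML declaration in ASCII-compatible bytes.
_BYTES_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)

def guess_xml_encoding(data):
    """Return a codec name for data, falling back to UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    m = _BYTES_DECL.match(data)
    if m:
        return m.group(2).decode("ascii")
    return "utf-8"

def decode_xml(data):
    """Turn XML bytes into a unicode string using the guessed encoding."""
    return data.decode(guess_xml_encoding(data))
```

Nothing in this step depends on the parser, which is why it can live in
a common place.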

>> There's a (currently undocumented) codecs.detect_xml_encoding() in the
>> patch. We could document this function and make it public. But if
>> there's no codec that uses it, this function IMHO doesn't belong in the
>> codecs module. Should this function be available from xml/__init__.py or
>> should we put it into something like xml/utils.py?
> 
> Either - or.

OK, so should I put the C code into a _xml module?

>>> Finally, I think the codec is incorrect. When saving XML to a file
>>> (e.g. in a text editor), there should rarely be encoding errors, since
>>> one could use character references in many cases.
>> This requires some intelligent fiddling with the errors attribute of the
>> encoder.
> 
> Much more than that, I think - you cannot use a character reference
> in an XML Name. So the codec would have to parse the output stream
> to know whether or not a character reference could be used.

That's what I meant by "intelligent" fiddling. But I agree this is way 
beyond what a text editor should do. AFAIK it is way beyond what 
existing text editors do. However using the XML codec would at least 
guarantee that the encoding specified in the XML declaration and the 
encoding used for encoding the file stay consistent.
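
As a minimal sketch of that consistency guarantee (again not the code
from the patch; encode_xml() and its regular expression are hypothetical
and only cover a simple XML declaration at the very start of the text):

```python
import re

# Encoding pseudo-attribute of an XML declaration at the start of a
# unicode string; group 3 is the declared encoding name.
_DECL = re.compile(
    r"""^(<\?xml\s[^>]*?\bencoding\s*=\s*)(["'])([A-Za-z][A-Za-z0-9._-]*)\2"""
)

def encode_xml(text, encoding=None):
    """Encode text so that declaration and byte encoding always agree.

    With encoding=None the declared encoding is used (UTF-8 if there is
    no declaration). With an explicit encoding the declaration, if
    present, is rewritten to name that encoding.
    """
    m = _DECL.match(text)
    if encoding is None:
        encoding = m.group(3) if m else "utf-8"
    elif m:
        text = text[:m.start(3)] + encoding + text[m.end(3):]
    return text.encode(encoding)
```

An editor saving through something like this could not end up with a
file whose declaration says one thing and whose bytes say another.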

>> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
>> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
>> rather simple though.
> 
> But it does! cp037 is EBCDIC, and supported by Python.

I didn't know that. I'm going to update the patch.
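
For the record, the detection part could look roughly like the sketch
below. The helper name is made up for this example, and treating every
EBCDIC document as cp037 is an assumption, since other EBCDIC code pages
share the same leading bytes:

```python
# "<?xm" encoded in cp037; an XML declaration in EBCDIC starts with
# these four bytes.
EBCDIC_XML_PREFIX = b"\x4c\x6f\xa7\x94"

def looks_like_ebcdic_xml(data):
    """Return True if data starts with an EBCDIC-encoded '<?xm'."""
    return data.startswith(EBCDIC_XML_PREFIX)

doc = u"<?xml version='1.0' encoding='cp037'?><foo/>"
ebcdic_bytes = doc.encode("cp037")
if looks_like_ebcdic_xml(ebcdic_bytes):
    # Assumption: cp037 is the right EBCDIC flavour for this input.
    decoded = ebcdic_bytes.decode("cp037")
```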

Servus,
    Walter