[Python-Dev] XML codec?

"Martin v. Löwis" martin at v.loewis.de
Thu Nov 8 19:39:26 CET 2007


> Then how about the suggested "xml-auto-detect"?

That is better.

>> Then, I'd claim that the problem that the codec solves doesn't really
>> exist. IOW, most XML parsers implement the auto-detection of encodings,
>> anyway, and this is where architecturally this functionality belongs.
> 
> But not all XML parsers support all encodings. The XML codec makes it
> trivial to add this support to an existing parser.

I would like to question this claim. Can you give an example of a parser
that doesn't support a specific encoding and where adding such a codec
solves that problem?

In particular, why would that parser know how to process Python Unicode
strings?

> Furthermore encoding-detection might be part of the responsibility of
> the XML parser, but this decoding phase is totally distinct from the
> parsing phase, so why not put the decoding into a common library?

I would not object to that - only to exposing it as a codec. Adding it
to the XML library is fine, IMO.

> There's a (currently undocumented) codecs.detect_xml_encoding() in the
> patch. We could document this function and make it public. But if
> there's no codec that uses it, this function IMHO doesn't belong in the
> codecs module. Should this function be available from xml/__init__.py or
> should we put it into something like xml/utils.py?

Either would be fine.
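
For illustration only, detection as a plain library function (no codec
involved) might look roughly like the sketch below. It is written for
present-day Python 3 and is NOT the detect_xml_encoding() from the patch:
the name sniff_xml_encoding, the default argument, and the choice of cp037
for the EBCDIC case are all assumptions, and the byte patterns follow the
non-normative autodetection appendix of the XML 1.0 spec only loosely.

```python
import codecs
import re

# Encoding declaration in an ASCII-compatible byte stream.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML byte string (hypothetical helper)."""
    # Byte order marks. UTF-32 must be tested before UTF-16 because the
    # UTF-32-LE BOM begins with the UTF-16-LE BOM.
    for bom, name in (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if data.startswith(bom):
            return name
    # No BOM: look at how "<?xm" is spelled in the first bytes.
    if data.startswith(b"\x4c\x6f\xa7\x94"):
        # Some EBCDIC flavour; picking cp037 here is an assumption.
        return "cp037"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # ASCII-compatible: trust the encoding declaration if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

A caller would then do data.decode(sniff_xml_encoding(data)) itself, or
hand the bytes to a parser as before.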

>> Finally, I think the codec is incorrect. When saving XML to a file
>> (e.g. in a text editor), there should rarely be encoding errors, since
>> one could use character references in many cases.
> 
> This requires some intelligent fiddling with the errors attribute of the
> encoder.

Much more than that, I think - you cannot use a character reference
in an XML Name. So the codec would have to parse the output stream
to know whether or not a character reference could be used.
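
A small sketch of the problem, in present-day Python 3 and using only the
stock "xmlcharrefreplace" error handler (the sample documents and variable
names are made up for illustration): the handler is fine for character
data, but it happily rewrites a Name as well, and the result is no longer
well-formed.

```python
import xml.etree.ElementTree as ET

# A non-ASCII character in character data: a reference is legal here.
content_bytes = "<p>caf\u00e9</p>".encode("ascii", "xmlcharrefreplace")
# content_bytes is b'<p>caf&#233;</p>'
parsed_text = ET.fromstring(content_bytes).text  # the reference is resolved

# The same character in an element name: the handler cannot know that
# a character reference is not allowed in this position.
name_bytes = "<caf\u00e9/>".encode("ascii", "xmlcharrefreplace")
# name_bytes is b'<caf&#233;/>'
try:
    ET.fromstring(name_bytes)
    name_is_well_formed = True
except ET.ParseError:
    name_is_well_formed = False
```

The encoder only ever sees a stream of characters, so it has no way of
telling the two cases apart without tracking the XML syntax itself.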

> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
> rather simple though.

But it does! cp037 is EBCDIC, and supported by Python.
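
This is easy to check (shown here in present-day Python 3; the variable
names are only for illustration). The first four bytes of "<?xml" under
cp037 are the 4C 6F A7 94 pattern that the XML spec's autodetection
appendix lists for EBCDIC:

```python
import codecs

# The codec is registered under this name.
codec_name = codecs.lookup("cp037").name

# "<?xml" encoded as EBCDIC (code page 037).
ebcdic_prefix = "<?xml".encode("cp037")
first_four = ebcdic_prefix[:4]  # bytes 4C 6F A7 94

# Decoding gives the original text back.
round_trip = ebcdic_prefix.decode("cp037")
```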

Regards,
Martin

