[Python-Dev] XML codec?

Mon Nov 12 10:39:21 CET 2007

Martin v. Löwis wrote:
>> I don't know. Is an XML document ill-formed if it doesn't contain an
>> XML declaration, is not in UTF-8 or UTF-16, but there's external
>> encoding info?
> 
> If there is external encoding info, matching the actual encoding,
> it would be well-formed. Of course, preserving that information would
> be up to the application.

OK. When the application passes an encoding to the decoder, this is
supposed to be the external encoding info. So for the encoder it makes
sense to assume that the encoding passed to it is likewise the external
encoding info and will be transmitted along with the encoded bytes.

>> This looks good. Now we would have to extend the code to detect and
>> replace the encoding in the XML declaration too.
> 
> I'm still opposed to making this a codec. Right - for a pure Python
> solution, the processing of the XML declaration would still need to
> be implemented.
> 
>>> I think there could be a much simpler routine to have the same 
>>> effect. - if it's less than 4 bytes, answer "need more data".
>> Can there be an XML document that is less than 4 bytes? I guess not.
> 
> No, the smallest document has exactly 4 characters (e.g. "<f/>").
> However, external entities may be smaller, such as "x".
> 
>> But anyway: would a Python implementation of these two functions
>> (detect_encoding()/fix_encoding()) be accepted?
> 
> I could agree to a Python implementation of this algorithm as long
> as it's not packaged as a codec.

I still can't understand your objection to a codec. What's the
difference between UTF-16 decoding and XML decoding? In fact PEP 263
IMHO does specify how to decode Python source, so in theory it could be
a codec (in practice this probably wouldn't work because of
bootstrapping problems).

Servus,
   Walter