[I18n-sig] XML and UTF-16

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 1 Jun 2001 14:59:37 +0200


> > > Yes, I think this would be a good idea. I would use something along
> > > the lines of:
> > 
> > Please have a look at
> > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost
> > follows the procedure in the XML recommendation, except that it does
> > not expect "unusual" byte orders (2143, 3412), and that it does not
> > detect EBCDIC.
> 
> I don't have a file EntityParser in the xmlproc subdir... is
> that in CVS somewhere ?

Oops, missed one level of indirection:

xml.parsers.xmlproc.xmlutils.EntityParser.autodetect_encoding

And yes, the function is only in CVS, not in a released version
(yet).
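
For readers without a CVS checkout, the detection step can be sketched
roughly as follows. This is not the xmlproc code; it is a minimal
illustration of the byte patterns listed in the XML recommendation's
appendix on autodetection, and the name sniff_xml_encoding is
hypothetical. Like the function discussed above, it ignores the
unusual byte orders and EBCDIC.

```python
import codecs

# Byte order marks. UTF-32 must be tested before UTF-16, because the
# UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# How the start of "<?xml" looks in each encoding family without a BOM.
_PATTERNS = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    # ASCII superset: the encoding declaration has the final word.
    (b"\x3c\x3f\x78\x6d", "utf-8"),
]


def sniff_xml_encoding(data):
    """Guess the encoding family from the first bytes of an XML entity.

    Returns a codec name, or None when no pattern matches (the caller
    would then normally assume UTF-8).
    """
    for prefix, name in _BOMS + _PATTERNS:
        if data.startswith(prefix):
            return name
    return None
```

The result is only a starting point: for the ASCII-superset case the
parser still has to read the encoding declaration to learn the actual
encoding.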

> Could we maybe have the function autodetect_encoding at
> some higher level in PyXML ?! This is a very basic API and
> doesn't only apply to xmlproc.

We might (contributions are welcome). However, such a function would
not necessarily be usable for xmlproc: xmlproc deals with reading data
in small chunks, expecting that information may be broken at arbitrary
boundaries. For example, would you expect that the autodetection
function looks for the encoding= attribute? That may not be included
in the first fragment of data.
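
To illustrate the chunking problem, here is a rough sketch of a helper
that buffers incoming data until the XML declaration is complete
before looking for encoding=. The class name DeclarationSniffer and
its limit parameter are hypothetical, and the sketch assumes an
ASCII-compatible encoding; it is not how xmlproc does it.

```python
import re

# Matches encoding="..." or encoding='...' inside the declaration.
_ENCODING_RE = re.compile(
    rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


class DeclarationSniffer:
    """Hypothetical helper: collects chunks until '?>' has been seen."""

    def __init__(self, limit=1024):
        self.buffer = b""
        self.limit = limit  # give up after this many buffered bytes

    def feed(self, chunk):
        """Return the declared encoding, "" if there is none,
        or None if more data is needed to decide."""
        self.buffer += chunk
        # The data seen so far must still be a possible start of "<?xml".
        if not b"<?xml".startswith(self.buffer[:5]):
            return ""
        end = self.buffer.find(b"?>")
        if end < 0:
            # Declaration not finished yet; wait unless it is too long.
            return "" if len(self.buffer) > self.limit else None
        match = _ENCODING_RE.search(self.buffer, 0, end)
        return match.group(1).decode("ascii") if match else ""
```

A caller would keep feeding chunks while the result is None, which is
exactly the kind of state a one-shot autodetection function cannot
keep.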

> I also think that it would be worthwhile adding a similar
> API to codecs.py which takes the magic ('<?xml' in this case)
> as argument and then tries to determine whether the input
> data is an ASCII superset, UTF-8 or UTF-16/32.

I don't think so. Doing the XML autodetection is not terribly
complicated, and rarely needs to be done - you'd normally pass the
byte stream to an XML parser, so you would not need to care about the
encoding.

As for XML and encodings, having a convenient mechanism to extend
existing codecs to encode unknown characters as character entities is
much more important, IMO, since that is very difficult to achieve with
the existing API.
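
As an illustration of what is meant, the following sketch encodes a
string one character at a time and falls back to a numeric character
reference whenever the target encoding cannot represent a character.
The function name encode_with_charrefs is hypothetical, and the
per-character loop is only safe for simple stateless encodings such as
ASCII or Latin-1; it bypasses the codec's own loop instead of
extending it, which is why a proper mechanism would be preferable.

```python
def encode_with_charrefs(text, encoding):
    """Encode text, replacing unencodable characters by &#N; references.

    Assumes a stateless, ASCII-compatible target encoding.
    """
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(encoding))
        except UnicodeEncodeError:
            # Fall back to a decimal numeric character reference.
            pieces.append(("&#%d;" % ord(ch)).encode(encoding))
    return b"".join(pieces)
```

For example, encoding a string containing the euro sign to Latin-1
this way keeps the Latin-1 characters as single bytes and writes the
euro sign as &#8364;. Later Python releases provide the same behaviour
through the "xmlcharrefreplace" error handler of str.encode.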

Regards,
Martin