[XML-SIG] xmlproc bug ?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 6 Sep 2001 16:46:34 +0200


> Is the following behaviour a well known feature or a bug (or me deeply
> misunderstanding SAX)? It looks like xmlproc's Sax2 driver won't produce
> UTF-8 encoded text when parsing an iso-8859-1 encoded file.
> 
> The attached file demonstrates this. I tested it on python 1.5.2 and
> 2.1.1, using PyXML 0.6.6. 
> 
> I'll register this in the bugtracker if it turns out to be a bug. 

I think I got the full story now. xmlproc, in principle, was capable
of performing charset conversions itself. However,

1. To do so, one had to call set_data_charset on the parser. The SAX
   driver doesn't do that; you'll have to add

            parser.set_data_charset("utf-8")

   into drv_xmlproc.XmlprocDriver.parse.

2. Once you do so, xmlproc will try to perform conversions. In 0.6.6,
   this fails, because the UTF-8 conversion code in xmlproc.charconv
   is disabled, as noted by this comment:

            # UTF-8 stuff disabled due to total lack of speed

   If you enable the lines below that comment, the parser will attempt
   charset conversion, but it will indeed slow down significantly.

3. The parser instantiates the conversion function only after it has
   seen the encoding= attribute. In your example, it has already
   converted the first chunk of data using the old converter (identity
   conversion), and it fails to convert the rest of this chunk using the
   new converter. As a result, it will still pass those data as
   Latin-1 to the application.

So in short, it doesn't work at all.
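
To illustrate point 3, here is a minimal sketch in present-day Python 3
(not the Python 1.5.2/2.1 discussed in this thread). It is a toy model,
not xmlproc's code: parse_buggy, parse_fixed and the two converter
functions are hypothetical names, and it assumes the whole document
arrives as a single chunk that contains the XML declaration.

```python
DECL_END = b"?>"
LATIN1_DECL = b'encoding="iso-8859-1"'


def identity(data):
    # Default converter: pass the bytes through unchanged.
    return data


def latin1_to_utf8(data):
    # Re-encode Latin-1 bytes as UTF-8.
    return data.decode("iso-8859-1").encode("utf-8")


def parse_buggy(chunk):
    # Toy model of point 3: the whole chunk is converted with the old
    # (identity) converter before the encoding declaration is seen.
    converter = identity
    data = converter(chunk)
    end = data.index(DECL_END) + len(DECL_END)
    if LATIN1_DECL in data[:end]:
        converter = latin1_to_utf8  # too late for this chunk
    return data[end:]  # still Latin-1


def parse_fixed(chunk):
    # Toy model of point 4: once the encoding is known, the rest of
    # the chunk is converted with the new converter.
    end = chunk.index(DECL_END) + len(DECL_END)
    converter = identity
    if LATIN1_DECL in chunk[:end]:
        converter = latin1_to_utf8
    return converter(chunk[end:])


doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'.encode("iso-8859-1")
```

With this input, parse_buggy(doc) hands back the byte 0xE9 unchanged,
while parse_fixed(doc) yields the two-byte UTF-8 sequence 0xC3 0xA9.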

4. In the CVS version, the Unicode API is used where available. For that, care
   was taken to convert the rest of the data once the encoding has
   been detected. That's why it 'works' with Py 2.1 and the CVS xmlproc.

5. In the process of integrating Unicode, it was considered pointless
   to allow accented characters in element names if Unicode is not
   available. Those accented characters would have been good only if
   the input is Latin-1, and only if conversion to UTF-8 is not
   performed. They also constitute only a small subset of the legal
   NCName characters. As a result, names are now restricted to ASCII
   letters in CVS PyXML. In turn, your document is rejected with the
   CVS PyXML on Py 1.5.2.
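
The effect of point 5 can be sketched roughly as below, again in
present-day Python 3. The pattern is only my assumption of what
"restricted to ASCII letters" amounts to; it is not copied from xmlproc,
and is_ascii_name is a hypothetical helper.

```python
import re

# Assumed ASCII-only approximation of an NCName: an ASCII letter or
# underscore, followed by ASCII letters, digits, '.', '-' or '_'.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ascii_name(name):
    # True only if the whole name matches the ASCII-only pattern.
    return ASCII_NCNAME.fullmatch(name) is not None
```

Under this assumption a name such as "element" is accepted, whereas a
name containing an accented letter (for example "\xe9l\xe9ment") is
rejected, which is the kind of rejection described above.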

Hope this clarifies it. If you think anything should be done about it,
I can assist in drafting a patch - although I won't attempt to create
one myself.

Regards,
Martin