[XML-SIG] xmlproc bug ?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Thu, 6 Sep 2001 16:46:34 +0200
> Is the following behaviour a well known feature or a bug (or me deeply
> misunderstanding SAX)? It looks like xmlproc's Sax2 driver won't produce
> UTF-8 encoded text when parsing an iso-8859-1 encoded file.
>
> The attached file demonstrates this. I tested it on python 1.5.2 and
> 2.1.1, using PyXML 0.6.6.
>
> I'll register this in the bugtracker if it turns out to be a bug.

I think I have the full story now. xmlproc was, in principle, capable
of performing charset conversions itself. However,
1. To do so, one had to call set_data_charset on the parser. The SAX
   driver doesn't, so you'll have to add

       parser.set_data_charset("utf-8")

   to drv_xmlproc.XmlprocDriver.parse.
2. Once you do so, xmlproc will try to perform conversions. In 0.6.6,
   this fails, because of the comment in xmlproc.charconv:

       # UTF-8 stuff disabled due to total lack of speed

   If you enable the lines below that comment, the parser will attempt
   charset conversion, but it will indeed slow down significantly.
3. The parser instantiates the conversion function only after it has
   seen the encoding= attribute. In your example, it has already
   converted the first chunk of data using the old converter (identity
   conversion), and it fails to convert the rest of this chunk using
   the new converter. As a result, it will still pass that data as
   Latin-1 to the application.

So in short, it doesn't work at all.

4. In CVS, the Unicode API is used where available. For that, care
   was taken to convert the rest of the data once the encoding has
   been detected. That's why it 'works' with Py 2.1 and the CVS xmlproc.
5. In the process of integrating Unicode, it was considered pointless
   to allow accented characters in element names if Unicode is not
   available. Those accented characters would have been good only if
   the input is Latin-1, and only if conversion to UTF-8 is not
   performed. They also constitute only a small subset of the legal
   NCName characters. As a result, names are now restricted to ASCII
   letters in CVS PyXML. In turn, your document is rejected by the
   CVS PyXML on Py 1.5.2.
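
To make items 3 and 4 concrete, here is a minimal sketch of the
difference between leaving the first chunk untouched and re-converting
the remainder of that chunk once the encoding declaration has been seen.
It is not xmlproc's code: ENCODING_DECL, to_utf8_naive and to_utf8_fixed
are hypothetical names invented for the illustration, and it is written
for a current Python rather than 1.5.2 or 2.1.

```python
import re

# Hypothetical pattern for an XML declaration carrying encoding=...;
# a real parser handles far more cases than this sketch does.
ENCODING_DECL = re.compile(
    rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\'][^>]*\?>'
)


def to_utf8_naive(chunk):
    """Item 3: the whole first chunk has already gone through the
    identity converter, so its bytes stay in the source encoding."""
    return chunk


def to_utf8_fixed(chunk):
    """Item 4: once the declaration is seen, the rest of the same
    chunk is converted with the newly selected converter."""
    match = ENCODING_DECL.match(chunk)
    if match is None:
        return chunk  # no declaration found: leave the data untouched
    encoding = match.group(1).decode("ascii")
    head, rest = chunk[:match.end()], chunk[match.end():]
    # The declaration itself is plain ASCII, so only the rest needs work.
    return head + rest.decode(encoding).encode("utf-8")


chunk = b'<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\xe9</doc>'
naive = to_utf8_naive(chunk)   # still contains the Latin-1 byte 0xE9
fixed = to_utf8_fixed(chunk)   # contains the UTF-8 sequence 0xC3 0xA9
```

With the naive variant the application receives Latin-1 bytes even
though it asked for UTF-8, which matches the behaviour you observed.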
Hope this clarifies it. If you think anything should be done about it,
I can assist in drafting a patch - although I won't attempt to create
one myself.
Regards,
Martin