[XML-SIG] Q: minidom and iso-8859-1

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 13 Sep 2000 13:03:14 +0200


> Oops, I think I begin to understand what is going on.  The UTF-8
> indeed prints the right result, it was just not the result I (encoding
> newbie) expected.

Hi Jacco,

This is what I suspected. I am still curious, though, what you
expected to see.

> I think I just asked myself the wrong question (how was the original
> XML encoded) while I should have asked myself in what encoding I
> want to have the output in.

I thought you expected UTF-8 to reproduce latin-1 characters in a
single byte, which of course cannot work: latin-1 would then consume
all possible bit combinations, leaving none for any other character.
In any case, that seems to be resolved. The next question is: how do
you get the original encoding of the document?
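
To make the single-byte point concrete, here is a small sketch in
present-day Python (an assumption; it is not the Python of this
thread). It encodes the same character both ways and compares the
byte counts. The variable names are only illustrative.

```python
# One latin-1 character, e with acute accent (U+00E9).
text = "\u00e9"

# latin-1 maps every code point up to U+00FF to exactly one byte.
latin1_bytes = text.encode("iso-8859-1")

# UTF-8 keeps single bytes only for ASCII (up to U+007F); everything
# above that, including the upper half of latin-1, takes two or more.
utf8_bytes = text.encode("utf-8")

print(len(latin1_bytes), len(utf8_bytes))

# Plain ASCII is identical in both encodings.
ascii_same = "a".encode("iso-8859-1") == "a".encode("utf-8")
```

So a document that looks "wrong" after a UTF-8 round trip may simply
be showing the two-byte form of a character that latin-1 stores in
one byte.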

It appears that the DOM itself does not provide any mechanism for
that. It may be that the reader passes this information to the DOM
builder, so you may need to hook into the parser. However, it also
appears that SAX does not generate an event for the XML declaration
(the <?xml header), so you would have to use a specific parser with
an extended interface.

I know xmllib invokes handle_xml for that. I don't know whether expat
gives access to that information; it appears as if the default handler
would be invoked when <?xml is seen, with the encoding as a parameter.
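
As a hedged illustration of hooking into the parser: the
xml.parsers.expat module in current Python exposes an XmlDeclHandler
callback that receives the version, encoding and standalone values
from the XML declaration. Whether the expat binding discussed in this
thread offered the same hook is not established here. The helper name
declared_encoding is hypothetical.

```python
import xml.parsers.expat


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None.

    Hypothetical helper; assumes a current Python whose
    xml.parsers.expat parser objects support XmlDeclHandler.
    """
    found = {}

    def on_xml_decl(version, encoding, standalone):
        # encoding is None when the declaration does not name one.
        found["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(data, True)  # True marks this as the final chunk
    return found.get("encoding")


sample = b'<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\xe9</doc>'
print(declared_encoding(sample))
```

The handler is never called for a document without an XML
declaration, so in that case the helper returns None and the caller
has to fall back to the XML default of UTF-8 (or UTF-16 with a byte
order mark).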

Regards,
Martin