[XML-SIG] Processing xml files with ISO 8859-1 chars

Martin v. Loewis martin@v.loewis.de
Thu, 8 Nov 2001 09:28:47 +0100


> Doesn't handling non-standard (with respect to XML) encodings
> imply conversion to Unicode somehow?
> E.g. in XML, names are further restricted to specific Unicode characters...

That certainly implies additional well-formedness constraints on the
encoding. However, in real life, these constraints practically never
lead to the rejection of a document: most users restrict themselves to
ASCII in element and attribute names, and use non-ASCII characters
only in character content (i.e. not in markup). Therefore, encoding
problems in a single-byte encoding usually go undetected.
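For illustration, a quick sketch using the expat binding from the
Python standard library (the document bytes are made up): byte 0xB1
means U+00B1 in iso-8859-1 but U+0105 in iso-8859-2, and the parser
has no way to notice which one was intended.

    import xml.parsers.expat

    # Declared iso-8859-1; the 0xB1 byte could equally well be
    # iso-8859-2 text. Both readings are well-formed XML, so the
    # parser accepts the document either way.
    doc = b'<?xml version="1.0" encoding="iso-8859-1"?><text>\xb1</text>'

    p = xml.parsers.expat.ParserCreate()
    p.Parse(doc, True)
    print("parsed without complaint")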

> I mean even ASCII contains characters that are not allowed in XML documents
> (such as 0x00, 0x01...). 

That doesn't help much. *No* encoding allows those bytes in XML
(except UTF-16, where a 0x00 byte can occur as half of a valid
two-byte code unit). So if the only error in the document is that the
parser uses the wrong encoding, this aspect won't lead to detection of
the problem, either.
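A sketch of that, again with expat (hypothetical bytes): the 0x01 byte
gets the document rejected as not well-formed, but the diagnostic says
nothing about the declared encoding being wrong.

    import xml.parsers.expat

    # Byte 0x01 maps to U+0001 in any single-byte encoding, and
    # U+0001 is not an XML character, so the document is rejected
    # regardless of which encoding it claims.
    doc = b'<?xml version="1.0" encoding="iso-8859-1"?><text>\x01</text>'

    p = xml.parsers.expat.ParserCreate()
    try:
        p.Parse(doc, True)
    except xml.parsers.expat.ExpatError as e:
        print("rejected:", e)   # not well-formed (invalid token)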

> The same applies to ISO 8859-x (since they are ASCII-based). Apart
> from that, any byte within [\x00-\x7F\xA0-\xFF] is valid ISO 8859-x,
> so checking is easy rather than requiring AI.

How does that help? If the document was declared as iso-8859-1, but
really is iso-8859-2, we cannot detect that fact. If the document
really is KOI8-R, we cannot detect that fact. If the document really
is UTF-8, we cannot detect that fact.

About the only case that *can* be detected is when the document is
declared as UTF-8 (e.g. by leaving out the XML declaration), and it
isn't.
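A sketch of that one case (hypothetical bytes): with no declaration
the parser assumes UTF-8, and a lone iso-8859-1 byte such as 0xE9 is
not a valid UTF-8 sequence, so the mislabelling actually surfaces as a
parse error.

    import xml.parsers.expat

    # No XML declaration, so the parser assumes UTF-8. The byte
    # 0xE9 would need continuation bytes to be valid UTF-8, so the
    # iso-8859-1 text is flagged as not well-formed.
    doc = b'<text>caf\xe9</text>'

    p = xml.parsers.expat.ParserCreate()
    try:
        p.Parse(doc, True)
    except xml.parsers.expat.ExpatError as e:
        print("rejected:", e)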

> Of course a parser might be sloppy on some of these restrictions due to
> performance considerations. However it should be clear, that it fails to
> be a conforming parser then.

Can you give a specific example of a document that declares an
incorrect encoding in a way that a parser can and should detect?

Regards,
Martin