[XML-SIG] Processing xml files with ISO 8859-1 chars

Morus Walter morus.walter@tanto-xipolis.de
Thu, 8 Nov 2001 10:22:13 +0100


Martin v. Loewis writes:

> > I mean even ASCII contains characters that are not allowed in XML documents
> > (such as 0x00, 0x01...). 
> 
> That doesn't help much. *No* encoding allows the use of those bytes in
> XML (except for UTF-16). So if the only error in the document is that the
> parser uses the wrong encoding, then this aspect won't lead to detection
> of the problem, either.
> 
Sorry, I don't know all encodings. But I don't think there is any
fundamental problem in defining an encoding that uses \x00 for 'a'.

> > The same applies to ISO 8859-x (since they are ASCII based). Apart
> > from that, any byte within [\x00-\x7F\xA0-\xFF] is valid ISO 8859-x,
> > so checking is rather easy and does not require AI.
> 
> How does that help? If the document was declared as iso-8859-1, but
> really is iso-8859-2, we cannot detect that fact. If the document
> really is KOI-8R, we cannot detect that fact. If the document really
> is UTF-8, we cannot detect that fact.
> 
> About the only case that *can* be detected is if the document is declared
> UTF-8 (e.g. by leaving out the xml header), and it isn't.
> 
> > Of course a parser might be sloppy on some of these restrictions due to
> > performance considerations. However it should be clear, that it fails to
> > be a conforming parser then.
> 
> Can you give a specific example of a document that contains an error
> regarding the declaration of an incorrect encoding which can and
> should be detected?
> 
I would speak of an encoding error if the content of an XML text is 
erroneous with respect to the declared encoding.
So
<?xml version="1.0" encoding="iso-8859-1"?>
<bla>\129</bla>
(where \... stands for the byte with the given decimal value)
is incorrect, since \129 is not defined in iso-8859-1.
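
As a rough Python sketch of the kind of purely formal check I mean (just an
illustration, using the byte range quoted above): any byte outside
[\x00-\x7F\xA0-\xFF] can be rejected for an ISO 8859-x document, which
catches the \129 above.

    def check_iso8859(data):
        # offsets of bytes outside [\x00-\x7F\xA0-\xFF], i.e. bytes
        # that ISO 8859-x does not define
        return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

    doc = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<bla>\x81</bla>'
    print(check_iso8859(doc))   # [49] -- the offset of the \129 byte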

Of course you cannot tell that a text which is really iso-latin-1 has been
declared as iso-latin-2, since the two are formally equivalent at the byte
level (and you will get garbage if you convert it to Unicode under the
wrong declaration).
To me encoding checking is a formal check, and if two encodings are formally
equivalent there will be no difference to detect.
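
A small sketch of what I mean by "formally equivalent": the same byte decodes
without error under either declaration, only the resulting character differs
(for example, \xE6 is 'æ' in iso-8859-1 but 'ć' in iso-8859-2).

    data = b"<bla>\xe6</bla>"
    print(data.decode("iso-8859-1"))   # <bla>æ</bla>
    print(data.decode("iso-8859-2"))   # <bla>ć</bla>  -- no error either way
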
If Lars Marius and you are talking about deciding whether the actual encoding
of a text matches the declared encoding, I agree that you would need AI
for that.
But that does not mean that you cannot check anything.

greetings
	Morus