[XML-SIG] Processing xml files with ISO 8859-1 chars

Lars Marius Garshol larsga@garshol.priv.no
13 Nov 2001 23:03:51 +0100


* Lars Marius Garshol
|
| All encodings should be checked for correctness, although not all of
| them can be. Most single-byte encodings (like the ISO 8859-x series)
| have no illegal bit sequences, and so cannot be checked with anything
| short of full-scale AI. Most multi-byte encodings, however, have
| illegal bit sequences and converters can and should check these for
| correctness. This is really no different from or less important than
| verifying syntactical correctness.

* Morus Walter
|
| Doesn't handling non-standard (standard with respect to XML) encodings
| imply conversion to Unicode somehow?

Yes, it certainly does.
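
A minimal sketch of what that conversion looks like, using Python's
built-in codecs purely as an illustration (the helper name to_unicode
is made up for this example). A strict decode doubles as the encoding
check: a multi-byte encoding like UTF-8 can reject a malformed
sequence, while ISO 8859-1 accepts any byte.

```python
def to_unicode(data: bytes, encoding: str) -> str:
    """Convert raw bytes to Unicode text, failing on illegal sequences."""
    # errors="strict" raises UnicodeDecodeError instead of guessing.
    return data.decode(encoding, errors="strict")


# 0xE9 is e-acute in ISO 8859-1, but a truncated lead byte in UTF-8.
raw = b"caf\xe9"

latin1_text = to_unicode(raw, "iso-8859-1")

try:
    to_unicode(raw, "utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
```

Here the UTF-8 decode is rejected while the ISO 8859-1 decode goes
through, whether or not ISO 8859-1 was what the author intended.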

| E.g. in XML, names are further restricted to specific Unicode
| characters...

Yep, but that is a syntactic check, while what we were discussing
above was a check of the correctness of the character encoding.  Those
two things are done separately. The encoding check first, and the
syntactical check only afterwards.
 
| I mean even ASCII contains characters that are not allowed in XML
| documents (such as 0x00, 0x01...).

Certainly, but this is syntactic checking, and not checking of the
encoding. You can't look at the byte 0x02 coming from the input stream
and throw an error because that character is not allowed in XML.
The reason you can't do this is that you don't know what _character_
it is yet; you are looking at a byte, not a character.

If the encoding is VISCII, for example, then all is fine, because this
0x02 byte is really U+1EB2[1], and not the illegal U+0002 control code.
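
A rough Python sketch of why the check has to wait until after
decoding. As far as I know Python ships no VISCII codec, so the
one-entry table VISCII_PARTIAL below is a hypothetical stand-in that
covers only the byte discussed here; decode_byte is likewise a made-up
helper. is_xml_char follows the Char production of XML 1.0.

```python
def is_xml_char(ch: str) -> bool:
    """Phase 2: True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


# Hypothetical, deliberately incomplete VISCII table: only byte 0x02.
VISCII_PARTIAL = {0x02: "\u1eb2"}


def decode_byte(b: int, encoding: str) -> str:
    """Phase 1: turn one byte into a character (single-byte encodings)."""
    if encoding == "viscii":
        return VISCII_PARTIAL[b]  # KeyError for bytes outside the toy table
    return bytes([b]).decode(encoding)


# The same byte, two different characters.
as_viscii = decode_byte(0x02, "viscii")  # U+1EB2
as_ascii = decode_byte(0x02, "ascii")    # U+0002

viscii_ok = is_xml_char(as_viscii)
ascii_ok = is_xml_char(as_ascii)
```

The syntactic check only ever sees characters, so the verdict on byte
0x02 depends entirely on which encoding did the decoding.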

| The same applies to ISO 8859-x (since they are ASCII-based). Apart
| from that, any byte within [\x00-\x7F\xA0-\xFF] is valid ISO 8859-x,
| so checking is rather easy and does not require AI.

Yes, that checking is quite easy, but now you are talking about
checking the encoding, and not syntactical checks.
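
For what it's worth, that byte-level check might be sketched in Python
like this. The names are invented for the example, and the range is
simply the one quoted above; some parts of the 8859 series leave
positions in 0xA0-0xFF unassigned, so a real converter would need a
per-part table.

```python
import re

# Byte ranges quoted above: 0x00-0x7F and 0xA0-0xFF.
ISO_8859_BYTES = re.compile(rb"[\x00-\x7F\xA0-\xFF]*")


def is_iso_8859_bytes(data: bytes) -> bool:
    """Encoding-level check only: every byte lies in the quoted ranges."""
    return ISO_8859_BYTES.fullmatch(data) is not None


in_range = is_iso_8859_bytes(b"caf\xe9")        # 0xE9 is in 0xA0-0xFF
out_of_range = is_iso_8859_bytes(b"bad\x85")    # 0x85 is in the 0x80-0x9F gap
control_byte = is_iso_8859_bytes(b"\x02")       # passes at this level
```

Note that 0x02 passes here. Rejecting it is the job of the later
syntactic check, once the byte has been turned into a character.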

| (There's no requirement that the content makes sense ;-))

No, unfortunately. :-)
 
| Of course, a parser might be sloppy on some of these restrictions due
| to performance considerations. However, it should be clear that it
| then fails to be a conforming parser.

Performance considerations are IMHO not a good enough reason to skip
these checks.

--Lars M.