[XML-SIG] Processing xml files with ISO 8859-1 chars

Morus Walter morus.walter@tanto-xipolis.de
Thu, 8 Nov 2001 09:05:24 +0100


Lars Marius Garshol writes:
>
> * Dan Gunter
> |
> | Of course, checking an _arbitrary_ encoding for correctness seems
> | like a real burden on the parser, but maybe UTF-8 is so common it
> | should be checked.
>
> All encodings should be checked for correctness, although not all of
> them can be. Most single-byte encodings (like the ISO 8859-x series)
> have no illegal bit sequences, and so cannot be checked with anything
> short of full-scale AI. Most multi-byte encodings, however, have
> illegal bit sequences and converters can and should check these for
> correctness. This is really no different from or less important than
> verifying syntactical correctness.
>
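To illustrate the quoted point, here is a minimal Python sketch. It assumes
the converter is Python's built-in codec machinery with strict error
handling; is_well_formed is a hypothetical helper name, not part of any
parser. The same bytes are rejected as UTF-8 but accepted as ISO 8859-1:

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly in `encoding` (hypothetical helper)."""
    try:
        # Strict error handling makes the codec raise on illegal byte sequences.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample = b"M\xfcnchen"  # "München" encoded as ISO 8859-1

print(is_well_formed(sample, "utf-8"))       # False: 0xFC cannot start a UTF-8 sequence
print(is_well_formed(sample, "iso-8859-1"))  # True: every byte maps to a character
```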
Doesn't handling non-standard (standard with respect to XML) encodings
imply conversion to Unicode somehow?
E.g. in XML, names are further restricted to specific Unicode characters...

I mean even ASCII contains characters that are not allowed in XML documents
(such as 0x00, 0x01...). The same applies to ISO 8859-x (since they are
ASCII based). Apart from that, any byte within [\x00-\x7F\xA0-\xFF] is
valid ISO 8859-x, so checking is easy rather than requiring AI.
(There's no requirement that the content makes sense ;-))
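
A rough sketch of such a check in Python, under two assumptions: the byte
range is the one given above, and "allowed in XML" means the Char production
of XML 1.0 restricted to the code points a single ISO 8859-1 byte can reach.
The function name check_latin1_for_xml is made up for illustration:

```python
import re

# Bytes treated as valid ISO 8859-x above: 0x00-0x7F and 0xA0-0xFF.
ISO_8859_BYTES = re.compile(rb"[\x00-\x7F\xA0-\xFF]*")

# XML 1.0 Char production, limited to code points up to U+00FF:
# tab, line feed, carriage return, and everything from U+0020 upward.
XML_CHARS = re.compile(r"[\t\n\r\x20-\xFF]*")


def check_latin1_for_xml(data: bytes) -> str:
    """Decode ISO 8859-1 bytes, rejecting input that XML 1.0 would not allow."""
    if ISO_8859_BYTES.fullmatch(data) is None:
        raise ValueError("byte outside 0x00-0x7F / 0xA0-0xFF")
    text = data.decode("iso-8859-1")  # maps each byte to one code point
    if XML_CHARS.fullmatch(text) is None:
        raise ValueError("character not allowed in an XML 1.0 document")
    return text


print(check_latin1_for_xml(b"M\xfcnchen"))  # München
```

A 0x00 byte passes the first test (it is in the ISO 8859-x range above) but
fails the second, which is the ASCII case mentioned above.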

Of course a parser might be sloppy on some of these restrictions due to
performance considerations. However, it should be clear that it then fails
to be a conforming parser.

greetings
	Morus
-- 
Th. Morus Walter · Manager Content & Data Development
xipolis.net GmbH & Co. KG · Schellingstrasse 35 · 80799 München