[XML-SIG] Processing xml files with ISO 8859-1 chars

Morus Walter morus.walter@tanto-xipolis.de
Wed, 7 Nov 2001 17:22:24 +0100


Dan Gunter writes:
> The simple answer is that the XML parser is illiterate. Since there are
> no bit patterns that are illegal in UTF-8, I don't see how the parser
> could know that the chosen encoding produced, from the user's
> perspective, garbage. The pretty-printer, on the other hand, knows the
> difference between printable and non-printable characters and can thus
> complain.
>
Sorry. This is wrong.
There are a lot of byte combinations that can never occur
in UTF-8. E.g. there can never be a single 8-bit character between 7-bit
characters ([\x20-\x7F][\x80-\xFF][\x20-\x7F]).
So the parser could check whether the byte stream forms valid UTF-8.
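
A minimal sketch of such a check, in Python rather than in any particular
XML parser: it relies only on the strict UTF-8 decoder of the standard
library, and the helper name is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


# The same word, once as ISO 8859-1 bytes and once as UTF-8 bytes.
latin1_bytes = "M\u00fcnchen".encode("iso-8859-1")  # lone 0xFC between ASCII bytes
utf8_bytes = "M\u00fcnchen".encode("utf-8")         # 0xC3 0xBC between ASCII bytes

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

The ISO 8859-1 bytes are rejected because they contain exactly the pattern
described above: a single 8-bit byte surrounded by 7-bit bytes.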

greetings
	Morus

-- 
Th. Morus Walter · Manager Content & Data Development
xipolis.net GmbH & Co. KG · Schellingstrasse 35 · 80799 München