[XML-SIG] Processing xml files with ISO 8859-1 chars

Dan Gunter dkgunter@lbl.gov
Wed, 07 Nov 2001 07:02:35 -0800


The simple answer is that the XML parser is illiterate. Since there are
no bit patterns that are illegal in UTF-8, I don't see how the parser
could know that the chosen encoding produced, from the user's
perspective, garbage. The pretty-printer, on the other hand, knows the
difference between printable and non-printable characters and can thus
complain.
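
For reference, the workaround Tom describes below (declaring the real
encoding in the XML declaration) can be sketched roughly as follows. This
is only an illustration under assumptions: it uses the standard library's
xml.dom.minidom rather than the PyXML Sax2 reader from the original
script, and it builds a shortened sample document inline instead of
reading pau.xml.

```python
from xml.dom import minidom

# Sample content encoded the way it would sit on disk in ISO-8859-1
# (the c-cedilla becomes the single byte 0xE7).
body = "<note><assunto>Houve mudan\u00e7as nos pre\u00e7os?</assunto></note>".encode("iso-8859-1")

# Prepend a declaration naming the real encoding, so the parser does
# not fall back to its UTF-8 default.
declared = b"<?xml version='1.0' encoding='iso-8859-1'?>\n" + body

doc = minidom.parseString(declared)
text = doc.getElementsByTagName("assunto")[0].firstChild.data
print(text)
```

With the declaration in place the accented characters should come back
as ordinary Unicode text, which matches what Tom reports.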

Dan

Thomas B. Passin wrote:

> It seems that this xml file should have caused an exception, since it is not
> well-formed:  the actual encoding does not match the presumed encoding
> (namely, utf-8).  The fact that the parse partially succeeded is disturbing.
>
> I tried this example myself.  I am running pyxml 6.6 on Windows2000.  I did
> get an exception, but it was from the pretty-printer, not the parser.
> Adding an xml declaration declaring the actual iso-8859-1 encoding did in
> fact allow the program to complete properly, as expected.
>
> Why didn't the parser complain?
>
> Cheers,
>
> Tom P
>
>
> [Rodrigo Senra]
>
>
>>  I don't know if I stepped in a bug or it is just my newbieness ;o)
>>  Trying to parse the file:
>>
>>------------------- pau.xml -------------------
>><note>
>>  <assunto>
>>   This line is ok.
>>   This line has characters  ISO-8859-1 with accents: Houve mudanças nos preços?
>>   Linha ok.
>>  </assunto>
>></note>
>>------------------ end of file pau.xml --------
>>
>>with the script:
>>
>>------------------ file teste.py ----------------------------
>>from xml.dom.ext.reader import Sax2
>>from xml.dom.ext import PrettyPrint
>>
>>doc = Sax2.FromXmlStream(open('pau.xml'))
>>PrettyPrint(doc,encoding='iso-8859-1')
>>-------------------- end of teste.py script ------------
>>
>>produces:
>>
>>----------- stdout trace -------------
>><?xml version='1.0' encoding='iso-8859-1'?>
>><!DOCTYPE note>
>><note>
>>   <assunto>
>>   This line is ok.
>>
>>   Linha ok.
>>  </assunto>
>></note>
>>----------- end of trace -------------
>>
>>Am I doing something obviously wrong? Should I try another parser?
>>
>
>
>
> _______________________________________________
> XML-SIG maillist  -  XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig
>