[XML-SIG] unicode, latin-1 and DOM...

Uche Ogbuji uche.ogbuji@fourthought.com
Thu, 28 Jun 2001 07:19:00 -0600


> Hello everyone,
>
> I'm struggling with unicode and stuff (so expect some mails in the coming
> days). Here's the first one. I'm aware that the XML document being parsed
> is not correct (no encoding header), but I'm surprised by the result I get:
>
> >>> from xml.dom.ext.reader import Sax2
> >>> d = Sax2.FromXml('<d>été</d>')
> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(d)
> <?xml version='1.0' encoding='UTF-8'?>
> <!DOCTYPE d>
> <d/>
> >>> d.documentElement
> <Element Node at 81b14c4: Name='d' with 0 attributes and 0 children>
>
> I'm using Python 2.1 and the CVS version of PyXML with 4Suite 0.11.1b2.
>
> I would have expected a parse error when the latin-1 characters were
> encountered, and not a silent failure to create the Text node.

The parser is probably blowing up, and 4DOM's improperly masking the error.

Or maybe not.  pDomlette shows the same problem:

>>> from Ft.Lib.pDomlette import PyExpatReader
>>> reader = PyExpatReader()
>>> doc = reader.fromString('<d>été</d>')
>>> doc.documentElement
<Domlette Element Node at 81e4c64: name='d' with 0 attributes and 0 children>
>>>

I'll have a quick look.
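
For comparison, here is a minimal sketch of the same test against an expat-based parser with no DOM layer of ours in between. It assumes a current Python and the standard library's xml.dom.minidom, not the 4DOM or pDomlette readers shown above, so it only illustrates what expat is expected to do with undeclared Latin-1 bytes; it says nothing about where the stack above loses the error. The try_parse helper is a hypothetical name introduced for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Latin-1 bytes for "été", with no encoding declaration,
# so the parser falls back to treating the input as UTF-8.
raw = b"<d>\xe9t\xe9</d>"


def try_parse(data):
    """Return the text content of <d>, or the parser's error message."""
    try:
        doc = minidom.parseString(data)
    except ExpatError as exc:
        # 0xE9 followed by "t" is not a valid UTF-8 sequence,
        # so the parser should reject the document here.
        return "parse error: %s" % exc
    return doc.documentElement.firstChild.data


# Same bytes, once undeclared and once with an explicit declaration.
undeclared = try_parse(raw)
declared = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + raw)

print(undeclared)
print(declared)
```

With the declaration in place the Text node is created as expected; without it, the parser reports an error instead of silently producing an empty element.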

Note: you shouldn't be using the deprecated Sax2 "From*" functions.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management