[XML-SIG] I am stuck: 4DOM / utf-8

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 7 Aug 2001 07:35:23 +0200


> I converted an XML file encoded in utf-8 into a DOM structure (PyXML-0.6.5).

How exactly did you do that? Did you use one of the PyXML parsers? If
so, which one?

> Then I try to split the document into smaller subparts and store the
> parts into a database, and display them with tkinter.

I suppose you use DOM manipulation functions for that? Did you pass
strings to those functions, or Unicode objects? If strings, did they
contain non-ASCII characters? If so, that would explain the problem: you
must always put Unicode objects into DOM trees; the only exception is
plain ASCII strings (i.e. no accented or other non-ASCII characters).
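
As a minimal sketch of that rule, assuming the standard library's
xml.dom.minidom as a stand-in for 4DOM (the helper name make_doc and the
element name "part" are made up for illustration), a UTF-8 byte string
would be decoded before it is handed to the DOM:

```python
from xml.dom import minidom

def make_doc(raw_bytes):
    # Hypothetical helper: decode the UTF-8 byte string first, so the
    # DOM tree only ever holds Unicode text, never encoded bytes.
    text = raw_bytes.decode("utf-8")
    doc = minidom.Document()
    root = doc.createElement("part")
    root.appendChild(doc.createTextNode(text))
    doc.appendChild(root)
    return doc

# "caf\xc3\xa9" is the UTF-8 encoding of "cafe" with an accented e
doc = make_doc(b"caf\xc3\xa9")
```

The same idea should carry over to 4DOM: decode at the point where the
data enters your program, not after it is already in the tree.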

> For extraction of XML from the DOM, the best function I found was
> PrettyPrint, which unfortunately does not support direct assigning to
> a string. So I followed the examples utilizing the StringIO
> library. However, every time I try to access the stream, I get an
> error (see below).  What should I do?

If you cannot figure out the problem, it would be helpful if you'd
change your code to

        stream = StringIO()
        ext.PrettyPrint(value, stream=stream)
        print repr(stream.buflist)
        stream.seek(0)
        text = load(stream)
        stream.close()

and post the contents of the buflist, together with the XML file that
was the input. If it is a large file, or if you don't want to post it
to the general public, I'd appreciate a private message.

I'm very much surprised that Unicode objects end up in the StringIO,
but that may be a bug in PyXML - in principle, everything ought to be
UTF-8 encoded by PrettyPrint.
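
For comparison, here is a sketch of the expected behaviour using
xml.dom.minidom (again only a stand-in, not PrettyPrint itself; the
sample document is made up): the serializer yields a UTF-8 encoded byte
string, and you decode it once at the point where you need text, e.g.
for display.

```python
from xml.dom import minidom

# Parse a small UTF-8 encoded document (illustrative input only)
doc = minidom.parseString(u"<part>caf\xe9</part>".encode("utf-8"))

# With an explicit encoding, toprettyxml returns an encoded byte string
encoded = doc.toprettyxml(indent="  ", encoding="utf-8")

# Decode exactly once, at the boundary where Unicode text is needed
text = encoded.decode("utf-8")
```

If the output of the serializer is a mix of byte strings and Unicode
objects instead, joining the pieces is where the error would show up.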

Regards,
Martin