Problem with processing XML

Paul Boddie paul at boddie.org.uk
Tue Jan 22 11:48:40 EST 2008


On 22 Jan, 15:11, John Carlyle-Clarke <j... at nowhere.org> wrote:
>
> I wrote some code that works on my Linux box using xml.dom.minidom, but
> it will not run on the windows box that I really need it on.  Python
> 2.5.1 on both.
>
> On the windows machine, it's a clean install of the Python .msi from
> python.org.  The linux box is Ubuntu 7.10, which has some Python XML
> packages installed which can't easily be removed (namely python-libxml2
> and python-xml).

I don't think you're straying into libxml2 or PyXML territory here...

> I have boiled the code down to its simplest form which shows the problem:-
>
> import xml.dom.minidom
> import sys
>
> input_file = sys.argv[1]
> output_file = sys.argv[2]
>
> doc = xml.dom.minidom.parse(input_file)
> file = open(output_file, "w")

On Windows, shouldn't this be the following...?

  file = open(output_file, "wb")
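A minimal sketch of the binary-mode idea, assuming a current Python where
toxml() returns bytes when given an explicit encoding. The function name
copy_xml and the choice of UTF-8 are illustrative assumptions, not part of
the original script, and this only sidesteps newline translation on
Windows; it would not by itself cure the AttributeError below:

```python
import xml.dom.minidom


def copy_xml(input_file, output_file, encoding="utf-8"):
    """Parse input_file with minidom and write it back out as bytes.

    copy_xml is a hypothetical helper name used for illustration only.
    """
    doc = xml.dom.minidom.parse(input_file)
    # With an explicit encoding, toxml() produces an encoded byte string,
    # which is what a file opened in binary mode expects.
    data = doc.toxml(encoding=encoding)
    with open(output_file, "wb") as out:
        out.write(data)
    return data
```

Opening the file with "wb" means no platform-specific newline conversion
is applied to what minidom serialises.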

> doc.writexml(file)
>
> The error is:-
>
> $ python test2.py input2.xml output.xml
> Traceback (most recent call last):
>    File "test2.py", line 9, in <module>
>      doc.writexml(file)
>    File "c:\Python25\lib\xml\dom\minidom.py", line 1744, in writexml
>      node.writexml(writer, indent, addindent, newl)
>    File "c:\Python25\lib\xml\dom\minidom.py", line 814, in writexml
>      node.writexml(writer,indent+addindent,addindent,newl)
>    File "c:\Python25\lib\xml\dom\minidom.py", line 809, in writexml
>      _write_data(writer, attrs[a_name].value)
>    File "c:\Python25\lib\xml\dom\minidom.py", line 299, in _write_data
>      data = data.replace("&", "&amp;").replace("<", "&lt;")
> AttributeError: 'NoneType' object has no attribute 'replace'
>
> As I said, this code runs fine on the Ubuntu box.  If I could work out
> why the code runs on this box, that would help because then I can set
> up the windows box the same way.

If I encountered the same issue, I'd have to inspect the goings-on
inside minidom, possibly using judicious trace statements in the
minidom.py file. Either way, the traceback suggests that an attribute
node is producing a value of None rather than a character string.
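
As a less invasive alternative to editing minidom.py, a sketch along these
lines could walk the parsed document and report any attribute whose value
is None before writexml is ever called. The helper name
find_none_attributes is made up for illustration; it relies only on the
documented nodeName, attributes and childNodes members, and it is untested
against the input file in question:

```python
import xml.dom.minidom


def find_none_attributes(node, path=""):
    """Return (element path, attribute name) pairs whose value is None.

    find_none_attributes is a hypothetical diagnostic helper, not part
    of minidom itself.
    """
    found = []
    here = path + "/" + node.nodeName
    attrs = node.attributes  # None for anything that is not an element
    if attrs is not None:
        for name, value in attrs.items():
            if value is None:
                found.append((here, name))
    # Recurse into child nodes; text and comment nodes have no children.
    for child in node.childNodes:
        found.extend(find_none_attributes(child, here))
    return found
```

Calling it on xml.dom.minidom.parse(input_file).documentElement on both
machines and comparing the results might show which attribute in the
xsd:schema block is responsible.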

> The input file contains an <xsd:schema> block which is what actually
> causes the problem.  If you remove that node and subnodes, it works
> fine.  For a while at least, you can view the input file at
> http://rafb.net/p/5R1JlW12.html

The horror! ;-)

> Someone suggested that I should try xml.etree.ElementTree, however
> writing the same type of simple code to import and then write the file
> mangles the xsd:schema stuff because ElementTree does not understand
> namespaces.

I'll leave this to others: I don't use ElementTree.

> By the way, is pyxml a live project or not?  Should it still be used?
> It's odd that if you go to http://www.python.org/ and click the link
> "Using python for..." XML, it leads you to http://pyxml.sourceforge.net/topics/
>
> If you then follow the download links to
> http://sourceforge.net/project/showfiles.php?group_id=6473 you see that
> the latest file is 2004, and there are no versions for newer pythons.
> It also says "PyXML is no longer maintained".  Shouldn't the link be
> removed from python.org?

The XML situation in Python's standard library is controversial and
can be summarised, probably inaccurately, by the following chronology:

 1. XML is born, various efforts start up (see the qp_xml and xmllib
    modules).
 2. Various people organise themselves, contributing software to the
    PyXML project (4Suite, xmlproc).
 3. The XML backlash begins: we should all apparently be using stuff
    like YAML (but don't worry if you haven't heard of it).
 4. ElementTree is released, people tell you that you shouldn't be
    using SAX or DOM any more, "pull" parsers are all the rage
    (although proponents overlook the presence of xml.dom.pulldom in
    the Python standard library).
 5. ElementTree enters the standard library as xml.etree; PyXML falls
    into apparent disuse (see remarks about SAX and DOM above).

I think I looked seriously at wrapping libxml2 (with libxml2dom [1])
when I experienced issues with both PyXML and 4Suite when used
together with mod_python, since each project used its own Expat
libraries and the resulting mis-linked software produced very bizarre
results. Moreover, only cDomlette from 4Suite seemed remotely fast,
and yet did not seem to be an adequate replacement for the usual PyXML
functionality.

People will, of course, tell you that you shouldn't use a DOM for
anything and that the "consensus" is to use ElementTree or lxml (see
above), but I can't help feeling that this has a damaging effect on
the XML situation for Python: some newcomers would actually benefit
from the traditional APIs, may already be familiar with them from
other contexts, and may consider Python lacking if the support for
them is in apparent decay. It requires a degree of motivation to
actually attempt to maintain software providing such APIs (which was
my solution to the problem), but if someone isn't totally bound to
Python then they might easily start looking at other languages and
tools in order to get the job done.

Meanwhile, here are some resources:

http://wiki.python.org/moin/PythonXml

Paul

[1] http://www.python.org/pypi/libxml2dom


