[XML-SIG] Minidom bugs/questions

Sun, 4 Feb 2001 00:23:15 +0100

> I'm making my first steps into XML, so please forgive me.  

Hi Guido,

I will forgive, but I will still comment :-)

> Converting my app to use minidom was easy enough, but I found out about
> a bunch of differences between the two DOM implementations.  Some
> of these are fine with me (e.g. minidom doesn't preserve comments,
> doesn't prefix its output with "<?xml version="1.0" ?>" when writing
> XML output, minidom returns Unicode strings even for ASCII input).

Actually, input in XML is always Unicode. If no encoding is specified
in the document, it is treated as UTF-8. If an encoding is specified,
DOM implementations shall transform it into Unicode before giving it
to the user.
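
To illustrate, here is a minimal sketch, assuming a Python with Unicode
strings and the xml.dom.minidom that ships with it. The sample documents
are made up for the example; the point is only that the same text comes
back regardless of the byte encoding used on input.

```python
from xml.dom.minidom import parseString

# The same document twice: once declared as ISO-8859-1, once with no
# declaration at all (and therefore treated as UTF-8).
latin1_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><name>caf\xe9</name>'
utf8_bytes = "<name>caf\u00e9</name>".encode("utf-8")

text_from_latin1 = parseString(latin1_bytes).documentElement.firstChild.data
text_from_utf8 = parseString(utf8_bytes).documentElement.firstChild.data

# Both are Unicode strings with identical content.
print(text_from_latin1 == text_from_utf8)
```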

It is only that older Python versions did not support Unicode; I guess
that's the reason why the Zope one does not comply here.

> 1. The other DOM has a hasAttributes() predicate; minidom is missing
>    this and I have to use the more expensive form "if node.attributes".

Right; that's a bug in minidom: hasAttributes was introduced in "DOM
Level 2".

The original idea of minidom was that it should be "minimal"; clearly
that has not worked out, so we probably should review it carefully to
achieve completeness (with respect to "DOM 2 Core").
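
For reference, a small sketch of the DOM Level 2 predicate, assuming a
minidom in which hasAttributes() has been added (current versions have
it). The element names are invented for the example.

```python
from xml.dom.minidom import parseString

doc = parseString('<root><a id="1"/><b/></root>')
with_attrs, without_attrs = doc.documentElement.childNodes

# hasAttributes() answers the question directly, without the caller
# having to inspect the attribute map.
has_a = with_attrs.hasAttributes()
has_b = without_attrs.hasAttributes()
print(has_a, has_b)
```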

> 2. In minidom, Element.getAttribute() and .getAttributeNS() raise
>    KeyError for a non-existing attribute; in the other DOM, they return
>    "".  (Personally, I'd prefer KeyError or perhaps None, but according
>    to Fred, the DOM standard requires "".)

Right. To get the KeyError, use .attributes['attrname'], which is a
Python extension to the DOM.
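
A short sketch of the two access styles, assuming a minidom in which
getAttribute() follows the DOM standard. The element and attribute names
are only examples.

```python
from xml.dom.minidom import parseString

elem = parseString('<item id="42"/>').documentElement

present = elem.getAttribute("id")      # value of an existing attribute
missing = elem.getAttribute("color")   # DOM behaviour: empty string

# The mapping interface on .attributes is the Python extension; it
# behaves like a dictionary and raises KeyError for unknown names.
try:
    elem.attributes["color"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(repr(present), repr(missing), raised_key_error)
```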

> 3. Note that getAttributeNode() correctly returns None if the attribute
>    doesn't exist, but getAttributeNodeNS() looks like it will raise
>    KeyError too!

Yes, that's yet another error.
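
The intended behaviour can be sketched as follows, assuming a minidom
in which getAttributeNodeNS() has been corrected. The namespace URI
"urn:example" is made up for the example.

```python
from xml.dom.minidom import parseString

elem = parseString('<item id="42"/>').documentElement

# Both lookups should signal "no such attribute" by returning None
# rather than raising an exception.
plain_node = elem.getAttributeNode("color")
ns_node = elem.getAttributeNodeNS("urn:example", "color")

print(plain_node, ns_node)
```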

> 4. In minidom, createDocument() leaves doc.documentElement set to None;
>    in the other DOM, doc.documentElement is initialized to an Element
>    node created from the second argument to createDocument().  (Again,
>    according to Fred, the DOM standard requires the latter.)

That was a surprise to me. After reading the spec and a number of
implementations, I think the requirement is much stronger: you MUST
pass a qualifiedName; only the namespaceURI and the doctype are
optional.

So your patch is incomplete in this respect; you also need to correct
pulldom to pass meaningful content (with your patch, you could get two
document elements).

It appears to be a common trick to allow null in createDocument, so
that the first element found during parsing can be introduced with
appendChild, but that appears to be non-conforming (somebody please
correct me if I am wrong).

I could try to come up with a separate patch for that issue.
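
As a sketch of the conforming usage discussed above, assuming a minidom
that creates the document element from the qualifiedName argument: the
caller passes the name up front and then works with
doc.documentElement, instead of appending a root element afterwards.
The names "inventory" and "item" are invented for the example.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

# namespaceURI and doctype are None; only the qualifiedName is given.
doc = impl.createDocument(None, "inventory", None)

# The document element already exists, so children go under it.
root = doc.documentElement
root.appendChild(doc.createElement("item"))

xml_text = doc.toxml()
print(xml_text)
```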

> 5. When writing XML output from a DOM tree that uses namespace
>    attributes, minidom doesn't insert the proper "xmlns:<tag>=<URI>"
>    attributes.  The other DOM gets this right.  (This is a bit tricky
>    to do, although I've figured a good way to do it which I'll gladly
>    donate to minidom if it's deemed useful.)

Yes, that is certainly desirable; minidom should support namespaces
fully.
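
Until the writer takes care of this itself, one workaround is to add
the declaration by hand as an ordinary attribute in the xmlns
namespace. This is only a sketch of that workaround, not the approach
offered in the quoted message; the prefix "bk" and the URI are made up.

```python
from xml.dom import XMLNS_NAMESPACE
from xml.dom.minidom import getDOMImplementation

NS = "urn:example:books"  # hypothetical namespace URI

doc = getDOMImplementation().createDocument(NS, "bk:book", None)
root = doc.documentElement

# Declare the prefix explicitly so that the serialized form carries
# the xmlns:bk attribute.
root.setAttributeNS(XMLNS_NAMESPACE, "xmlns:bk", NS)

xml_text = doc.toxml()
print(xml_text)
```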

> 
> 6. When writing XML output from a DOM tree that has a default
>    namespace, minidom writes <:tag>...</:tag> instead of
>    <tag>...</tag> like the other DOM, and like I would have expected.

Certainly a bug. When writing out namespace declarations, dealing with
the default namespace is really tricky (e.g. when a tree that had
a default namespace is extended with an element with no namespace).
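
The expected round trip can be sketched like this, assuming a minidom
and parser in which the bug is fixed. The namespace URI is made up for
the example.

```python
from xml.dom.minidom import parseString

doc = parseString('<tag xmlns="urn:example:default"><child/></tag>')
root = doc.documentElement

# An element in the default namespace has a namespace URI but no
# prefix, so its tag must be written without a leading colon.
xml_text = doc.toxml()
print(root.namespaceURI, root.tagName)
print(xml_text)
```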

> 7. I noticed that minidom's __getattr__ special-cases requests for an
>    attribute whose name begins with _get_, and makes up a lambda on the
>    fly.  This suggests that the caller is asking for _get_foo() where
>    there is no such method, but there is a foo attribute.  Since
>    _get_foo() is a detail of the implementation (I hope)

No, it's actually not. The DOM is defined in terms of CORBA IDL,
unfortunately with a massive use of attributes. Attributes, in CORBA,
map to two functions, _get_<attr> and _set_<attr>; this is also how
the IDL language mapping for Python works.

So the canonical way of using DOM in Python would be to use the _get_
and _set_ methods; a number of Python DOM implementations support that
- although the now-official Python DOM mapping marks these methods as
optional.

Some people might be using this interface, e.g. when they access a DOM
both locally and remotely. Some may use it because they consider
accessor functions cleaner than attribute access. Since it does not
cost anything to have that feature, I'd leave it.
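
To make the idea concrete, here is a standalone sketch of how such
accessors can be synthesized on demand. It is not minidom's actual
code; the class SketchNode and its single attribute are hypothetical
and only illustrate the technique.

```python
class SketchNode:
    """Hypothetical node with IDL-style _get_/_set_ accessors."""

    def __init__(self, nodeName):
        self.nodeName = nodeName

    def __getattr__(self, name):
        # Called only when normal attribute lookup has failed, so
        # ordinary attribute access pays no extra cost.
        if name.startswith("_get_"):
            attr = name[len("_get_"):]
            return lambda: getattr(self, attr)
        if name.startswith("_set_"):
            attr = name[len("_set_"):]
            return lambda value: setattr(self, attr, value)
        raise AttributeError(name)


node = SketchNode("para")
before = node._get_nodeName()   # accessor made up on the fly
node._set_nodeName("section")
after = node.nodeName           # plain attribute access still works
print(before, after)
```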

> Here are proposed patches for items 1, 2, 3, 4 and 6 above (fixing 6
> turns out to require a patch to pulldom.py).  

The ones for 1, 2, 3 and 6 look fine; for the one for 4, see my comments
above.

> 7 is a trivial patch but I expect there's a reason (in which case a
> comment would be a nice idea :-).

It is elaborated at

http://python.sourceforge.net/devel-docs/lib/dom-accessor-methods.html

So referring the reader to the documentation may be appropriate.

Regards,
Martin