How to search HUGE XML with DOM?

Paul Boddie paul at boddie.org.uk
Fri Mar 31 08:50:31 EST 2006


Diez B. Roggisch wrote:
> > the xml.dom.minidom object is too slow when parsing such a big XML file
> > to a DOM object, while pulldom would spend quite a long time going
> > through the whole database file. How to enhance the searching speed?
> > Are there existing solutions or algorithms? Thank you for your
> > suggestion...
>
> I've told you that before, and I tell you again: RDBMS is the way to go.

We've lost some context from the original post that may be relevant
here, but if populating what the original questioner calls "the
database" is an infrequent operation, then an RDBMS probably is the way
to go. On the other hand, if the XML has to be re-parsed (and the
tables re-populated) before each search, the overhead of all those SQL
inserts wouldn't be particularly desirable.
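
For illustration, the population step might look something like this
(a rough, untested sketch - the flat <record id="..."> structure and
the file names are just assumptions about the questioner's data):

    import sqlite3
    import xml.etree.ElementTree as ElementTree  # cElementTree, back then

    connection = sqlite3.connect("records.db")  # hypothetical database
    connection.execute(
        "create table if not exists record (id text primary key, content text)")

    rows = []
    for event, element in ElementTree.iterparse("huge.xml"):
        if element.tag == "record":
            rows.append((element.get("id"), element.text))
            element.clear()  # discard each subtree to keep memory usage flat

    connection.executemany("insert into record values (?, ?)", rows)
    connection.commit()

Doing the inserts in a single transaction with executemany keeps the
per-insert overhead down, but the cost is still there if it has to be
paid before every search.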

> There might be XML-parsers that work faster - I suppose cElementTree can
> gain you some speed - but ultimately the problems are inherent in the
> representation as DOM: no type-information, no indices, no nothing. Just a
> huge pile of nodes in memory.

Well, I would hope that W3C DOM operations like getElementById would be
supported by some index in the implementation: that would make some of
the searches mentioned by the questioner fairly rapid, given enough
memory.
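
Worth noting, though, that minidom's getElementById only works for
attributes declared as type ID in a DTD; otherwise it returns None.
Failing that, one can build such an index by hand in a single pass
over the tree (a sketch, assuming a plain "id" attribute - again an
assumption about the questioner's document):

    from xml.dom.minidom import parse

    document = parse("huge.xml")  # hypothetical file name
    index = {}
    for node in document.getElementsByTagName("*"):
        identifier = node.getAttribute("id")
        if identifier:
            index[identifier] = node

    # Each subsequent search is then a dictionary lookup:
    node = index.get("some-id")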

> So all searches are linear in the number of nodes. Of course you might be
> able to create indices yourself, even devise a clever scheme to make using
> them as declarative as possible. But that would in the end mean nothing but
> re-creating RDBMS technology - why do that, if it's already there?

I agree that careful use of RDBMS technology would solve the general
problem of searching large amounts of data, but the specific queries
stated only need an index of some kind to be fairly quick.
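
To make that concrete: with the hypothetical record table sketched
above, each stated lookup becomes an indexed select rather than a scan
of the whole file (again untested, and the column names are
assumptions):

    import sqlite3

    connection = sqlite3.connect("records.db")
    connection.execute(
        "create index if not exists record_content on record (content)")
    for identifier, in connection.execute(
            "select id from record where content = ?", ("needle",)):
        print(identifier)

The "id" column is already indexed by virtue of being the primary key,
so lookups by identifier need no extra work.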

Paul
