[XML-SIG] How do I write an efficient parser?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sun, 23 Sep 2001 11:57:07 +0200


> My project has been using the SAX parser under 1.5.2 and 2.X. The XML
> file contains genealogy information. When I have about 2000 people in
> the database, the expat parser reads it in a reasonable amount of time -
> a few seconds, not too long for the user.
> 
> However, when I start reaching the 6000-7000 entries, it can take up to
> a minute or longer. At 50000, it takes several minutes, which is just
> unacceptable.

This suggests linear complexity, right? That is how it should be:
parsers have linear complexity by nature.

> Is there a way to write a more efficient parser without having to resort
> to C?

You can't do better than linear complexity. However, you can try to
speed up the processing of each event. I don't know your application;
most likely you can gain speed by tweaking it. I recommend running the
application under a profiler to find out what it really is that
consumes the time.
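The profiling step might be sketched like this; the handler class,
element names, and document here are made up for illustration, and
cProfile stands in for whichever profiler your Python version ships:

```python
# Profile a SAX parse to find the hot spots in the handler code.
import cProfile
import xml.sax

class CountingHandler(xml.sax.ContentHandler):
    """Toy handler that just counts start-element events."""
    def __init__(self):
        self.count = 0
    def startElement(self, name, attrs):
        self.count += 1

# Synthetic document: 1000 <person/> elements under one root.
data = b"<tree>" + b"<person/>" * 1000 + b"</tree>"

handler = CountingHandler()
profiler = cProfile.Profile()
profiler.enable()
xml.sax.parseString(data, handler)
profiler.disable()

# Most expensive calls (by cumulative time) first.
profiler.print_stats("cumulative")
print(handler.count)  # 1001: the root plus 1000 person elements
```

In the real application you would parse the genealogy file itself and
look for which handler methods dominate the cumulative time.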

If you want to reduce the processing done in the XML libraries, you
can use the raw expat API instead of SAX. That saves you the
indirection through expatreader, and it spares you reports of events
your handlers won't act on.
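A minimal sketch of driving expat directly: only the handlers you
assign are invoked, so events you ignore cost almost nothing. The
`person` element and `id` attribute are hypothetical stand-ins for
whatever your genealogy format uses:

```python
# Use the raw expat API, bypassing the SAX layer entirely.
import xml.parsers.expat

people = []

def start_element(name, attrs):
    # Collect only the data we care about; everything else is skipped.
    if name == "person":
        people.append(attrs.get("id"))

parser = xml.parsers.expat.ParserCreate()
# Assign only this one handler; no end-element or character-data
# callbacks means no Python-level work for those events.
parser.StartElementHandler = start_element

parser.Parse('<tree><person id="I1"/><person id="I2"/></tree>', True)
print(people)  # ['I1', 'I2']
```

In practice you would feed the file in chunks via repeated
`parser.Parse(chunk, False)` calls rather than one string.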

If that is still too slow, you can try using sgmlop instead of
expat. sgmlop has a number of limitations, but they may not matter
for your application.

BTW, which Python version did you use for the timings?

Regards,
Martin