XML parsing per record

Thu Apr 21 09:40:27 EDT 2005

Don't assume that just because you have a 2.4G XML file, you have
2.4G of data.  Looking at these verbose tags, plus the fact that the
XML is pretty-printed (all those leading spaces - not even tabs! - add
up), I'm guessing you only have about 5-10% actual data, and the rest
is just XML tagging/untagging and spaces.  (For example, 373 characters
used to represent a date/time - this is a sin!)
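
If you want to check that guess rather than trust it, here is a rough
sketch that measures how much of a chunk of XML is element text as
opposed to tags and whitespace.  The function name and the sample
record are made up for illustration (only the tag names come from this
thread), and the regex treats anything between '<' and '>' as markup,
so take the result as an estimate only.

```python
import re

def payload_fraction(xml_text):
    """Rough share of characters that are element text, not markup/whitespace."""
    # Strip anything that looks like a tag, then strip all whitespace.
    # Note: this also drops spaces inside real data, so it slightly
    # understates the payload.
    without_tags = re.sub(r"<[^>]*>", "", xml_text)
    payload = re.sub(r"\s+", "", without_tags)
    return len(payload) / len(xml_text) if xml_text else 0.0

# Hypothetical pretty-printed record, not taken from the real file.
sample = """<gene-source>
    <gene-source_src>
        <object-id_id>12345</object-id_id>
    </gene-source_src>
</gene-source>
"""

print(f"{payload_fraction(sample):.1%} of the sample is data")
```

On the real file you would run this over the first few megabytes
rather than reading all 2.4G into memory.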

As XML goes, this looks dead easy to parse without a real XML parser.
It looks like all of your leaf nodes open and close on the same line,
which would make them easy to extract with regexps or pyparsing.
Especially since you mention "I only need some of the information", you
don't even have to build a full document tree representation.  SAX
parsers would also be good, since you could trigger only on the
matching subset of tags that you are really interested in.  Lastly, you
could even try a pyparsing approach.  I usually don't recommend
pyparsing for XML since there are already many good XML-targeted tools
out there, but it is very easy to throw together something in pyparsing
that extracts, say, all of the <object-id_id> entries, or all of the
<gene-source> structures.  What is the subset of information you are
looking to extract?
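
To make the regexp and SAX suggestions concrete, here is a minimal
sketch of both, using only the standard library.  It assumes each
<object-id_id> leaf opens and closes on one line (needed for the regex
version only) and that the file has a single root element (needed for
SAX).  The sample data, the <records> wrapper, and all function and
class names are invented for illustration.

```python
import io
import re
import xml.sax

# Hypothetical sample; the real file's layout may differ.
SAMPLE = """<records>
  <gene-source>
    <object-id_id>12345</object-id_id>
    <other-tag>ignored</other-tag>
  </gene-source>
  <gene-source>
    <object-id_id>67890</object-id_id>
  </gene-source>
</records>
"""

WANTED = "object-id_id"
LEAF = re.compile(r"<%s>(.*?)</%s>" % (re.escape(WANTED), re.escape(WANTED)))

def ids_by_regex(lines):
    """Yield the text of each wanted leaf, scanning one line at a time."""
    for line in lines:
        match = LEAF.search(line)
        if match:
            yield match.group(1)

class IdHandler(xml.sax.ContentHandler):
    """Collect text only for the wanted tag; every other tag is skipped."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self._buf = None  # a list while inside a wanted element, else None

    def startElement(self, name, attrs):
        if name == WANTED:
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == WANTED:
            self.ids.append("".join(self._buf))
            self._buf = None

def ids_by_sax(binary_stream):
    """Parse a binary file-like object and return the collected ids."""
    handler = IdHandler()
    xml.sax.parse(binary_stream, handler)
    return handler.ids

print(list(ids_by_regex(SAMPLE.splitlines())))
print(ids_by_sax(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Both versions read the input as a stream, so neither one builds a
document tree in memory.  For the real file you would pass an open
file object (text mode for the regex version, binary mode for SAX)
instead of the in-memory sample.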

-- Paul