Trying to parse a HUGE(1gb) xml file

Tim Harig usernet at ilthio.net
Mon Dec 20 16:37:48 EST 2010


On 2010-12-20, spaceman-spiff <ashish.makani at gmail.com> wrote:
> 0. Goal: I am looking for a specific element; there are several 10s/100s of
> occurrences of that element in the 1GB XML file.  The contents of the XML
> are just a dump of config parameters from a packet switch (although imho,
> the contents of the XML don't matter).

Then you need:
	1. To detect whenever you move inside the type of element you are
		seeking and whenever you move out of it.  As long as these
		elements cannot be nested inside each other, this is an
		easy binary task.  If they can be nested, then you will
		need to maintain some kind of level count or recursively
		decompose each level.

	2. Once you have obtained a complete element (from its start tag to
		its end tag) you will need to test whether you have the
		single correct element that you are looking for.

Something like this (untested) will work if the target tag cannot be nested
in another target tag:

import xml.sax

class tagSearcher(xml.sax.ContentHandler):

    def startDocument(self):
        self.inTarget = False
        self.saved = []

    def startElement(self, name, attrs):
        if name == targetName:
            self.inTarget = True
        elif self.inTarget:
            pass  # save element information (name, attrs)

    def endElement(self, name):
        if name == targetName:
            self.inTarget = False
            # test the saved information to see if you have the
            # one you want:
            #
            #    if it's the piece you are looking for, then
            #        you can process the information
            #        you have saved
            #
            #    if not, discard the accumulated
            #        information and move on
            self.saved = []

    def characters(self, content):
        if self.inTarget:
            self.saved.append(content)  # save the content

yourHandler = tagSearcher()
yourParser = xml.sax.make_parser()
yourParser.parse(inputXML, yourHandler)

Then you just walk through the document picking up and discarding each
target element type until you have the one that you are looking for.

> I need to detect them & then for each one, I need to copy all the content
> between the element's start & end tags & create a smaller XML file.

Easy enough; but, with SAX you will have to recreate the tags from the
information they contain, because the tags themselves are not reported by
the characters() events; so you will need to save the information from
each tag as you come across it.  This could probably be done more
automatically using saxutils.XMLGenerator; but, I haven't actually worked
with it before.  xml.dom.pulldom also looks interesting.
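As a sketch of that idea (untested in anger; the <card>/<switch> element
names are invented for illustration), you can forward the events you see
inside a target element to an XMLGenerator, so each element is serialized,
tags and all, into its own string:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class ElementExtractor(xml.sax.ContentHandler):
    """Copy every <target> element, tags included, into its own string."""

    def __init__(self, target):
        self.target = target
        self.depth = 0        # level count, in case targets can nest
        self.writer = None    # XMLGenerator re-emitting the current element
        self.results = []     # one XML string per extracted element

    def startElement(self, name, attrs):
        if name == self.target:
            if self.depth == 0:
                self.out = io.StringIO()
                self.writer = XMLGenerator(self.out, encoding="utf-8")
            self.depth += 1
        if self.depth:
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.depth:
            self.writer.endElement(name)
        if name == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(self.out.getvalue())
                self.writer = None

    def characters(self, content):
        if self.depth:
            self.writer.characters(content)

handler = ElementExtractor("card")
xml.sax.parseString(b"<switch><card>a</card><card>b</card></switch>", handler)
# handler.results is now ['<card>a</card>', '<card>b</card>']
```

Each string in results could then be written out as its own smaller XML
file.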
 
> 1. Can you point me to some examples/samples of using SAX, especially ,
> ones dealing with really large XML files.

There is nothing special about large files with SAX.  SAX is very simple:
it walks through the document and calls the functions that you
give it for each event as it reaches various elements.  Your callback
functions (methods of a handler) do everything with the information;
SAX does nothing more than call your functions.  There are events for
reaching a start tag, an end tag, and characters between tags,
as well as some for beginning and ending a document.
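A toy handler that just records each event makes the flow concrete (the
element names here are invented):

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record every SAX callback, in document order."""

    def __init__(self):
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def characters(self, content):
        if content.strip():                       # skip whitespace runs
            self.events.append("chars:" + content.strip())

    def endElement(self, name):
        self.events.append("end:" + name)

    def endDocument(self):
        self.events.append("endDocument")

handler = EventLogger()
xml.sax.parseString(b"<config><port>eth0</port></config>", handler)
# handler.events == ['startDocument', 'start:config', 'start:port',
#                    'chars:eth0', 'end:port', 'end:config', 'endDocument']
```

Note that the parser never builds a tree; it only fires these callbacks,
which is why memory use stays flat no matter how big the file is.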

> 2. This brings me to another question, which I forgot to ask in my OP
> (original post).  Is simply opening the file & using regex to look for the
> element I need a *good* approach?  While researching my problem, some
> articles seemed to advise against this, especially since it's known a
> priori that the file is XML & since regex code gets complicated very
> quickly & is not very readable.
>
> But is that just a "style"/"elegance" issue, & for my particular problem
> (detecting a certain element, & then creating (writing) a smaller xml
> file corresponding to each pair of start & end tags of said element),
> is the open-file-&-regex approach something you would recommend?

It isn't an invalid approach if it works for your situation.  I have
used it before for very simple problems.  The thing is, XML is a
context-free data format, which makes it difficult to generate precise
regular expressions, especially where tags of the same type can be nested.

It can be very error prone.  It's really easy to have a regex work for
your tests and then fail, either by matching too much or failing to match,
because you didn't anticipate a given piece of data.  I wouldn't consider
it a robust solution.
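A quick demonstration of how the obvious non-greedy regex goes wrong the
moment tags nest (the <unit> element name is invented):

```python
import re

# A naive non-greedy pattern for <unit>...</unit> elements.
pattern = re.compile(r"<unit>.*?</unit>", re.DOTALL)

# On flat input it looks fine:
flat = "<unit>a</unit><unit>b</unit>"
print(pattern.findall(flat))
# -> ['<unit>a</unit>', '<unit>b</unit>']

# But nest one target inside another and it stops at the FIRST closing
# tag, splitting the outer element and silently dropping its tail:
nested = "<unit>outer<unit>inner</unit>tail</unit>"
print(pattern.findall(nested))
# -> ['<unit>outer<unit>inner</unit>']
```

A real parser tracks nesting depth for you, which is exactly what the
level count in point 1 above buys you with SAX.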


