lisp is winner in DOM parsing contest! 8-]

Mon Jul 12 08:32:32 EDT 2004

[Paul]
> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.

It's not exactly what you describe, but xml.dom.pulldom comes close.  You
walk the higher-level nodes as a stream of SAX-style events, so you never
hold the whole document in memory, and you can expand any element you care
about into a DOM subtree to get at its children.  The canonical example is
processing a large number of XML records - the XML document is arbitrarily
large but the individual records aren't.  Pulldom hands you each record as
an event, SAX-style, and you use DOM to process the record.
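
Here's a rough sketch of that record-at-a-time pattern.  The <record> and
<name> element names, the sample data and the iter_record_names helper are
all made up for illustration, and it's written for a current Python 3:

```python
import io
from xml.dom import pulldom

# Hypothetical sample data: many small <record> elements in one document.
SAMPLE = """<records>
  <record id="1"><name>alpha</name></record>
  <record id="2"><name>beta</name></record>
  <record id="3"><name>gamma</name></record>
</records>"""


def iter_record_names(stream):
    """Yield (id, name) pairs, building a DOM for one record at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "record":
            # Pull in just this record's subtree; the rest of the
            # document is still consumed as events.
            events.expandNode(node)
            name = node.getElementsByTagName("name")[0].firstChild.data
            yield node.getAttribute("id"), name


names = list(iter_record_names(io.StringIO(SAMPLE)))
print(names)
```

With a real file you'd pass the open file (or its name) to
iter_record_names instead of the StringIO; only one record's worth of DOM
nodes is being built at any time.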

Uche Ogbuji has a short article on xml.dom.pulldom here:
http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html

Here's the example from that article.  Line 16 is the key to it - that's the
point at which you switch from SAX to DOM:

     1  #Get the first line in Act IV, scene II
     2  
     3  from xml.dom import pulldom
     4  
     5  hamlet_file = open("hamlet.xml")
     6  
     7  events = pulldom.parse(hamlet_file)
     8  act_counter = 0
     9  for (event, node) in events:
    10      if event == pulldom.START_ELEMENT:
    11          if node.tagName == "ACT":
    12              act_counter += 1
    13              scene_counter = 1
    14          if node.tagName == "SCENE":
    15              if act_counter == 4 and scene_counter == 2:
    16                  events.expandNode(node)
    17                  #Traditional DOM processing starts here
    18                  #Get all descendant elements named "LINE"
    19                  line_nodes = node.getElementsByTagName("LINE")
    20                  #Print the text data of the text node
    21                  #of the first LINE element
    22                  print(line_nodes[0].firstChild.data)
    23              scene_counter += 1
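
If you want to see the switch happen, here's a small sketch that checks the
node before and after the expandNode call.  The inline document is a made-up
stand-in for hamlet.xml, and the behaviour shown is what I'd expect from the
standard library's default pulldom handler:

```python
from xml.dom import pulldom

# Made-up stand-in for hamlet.xml.
SAMPLE = (
    "<PLAY><ACT><SCENE>"
    "<LINE>First line</LINE><LINE>Second line</LINE>"
    "</SCENE></ACT></PLAY>"
)

events = pulldom.parseString(SAMPLE)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "SCENE":
        # Event side: only the start tag has been delivered so far,
        # so the node has no children attached yet.
        before = node.hasChildNodes()
        events.expandNode(node)
        # DOM side: the subtree has now been read and attached.
        after = node.hasChildNodes()
        first_line = node.getElementsByTagName("LINE")[0].firstChild.data
        break

print(before, after, first_line)
```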

-- 
Richie Hindle
richie at entrian.com