lisp is winner in DOM parsing contest! 8-]
Richie Hindle
richie at entrian.com
Mon Jul 12 08:32:32 EDT 2004
[Paul]
> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.
It's not exactly what you describe here, but xml.dom.pulldom comes close.
You walk the higher-level nodes as a stream of SAX-style events, so the
whole document never has to sit in memory, but you can access the children
of any of those higher-level elements using DOM. The canonical example is
processing a large number of XML records - the XML document is arbitrarily
large but the individual records aren't. Pulldom passes each record to you
SAX-style, and you use DOM to process the record.
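As a rough sketch of that records case (not taken from any article - the
<records>/<record> layout, the field names and the helper functions
iter_records and text_of are all made up for illustration), something like
this should work. It only uses pulldom.parse, expandNode and the ordinary
DOM calls:

```python
# Sketch only: the <records>/<record> layout and field names are hypothetical.
import io
from xml.dom import pulldom

XML = """<records>
  <record id="1"><name>Alice</name><city>Leeds</city></record>
  <record id="2"><name>Bob</name><city>York</city></record>
  <record id="3"><name>Carol</name><city>Bath</city></record>
</records>"""


def iter_records(stream):
    """Yield one expanded DOM element per <record>, one at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "record":
            # Switch from the event stream to DOM for this record only.
            events.expandNode(node)
            yield node


def text_of(record, tag):
    """Return the text of the first <tag> descendant (assumes it has text)."""
    return record.getElementsByTagName(tag)[0].firstChild.data


names = []
for record in iter_records(io.StringIO(XML)):
    # Ordinary DOM access, but only within the current record.
    names.append((record.getAttribute("id"), text_of(record, "name")))
    print(record.getAttribute("id"), text_of(record, "name"))
```

For a real file you would pass an open file object (or a filename) to
iter_records instead of the StringIO; the loop body stays the same.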
Uche Ogbuji has a short article on xml.dom.pulldom here:
http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html
Here's the example from that article. Line 16 is the key to it - that's the
point at which you switch from SAX to DOM:
 1  #Get the first line in Act IV, scene II
 2
 3  from xml.dom import pulldom
 4
 5  hamlet_file = open("hamlet.xml")
 6
 7  events = pulldom.parse(hamlet_file)
 8  act_counter = 0
 9  for (event, node) in events:
10      if event == pulldom.START_ELEMENT:
11          if node.tagName == "ACT":
12              act_counter += 1
13              scene_counter = 1
14          if node.tagName == "SCENE":
15              if act_counter == 4 and scene_counter == 2:
16                  events.expandNode(node)
17                  #Traditional DOM processing starts here
18                  #Get all descendant elements named "LINE"
19                  line_nodes = node.getElementsByTagName("LINE")
20                  #Print the text data of the text node
21                  #of the first LINE element
22                  print(line_nodes[0].firstChild.data)
23              scene_counter += 1
--
Richie Hindle
richie at entrian.com