[XML-SIG] Does anyone do DOM navigation anymore?

Tue Jul 6 17:07:45 CEST 2004

On Mon, Jul 05, 2004 at 06:37:22PM +0800, Derek Fountain wrote:
> I've spent the last few days tinkering with DOM trees and the DOM API. A 
> couple of years back I wrote a fairly complex application which found the 
> data it required using this nextSibling, firstChild, sort of navigation. I 
> recall the development experience wasn't a terribly happy one, and I have 
> always presumed that XPATH was largely invented to get past all this mucking 
> about.
> 
> So it occurs to me to ask on the SIG list: do people still use the original 
> DOM style navigation? When is it preferable to XPATH? Why, in short, is the 
> whole "document hopping" idea not deprecated?!

My main use of the DOM has been to scrape the USPTO[1] pages containing 
individual records (sample patent[2]).  I don't count elements; rather, 
I use clues that are both structural and semantic.  Typically, the 
elements I want are labeled, either in a preceding table cell, or in 
a preceding center, bold, or italicized text element.  E.g. to find 
the patent number and issue date of a patent, I use 
getElementsByTagName() to find all table cells, then look for one 
whose text content reduces to "United States Patent".  At this point 
I know that the next sibling TD contains the patent number, and that 
the second cell of the succeeding row contains the issue date (go up to
parent TR, go up to parent TBODY, choose the second TD of the second 
child TR).  Or, to find the abstract, I examine the direct children 
of BODY until I find a CENTER element whose text reduces to "Abstract",
whereupon I accumulate text until the next HR.  I'm sure this is very
un-XML-like, but I need this data and the approach works.
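
The two lookups described above can be sketched roughly as follows.  This
is only an illustration under stated assumptions: it uses the standard
library's xml.dom.minidom as a stand-in for microdom, it assumes the
markup has already been cleaned up into well-formed lowercase tags, and
the sample document, its patent number and date, and all helper names
are invented for the example.

```python
from xml.dom import minidom

# Invented, already well-formed sample; real pages need cleanup first.
SAMPLE = """<html><body>
<table><tbody>
<tr><td><b>United States Patent</b></td><td><b>0,000,000</b></td></tr>
<tr><td>Doe</td><td>January 1, 2000</td></tr>
</tbody></table>
<hr/>
<center><b>Abstract</b></center>
<p>A made-up widget
is described.</p>
<hr/>
<center><b>Claims</b></center>
<p>Not part of the abstract.</p>
</body></html>"""


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def reduced(node):
    """Text content with runs of whitespace collapsed to single spaces."""
    return " ".join(text_of(node).split())


def element_children(node, tag):
    """Direct child elements of `node` with the given (lowercase) tag."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName.lower() == tag]


def next_element(node):
    """Next sibling that is an element, skipping whitespace text nodes."""
    sib = node.nextSibling
    while sib is not None and sib.nodeType != sib.ELEMENT_NODE:
        sib = sib.nextSibling
    return sib


def find_number_and_date(doc):
    """Locate the label cell, then hop to the number and date cells."""
    for td in doc.getElementsByTagName("td"):
        if reduced(td) == "United States Patent":
            number = reduced(next_element(td))        # next sibling TD
            tbody = td.parentNode.parentNode          # up to TR, up to TBODY
            second_row = element_children(tbody, "tr")[1]
            date = reduced(element_children(second_row, "td")[1])
            return number, date
    return None


def find_abstract(doc):
    """Collect text after the 'Abstract' CENTER until the next HR."""
    body = doc.getElementsByTagName("body")[0]
    collecting = False
    parts = []
    for child in body.childNodes:
        is_element = child.nodeType == child.ELEMENT_NODE
        tag = child.tagName.lower() if is_element else None
        if not collecting:
            if tag == "center" and reduced(child) == "Abstract":
                collecting = True
            continue
        if tag == "hr":
            break
        parts.append(text_of(child))
    return " ".join("".join(parts).split())


if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    print(find_number_and_date(doc))
    print(find_abstract(doc))
```

Note that the sibling hop has to skip whitespace-only text nodes, which
is one reason this style of navigation gets tedious on real pages.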

I use twisted.web.microdom with the 'beExtremelyLenient' flag set to
True.  Some crude HTML flaws must be fixed first; then I run the
document through mx.Tidy, and finally I build the extremely lenient 
microdom.

Chuck

[1] http://www.uspto.gov/patft/index.html
[2] http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-bool.html&r=4&f=G&l=50&co1=AND&d=ptxt&s1=tobacco&OS=tobacco&RS=tobacco