python/xpath question...

Sun Jul 9 10:33:55 EDT 2006

(Damn gmane's authorizer, I think I lost four postings because the
auth messages went to my work email address (and I thought the
authorization was supposed to be one-time only per group anyway??).  I
deleted them as spam since I hadn't posted from there for days :-(
Grrr.  At least I could reconstruct this one...)

"bruce" <bedouglas at earthlink.net> writes:

> for guys with python/xpath expertise..
> 
> i'm playing with xpath.. and i'm trying to solve an issue...
> 
> i have the following kind of situation where i'm trying to get certain data.
> 
> i have a bunch of tr/td...
> 
> i can create an xpath, that gets me all of the tr.. i only want to get the
> sibling tr up until i hit a 'tr' that has a 'th' anybody have an idea as to
> how this query might be created?..
[...]

((//tr/th)[2]/../following-sibling::tr/td/..)[count(.|((//tr/th)[3]/../preceding-sibling::*))=count((//tr/th)[3]/../preceding-sibling::*)]

which makes use of the following idiom for writing an intersection:

$set1[count(.|$set2)=count($set2)]

and gets the second group in the sequence you describe.  IMHO, this
illustrates what happens when XPath is pushed too far ;-) I don't see
an easier way, but perhaps I missed one.
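
To see why the count() idiom computes an intersection: a node from
$set1 is also in $set2 exactly when adding it to $set2 doesn't make
$set2 any bigger.  Here is the same logic as a rough sketch with plain
Python sets standing in for node-sets.  The function name and the
letter labels are made up for illustration, and no XPath is evaluated:

```python
def intersect(set1, set2):
    """Mimic $set1[count(.|$set2)=count($set2)] using Python sets."""
    # Union each candidate with set2; the union keeps the size of set2
    # only if the candidate was already a member of set2.
    return {node for node in set1 if len({node} | set2) == len(set2)}


# Hypothetical stand-ins for the two node-sets, labelled by cell text:
after_second_th = {"E", "F", "H", "I"}            # td rows following the 2nd header row
before_third_th = {"A", "B", "C", "D", "E", "F"}  # rows preceding the 3rd header row

print(sorted(intersect(after_second_th, before_third_th)))  # ['E', 'F']
```

Note that if the second set is empty, nothing passes the test, so the
result is empty as well.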

Example code:

(Note that the expression used here doesn't get any trailing group of
tr elements if there's no terminating tr/th: with no next header, the
second node-set is empty, so the intersection is empty too.  That fits
your specification, but may not be what you really wanted.  To fix
that, meditate on the above expression for an hour or two <0.8 wink>.)

#---------------------------------------------------------
import io
import pprint

from lxml import etree


def xpath(path, source):
    """Evaluate an XPath expression against an XML string."""
    f = io.StringIO(source)
    tree = etree.parse(f)
    r = tree.xpath(path)
    # encoding="unicode" makes tostring() return str rather than bytes
    #return "\n".join(etree.tostring(el, encoding="unicode") for el in r)
    return pprint.pformat([etree.tostring(el, encoding="unicode") for el in r])

simple = """\
<html>
<tr><th>A</th></tr>
<tr><td>B</td></tr>
<tr><td>C</td></tr>
<tr><th>D</th></tr>
<tr><td>E</td></tr>
<tr><td>F</td></tr>
<tr><th>G</th></tr>
<tr><td>H</td></tr>
<tr><td>I</td></tr>
</html>
"""

# Group i+1: td rows after header i+1, intersected with rows before header i+2
for i in range(3):
    expr = '((//tr/th)[%s]/../following-sibling::tr/td/..)[count(.|((//tr/th)[%s]/../preceding-sibling::*))=count((//tr/th)[%s]/../preceding-sibling::*)]' % (i+1, i+2, i+2)
    print("---------------------")
    print(xpath(expr, simple))
#---------------------------------------------------------

john[0]$ tst.py
---------------------
['<tr><td>B</td></tr>\n', '<tr><td>C</td></tr>\n']
---------------------
['<tr><td>E</td></tr>\n', '<tr><td>F</td></tr>\n']
---------------------
[]

Knowing what you're doing, though, you'd probably be better off with
BeautifulSoup than XPath.  Also note that mechanize (which I know
you're using) only supports BeautifulSoup 2 at present.  You can't use
BeautifulSoup 3 yet (I hope to fix that 'RSN').
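
For comparison, a rough sketch of the non-XPath approach: walk the
rows in document order and start a new group at each header row.  This
uses the standard library's xml.etree.ElementTree as a stand-in for
BeautifulSoup (my substitution, only so the example needs nothing
installed), and the function name groups_by_header is made up.  It
assumes well-formed markup like the sample; real-world HTML would need
a tolerant parser.

```python
import xml.etree.ElementTree as ET


def groups_by_header(source):
    """Return [(header_text, [cell_text, ...]), ...], one entry per th row."""
    root = ET.fromstring(source)
    groups = []
    for tr in root.iter("tr"):  # rows in document order
        th = tr.find("th")
        if th is not None:
            # A header row starts a new group.
            groups.append((th.text, []))
        elif groups:
            # A data row belongs to the most recent header, if there is one.
            groups[-1][1].extend(td.text for td in tr.findall("td"))
    return groups


simple = """\
<html>
<tr><th>A</th></tr>
<tr><td>B</td></tr>
<tr><td>C</td></tr>
<tr><th>D</th></tr>
<tr><td>E</td></tr>
<tr><td>F</td></tr>
<tr><th>G</th></tr>
<tr><td>H</td></tr>
<tr><td>I</td></tr>
</html>
"""

for header, cells in groups_by_header(simple):
    print(header, cells)
```

Unlike the XPath expression, this also picks up the trailing group
(H and I here), since a group doesn't need a terminating header row.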

John