python/xpath question...
John J. Lee
jjlee at reportlab.com
Sun Jul 9 10:33:55 EDT 2006
(Damn gmane's authorizer, I think I lost four postings because the
auth messages went to my work email address (and I thought the
authorization was supposed to be one-time only per group anyway??). I
deleted them as spam since I hadn't posted from there for days :-(
Grrr. At least I could reconstruct this one...)
"bruce" <bedouglas at earthlink.net> writes:
> for guys with python/xpath expertise..
>
> i'm playing with xpath.. and i'm trying to solve an issue...
>
> i have the following kind of situation where i'm trying to get certain data.
>
> i have a bunch of tr/td...
>
> i can create an xpath, that gets me all of the tr.. i only want to get the
> sibling tr up until i hit a 'tr' that has a 'th' anybody have an idea as to
> how this query might be created?..
[...]
((//tr/th)[2]/../following-sibling::tr/td/..)[count(.|((//tr/th)[3]/../preceding-sibling::*))=count((//tr/th)[3]/../preceding-sibling::*)]
which makes use of the following idiom for writing an intersection:
$set1[count(.|$set2)=count($set2)]
and gets the second group in the sequence you describe. IMHO, this
illustrates what happens when XPath is pushed too far ;-) I don't see
an easier way, but perhaps I missed one.
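To see why the intersection idiom works, here is a minimal sketch that models it with plain Python sets rather than XPath. The function name intersect and the single-letter stand-ins for the tr rows are made up for illustration; set1 plays the role of the tr/td rows following the first header, set2 the rows preceding the second header.

```python
def intersect(set1, set2):
    """Mimic $set1[count(.|$set2)=count($set2)] with Python sets."""
    set2 = set(set2)
    # For each node in set1, form the union {node} | set2. The union has
    # the same size as set2 only if node was already a member of set2.
    return [node for node in set1 if len({node} | set2) == len(set2)]


# Hypothetical stand-ins for the rows in a table like the example's.
rows_after_first_header = ["B", "C", "E", "F", "H", "I"]   # tr/td rows only
rows_before_second_header = ["A", "B", "C"]

print(intersect(rows_after_first_header, rows_before_second_header))
# prints ['B', 'C']
```

When set2 is empty the union always has one more member than set2, so nothing is selected. That is the same reason the XPath expression returns no trailing group when there is no terminating tr/th.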
Example code:
(Note that the expression used here doesn't get any trailing group of
tr elements if there's no terminating tr/th -- that fits your
specification, but may not be what you really wanted. To fix that,
meditate on the above expression for an hour or two <0.8 wink>.)
#---------------------------------------------------------
def xpath(path, source):
    import StringIO
    import pprint
    from lxml import etree
    f = StringIO.StringIO(source)
    tree = etree.parse(f)
    r = tree.xpath(path)
    #return "\n".join(etree.tostring(el) for el in r)
    return pprint.pformat([etree.tostring(el) for el in r])

simple = """\
<html>
<tr><th>A</th></tr>
<tr><td>B</td></tr>
<tr><td>C</td></tr>
<tr><th>D</th></tr>
<tr><td>E</td></tr>
<tr><td>F</td></tr>
<tr><th>G</th></tr>
<tr><td>H</td></tr>
<tr><td>I</td></tr>
</html>
"""

for i in range(3):
    expr = '((//tr/th)[%s]/../following-sibling::tr/td/..)[count(.|((//tr/th)[%s]/../preceding-sibling::*))=count((//tr/th)[%s]/../preceding-sibling::*)]' % (i+1, i+2, i+2)
    print "---------------------"
    print xpath(expr, simple)
#---------------------------------------------------------
john[0]$ tst.py
---------------------
['<tr><td>B</td></tr>\n', '<tr><td>C</td></tr>\n']
---------------------
['<tr><td>E</td></tr>\n', '<tr><td>F</td></tr>\n']
---------------------
[]
Knowing what you're doing, though, you'd probably be better off with
BeautifulSoup than XPath. Also note that mechanize (which I know
you're using) only supports BeautifulSoup 2 at present. You can't use
BeautifulSoup 3 yet (I hope to fix that 'RSN').
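For comparison, the procedural approach might look something like the sketch below. It swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup, is written for Python 3, and assumes the input is well-formed XML like the test document above (real-world HTML often is not). The names group_rows and SIMPLE are hypothetical. Unlike the XPath expression, it also keeps a trailing group that has no terminating tr/th.

```python
import xml.etree.ElementTree as ET

SIMPLE = """\
<html>
<tr><th>A</th></tr>
<tr><td>B</td></tr>
<tr><td>C</td></tr>
<tr><th>D</th></tr>
<tr><td>E</td></tr>
<tr><td>F</td></tr>
<tr><th>G</th></tr>
<tr><td>H</td></tr>
<tr><td>I</td></tr>
</html>
"""


def group_rows(source):
    """Return a list of (header_text, rows) pairs, one per tr/th row."""
    root = ET.fromstring(source)
    groups = []
    for tr in root.iter("tr"):
        if tr.find("th") is not None:
            # A header row starts a new group.
            groups.append((tr.findtext("th"), []))
        elif groups:
            # A data row joins the most recent group; data rows that
            # appear before the first header are ignored.
            groups[-1][1].append([td.text for td in tr.findall("td")])
    return groups


for header, rows in group_rows(SIMPLE):
    print(header, rows)
```

Each group is collected in a single pass over the rows, so there is no need to build a separate expression per header.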
John