HTML data extraction?
John J. Lee
jjl at pobox.com
Mon Dec 22 18:49:56 EST 2003
[Sorry if this got posted twice, not sure what I did...]
Dave Kuhlman <dkuhlman at rexx.com> writes:
[...]
> I'd like something that would do things like the following:
>
> - return the data which is inside a <b> tag which is inside a
> <li> tag.
>
> - return the data which is inside a <a> tag that has attribute
> href="http://www.python.org".
>
> - Etc.
>
> It would be a sort of structured grep for HTML.
1. http://wwwsearch.sf.net/bits/pullparser.py
It's a port of Perl's HTML::TokeParser.
    import pullparser

    # Text of a <b> inside an <li>: skip to the <li> first, then the <b>.
    p = pullparser.PullParser(f)
    p.get_tag("li")
    p.get_tag("b")
    print p.get_text()

    p = pullparser.PullParser(f)
    for tag in p:
        tag = p.get_tag("a")
        if dict(tag.attrs).get("href") == "http://www.python.org":
            print p.get_text()
I'll release a beta version in a day or so with a couple of minor
changes (including that .get_text() will no longer raise
NoMoreTagsError) and a proper tarball package.
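
For comparison, here is a rough sketch of the same two queries using only the standard library's html.parser (Python 3). It is not pullparser and does not use its API. The names TagTextGrep, grep_text and SAMPLE are made up for illustration, and the handling of unclosed tags is a simplifying assumption rather than real HTML error recovery.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TagTextGrep(HTMLParser):
    """Collect the text inside each `tag` element that passes the filters."""

    def __init__(self, tag, inside=None, attrs=None):
        super().__init__(convert_charrefs=True)
        self.tag = tag              # element whose text we want
        self.inside = inside        # optional ancestor that must be open
        self.attrs = attrs or {}    # attribute values that must match
        self.stack = []             # currently open tags, outermost first
        self.capture_depth = None   # stack depth where capturing began
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.capture_depth is None and tag == self.tag:
            found = dict(attrs)
            ancestor_ok = self.inside is None or self.inside in self.stack
            attrs_ok = all(found.get(k) == v for k, v in self.attrs.items())
            if ancestor_ok and attrs_ok:
                self.capture_depth = len(self.stack)
                self.buffer = []
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, ignore it
        # Unwind to the matching opener, dropping any unclosed children.
        while self.stack.pop() != tag:
            pass
        if self.capture_depth is not None and len(self.stack) <= self.capture_depth:
            self.results.append("".join(self.buffer).strip())
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)


def grep_text(html, tag, inside=None, attrs=None):
    """Return the text of every matching element in `html`."""
    parser = TagTextGrep(tag, inside=inside, attrs=attrs)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<ul>
  <li><b>first</b> item</li>
  <li>second <a href="http://www.python.org">Python</a></li>
</ul>
<p><b>not in a list</b> <a href="http://example.com/">elsewhere</a></p>
"""

print(grep_text(SAMPLE, "b", inside="li"))                            # ['first']
print(grep_text(SAMPLE, "a", attrs={"href": "http://www.python.org"}))  # ['Python']
```

Unlike a pull parser, this is event-driven: the parser calls the handlers, so the nesting state has to be kept by hand in the stack.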
2. Stuff your data through mxTidy or uTidylib to get XHTML, then query
it with XPath from PyXML.
http://www.zvon.org/xxl/XPathTutorial/General/examples.html
In fact, tidying HTML is sometimes necessary even if you don't need
XHTML or a tree-based API.
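
As a rough illustration of the XPath route, the sketch below assumes the page has already been tidied into well-formed XHTML (the TIDIED string stands in for that output). It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, in place of PyXML. TIDIED, NS and text_of are illustrative names, not part of any of the tools mentioned above.

```python
import xml.etree.ElementTree as ET

# Tidied XHTML puts every element in the XHTML namespace, so the
# XPath expressions need a prefix bound to it.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Stand-in for the output of a tidy step (assumed, not generated here).
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<ul>
<li><b>first</b> item</li>
<li>second <a href="http://www.python.org">Python</a></li>
</ul>
<p><b>not in a list</b></p>
</body>
</html>"""

root = ET.fromstring(TIDIED)


def text_of(elem):
    """Concatenate all text inside an element, including nested tags."""
    return "".join(elem.itertext())


# <b> elements anywhere beneath an <li>.
bold_in_li = [text_of(e) for e in root.findall(".//x:li//x:b", NS)]

# <a> elements whose href attribute equals the given URL.
python_links = [
    text_of(e)
    for e in root.findall(".//x:a[@href='http://www.python.org']", NS)
]

print(bold_in_li)
print(python_links)
```

A full XPath engine would also handle expressions that ElementTree's subset does not cover, such as most functions and axes.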
3. microdom
http://www.xml.com/pub/a/2003/10/15/microdom.html
Haven't used it myself.
John