HTML data extraction?

Mon Dec 22 18:49:56 EST 2003

Dave Kuhlman <dkuhlman at rexx.com> writes:
[...]
> I'd like something that would do things like the following:
> 
> - return the data which is inside a <b> tag which is inside a
>   <li> tag.
>   
> - return the data which is inside a <a> tag that has attribute
>   href="http://www.python.org".
> 
> - Etc.
> 
> It would be a sort of structured grep for HTML.

1. http://wwwsearch.sf.net/bits/pullparser.py

It's a port of Perl's HTML::TokeParser.

# Advance to the next <li>, then to the next <b> that follows it.
p = pullparser.PullParser(f)
p.get_tag("li")
p.get_tag("b")
print p.get_text()

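# Print the text of every <a> whose href is http://www.python.org.
p = pullparser.PullParser(f)
while 1:
    try:
        tag = p.get_tag("a")
    except pullparser.NoMoreTagsError:
        break
    # "or []" guards against a tag that carries no attribute list
    if dict(tag.attrs or []).get("href") == "http://www.python.org":
        print p.get_text()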

I'll release a beta version in a day or so with a couple of minor
changes (including that .get_text() will no longer raise
NoMoreTagsError) and a proper tarball package.

2. Run your data through mxTidy or uTidylib to get XHTML, then query
it with the XPath implementation from PyXML.

http://www.zvon.org/xxl/XPathTutorial/General/examples.html

In fact, tidying HTML is sometimes necessary even if you don't need
XHTML or a tree-based API.
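
A minimal sketch of what the XPath route looks like once the input has
been tidied into well-formed XHTML. It assumes Python 3 and swaps in
the standard library's xml.etree.ElementTree, which supports only a
subset of XPath, in place of PyXML's XPath. The sample document and
the helper names text_of and query are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for tidied output; real Tidy output
# would normally also carry a DOCTYPE and a <head>.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<ul><li>One <b>bold item</b></li><li>plain</li></ul>
<p><b>not in a list</b></p>
<p><a href="http://www.python.org">Python</a>
<a href="http://example.com/">Other</a></p>
</body>
</html>"""

# XHTML elements live in this namespace, so paths need a prefix.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def text_of(elem):
    """Return all text inside elem, including text of nested tags."""
    return "".join(elem.itertext())


def query(xhtml, path):
    """Parse the XHTML string and return the text of each match."""
    root = ET.fromstring(xhtml)
    return [text_of(e) for e in root.findall(path, NS)]


# <b> elements anywhere beneath an <li>
bold_in_li = query(XHTML, ".//x:li//x:b")

# <a> elements whose href is exactly http://www.python.org
python_links = query(XHTML, ".//x:a[@href='http://www.python.org']")

print(bold_in_li)
print(python_links)
```

Unlike a token stream, the tree lets the query state the nesting
directly, so the <b> outside the list is not matched.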

3. microdom (part of Twisted)

http://www.xml.com/pub/a/2003/10/15/microdom.html

Haven't used it myself.

John