HTML data extraction?

Dave Kuhlman dkuhlman at rexx.com
Mon Dec 22 13:29:49 EST 2003


I recently read an article by Jon Udell about extracting data from
Web pages as a poor person's Web services.  So, I have a question:

Is there any Python support for finding and extracting information
from HTML documents?

I'd like something that would do things like the following:

- return the data which is inside a <b> tag which is inside a
  <li> tag.
  
- return the data which is inside a <a> tag that has attribute
  href="http://www.python.org".

- Etc.

It would be a sort of structured grep for HTML.

I've found the HTMLParser and htmllib modules in the Python
standard library, but I'm wondering if there is anything at a
higher level.
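
To make the two queries above concrete, here is a minimal sketch of
a "structured grep" built on the standard library's HTMLParser class.
It is only an illustration under stated assumptions: it imports
HTMLParser from html.parser (the Python 3 location of the module),
it expects reasonably well-formed HTML, and the names StructuredGrep,
grep_html, b_inside_li, link_to and SAMPLE are hypothetical, not part
of any library.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructuredGrep(HTMLParser):
    """Collect the text of every element accepted by a predicate.

    Hypothetical helper: `matches` is called with the stack of currently
    open elements, each a (tag, attrs_dict) pair, innermost element last.
    """

    def __init__(self, matches):
        super().__init__()
        self.matches = matches
        self.stack = []      # open elements, outermost first
        self.results = []    # text of matched elements, in closing order
        self._buffers = []   # (stack depth, text chunks) per open match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs)))
        if self.matches(self.stack):
            self._buffers.append((len(self.stack), []))

    def handle_endtag(self, tag):
        # Unwind to the nearest open element with this name; a stray
        # closing tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        else:
            return
        # Emit every matched element that has just been closed.
        while self._buffers and self._buffers[-1][0] > len(self.stack):
            _, chunks = self._buffers.pop()
            self.results.append("".join(chunks).strip())

    def handle_data(self, data):
        # Text belongs to every matched element that is still open.
        for _, chunks in self._buffers:
            chunks.append(data)


def b_inside_li(stack):
    """Match a <b> element that has an <li> ancestor."""
    tag, _ = stack[-1]
    return tag == "b" and any(t == "li" for t, _ in stack[:-1])


def link_to(href):
    """Build a predicate matching an <a> element with the given href."""
    def matches(stack):
        tag, attrs = stack[-1]
        return tag == "a" and attrs.get("href") == href
    return matches


def grep_html(html, matches):
    """Return the text of all elements in `html` accepted by `matches`."""
    parser = StructuredGrep(matches)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<ul>
  <li><b>first</b> item</li>
  <li>second <b>item</b></li>
</ul>
<p><b>not in a list</b></p>
<a href="http://www.python.org">Python home</a>
<a href="http://www.example.com">elsewhere</a>
"""

print(grep_html(SAMPLE, b_inside_li))                       # ['first', 'item']
print(grep_html(SAMPLE, link_to("http://www.python.org")))  # ['Python home']
```

Passing the query as a plain function over the stack of open tags
keeps the parser itself generic, so other queries only need a new
predicate. Real-world pages with unclosed tags would need more
forgiving handling than this sketch provides.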

Web searches did not turn up anything interesting.

Thanks for any help.

Dave

-- 
http://www.rexx.com/~dkuhlman
dkuhlman at rexx.com



