Extract Information from Tables in HTML

Stefan Behnel stefan_ml at behnel.de
Fri Sep 5 12:05:20 EDT 2008


Hi,

Jackie Wang wrote:
> Here is a html code:
> 
> <td valign="top" headers="col4">
> 
>          Premier Community Bank of Southwest Florida
>          <br />
>          Fort Myers, FL
> 
> </td>
> 
> My question is how I can extract the strings and get the results:
> Premier Community Bank of Southwest Florida; Fort Myers, FL

Use lxml.html. Something like this should do what you want:

  >>> from lxml import html
  >>> tree = html.parse("http://server.org/thefile.html")
  >>> all_tds = tree.findall("//td")
  >>> for td in all_tds:
  ...     print( td.xpath("normalize-space()") )

Tweak as you see fit; tree iteration is at your service in case you need more.
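
Note that normalize-space() collapses each cell into one space-separated
string, so it will not produce the "; " separator asked for in the question.
A minimal sketch of one way to get that with tree iteration follows. The
SAMPLE string, the wrapping <table><tr> and the cell_text() helper are
assumptions made up for this illustration, not part of lxml or of the
original page:

```python
from lxml import html

# Assumed sample input: the fragment from the question, wrapped in a
# table row so that the HTML parser keeps the <td> element.
SAMPLE = """
<table><tr>
<td valign="top" headers="col4">

         Premier Community Bank of Southwest Florida
         <br />
         Fort Myers, FL

</td>
</tr></table>
"""

def cell_text(td, sep="; "):
    """Hypothetical helper: join the text chunks of one cell with `sep`."""
    # itertext() yields every text chunk inside the cell in document order
    # (here: the text before and after the <br />). Collapse the whitespace
    # inside each chunk and skip chunks that end up empty.
    parts = (" ".join(chunk.split()) for chunk in td.itertext())
    return sep.join(part for part in parts if part)

root = html.fromstring(SAMPLE)
results = [cell_text(td) for td in root.iter("td")]
for line in results:
    print(line)
```

For a real page, the same helper should work on the cells returned by the
findall() call above instead of root.iter("td").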

http://codespeak.net/lxml/

Stefan


