Extract Information from Tables in HTML

Peter Pearson ppearson at nowhere.invalid
Fri Sep 5 11:03:29 EDT 2008


On Fri, 5 Sep 2008 11:35:14 -0300, Walter Cruz <walter.php at gmail.com> wrote:
> On Fri, Sep 5, 2008 at 11:29 AM, Jackie Wang <jackie.python at gmail.com> wrote:
>> Here is a html code:
>>
>> <td valign="top" headers="col4">
>>
>>         Premier Community Bank of Southwest Florida
>>         <br />
>>         Fort Myers, FL
>>
>> </td>
>>
>> My question is how I can extract the strings and get the results:
>> Premier Community Bank of Southwest Florida; Fort Myers, FL
>
> Use BeautifulSoup.

I agree, BeautifulSoup is wonderful.  Here are snippets of
code that I recently used to locate (in each of many HTML
files) the table that contained a particular heading:

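  from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3 (Python 2)
  import re
  ...
  # ifd is the input file object, opened in the part omitted above.
  inlines = ifd.readlines()
  soup = BeautifulSoup( " ".join( inlines ) )
  # Find every text node that contains the heading.
  x = soup.findAll( text = re.compile( "Technical Requirements - General" ) )
  # Start at the first match and climb to the enclosing <table>.
  x = x[0].parent
  while x.name != "table":
    x = x.parent
  # Count only the table's own rows, not rows of nested tables.
  tr_list = x.findAll( "tr", recursive = False )
  print "Table has %d rows." % len( tr_list )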
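
For anyone reading this on Python 3, here is a minimal sketch of the
same idea with the newer bs4 package, applied to the cell from the
original question. The sample document, the helper names
enclosing_table and cell_text, and the "; " separator are made up for
illustration; the parser choice ("html.parser") is also an assumption,
and other parsers may insert a <tbody> that changes the row count.

```python
import re
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Hypothetical sample document: the cell from the question, placed in a
# table under the heading searched for in the snippet above.
HTML = """
<table>
  <tr><th>Technical Requirements - General</th></tr>
  <tr>
    <td valign="top" headers="col4">

        Premier Community Bank of Southwest Florida
        <br />
        Fort Myers, FL

    </td>
  </tr>
</table>
"""


def enclosing_table(soup, heading_pattern):
    """Return the nearest <table> around the first matching text, or None."""
    node = soup.find(string=re.compile(heading_pattern))
    return node.find_parent("table") if node is not None else None


def cell_text(cell, sep="; "):
    """Join the cell's non-empty text fragments, stripped of whitespace."""
    return sep.join(cell.stripped_strings)


soup = BeautifulSoup(HTML, "html.parser")

table = enclosing_table(soup, "Technical Requirements - General")
rows = table.find_all("tr", recursive=False)  # direct child rows only
print("Table has %d rows." % len(rows))

result = cell_text(soup.find("td"))
print(result)
```

The stripped_strings generator skips whitespace-only fragments, so the
<br /> between the two lines of the cell simply becomes the separator.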


-- 
To email me, substitute nowhere->spamcop, invalid->net.


