converting an html table to a tree

Sami Hangaslammi sami.hangaslammi.spam.trap at yomimedia.fi
Fri Aug 25 05:35:28 EDT 2000


"Ian Lipsky" <NOSPAM at pacificnet.net> wrote in message
news:3ufp5.440$bw2.8538 at newsread2.prod.itd.earthlink.net...

> I know i saw a bit of code dealing with doing something like that...i think
> it was using regexp? i'll have to dig it up.

A very simple regexp solution that builds a list of all tables in a
document (ignoring all tags except table, tr and td):

|import re
|
|def rex_tag(tag):
|    return re.compile("(?msi)<%s.*?>(.*?)</%s.*?>" % (tag,tag))
|
|rex_table = rex_tag("table")
|rex_row = rex_tag("tr")
|rex_data = rex_tag("td")
|
|def find_data(row):
|    return rex_data.findall(row)
|
|def find_rows(table):
|    # list() keeps the result a real list on Python 3, where map() is lazy
|    return list(map(find_data, rex_row.findall(table)))
|
|def parse_tables(html):
|    return list(map(find_rows, rex_table.findall(html)))

Using a test.html file like this:

<html>
<head><title>test table</title></head>
<body>
  <table width="50%">
    <tr>
      <td>foo</td>
      <td>bar</td>
    </tr>
    <tr>
      <td>1</td>
      <td>2</td>
    </tr>
  </table>
</body>
</html>

The parse_tables function returns:
[
  [
    ['foo', 'bar'],
    ['1', '2']
  ]
]
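
For anyone who wants to try this without creating the file, here is a rough self-contained sketch of the same thing. It assumes Python 3, so map() is wrapped in list() to get real nested lists back, and the TEST_HTML name is just something made up for this example (it holds an inline copy of test.html). Treat it as an illustration rather than a robust parser: it inherits the same limits as the regexps above.

```python
import re

def rex_tag(tag):
    # Non-greedy match of <tag ...>contents</tag>, case-insensitive,
    # with "." allowed to span newlines.
    return re.compile("(?msi)<%s.*?>(.*?)</%s.*?>" % (tag, tag))

rex_table = rex_tag("table")
rex_row = rex_tag("tr")
rex_data = rex_tag("td")

def find_data(row):
    return rex_data.findall(row)

def find_rows(table):
    # list() is an adaptation for Python 3, where map() returns an iterator
    return list(map(find_data, rex_row.findall(table)))

def parse_tables(html):
    return list(map(find_rows, rex_table.findall(html)))

# Hypothetical name: an inline copy of the test.html shown above.
TEST_HTML = """<html>
<head><title>test table</title></head>
<body>
  <table width="50%">
    <tr>
      <td>foo</td>
      <td>bar</td>
    </tr>
    <tr>
      <td>1</td>
      <td>2</td>
    </tr>
  </table>
</body>
</html>"""

tables = parse_tables(TEST_HTML)
print(tables)

# Walk the tables -> rows -> cells structure.
for t, table in enumerate(tables):
    for r, row in enumerate(table):
        print("table", t, "row", r, "cells", row)
```

The result is indexed as tables[table][row][cell], so tables[0][1][0] is the first cell of the second row of the first table.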





