Parsing HTML with BeautifulSoup

Johann Spies jspies at sun.ac.za
Thu Dec 10 04:15:19 EST 2009


I am trying to get CSV output from an HTML file.

With this code I had a little success:
=========================
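from BeautifulSoup import BeautifulSoup

f = open("configuration.html", "r")
g = open("configuration.csv", "w")
soup = BeautifulSoup(f)
tables = soup.findAll('table')
for table in tables:
    rows = table.findAll('tr')
    # Header row: write the first text node of each <th> cell.
    for th in rows[0].findAll('th'):
        text = th.find(text=True)
        g.write(text)
        g.write(',')

    # The header row has no <td> cells, so on the first pass this loop
    # only writes the newline that ends the header line.
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                # find(text=True) returns only the first text node in the cell
                text = td.find(text=True).replace(' ','')
                g.write(text)
            except Exception:
                # e.g. find() returned None because the cell has no text
                g.write('')
            g.write(",")
        g.write("\n")
f.close()
g.close()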
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users at Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
...

It left out everything in the <td> cells that is not plain text, such as
the link text inside <a> elements.

I then tried using renderContents() and got something like this (one
line broken into many for the sake of this email):

1,<img src=icons/group.png> <a href=#OBJ_sunetint>
sunetint</A><BR>, 
<img src=icons/gateway_cluster.png> <a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png> <a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png> drop,
<img src=icons/log.png> Log ,
<img src=icons/any.png> Any<br> ,
<img src=icons/gateway_cluster.png> <a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR> , 

How do I get BeautifulSoup to render (taking the above line as an
example)

sunetint for <img src=icons/group.png> <a
href=#OBJ_sunetint>sunetint</A><BR>

and still return the text of the <td> cells that contain only plain text?

I have experimented a little with regular expressions, but so far
could not find a solution.
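
To make the desired result concrete, here is a minimal sketch of the
behaviour asked for above. It is an assumption-laden illustration, not a
BeautifulSoup solution: it swaps in the standard library's html.parser
(Python 3) so that it runs without extra packages. The names
CellTextParser, table_to_csv and sample are hypothetical. It joins all
text fragments inside each cell, so link text such as "sunetint" is kept
while the <img>, <a> and <br> tags are dropped.

```python
import csv
import io
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # Keep text only while inside a cell; tags themselves are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join all fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    # csv.writer handles quoting of cells that contain commas.
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample in the style of the cells shown above.
sample = (
    "<table><tr><th>RULE</th><th>SOURCE</th><th>ACTION</th></tr>"
    "<tr><td>1</td>"
    "<td><img src=icons/group.png> <a href=#OBJ_sunetint>sunetint</A><BR></td>"
    "<td><img src=icons/drop.png> drop</td></tr></table>"
)
print(table_to_csv(sample))
```

For the sample, the first line of output is RULE,SOURCE,ACTION and the
second is 1,sunetint,drop. Using csv.writer instead of writing commas by
hand also avoids the trailing comma on each line.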

Regards
Johann
-- 
Johann Spies          Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

     "Lo, children are an heritage of the LORD: and the  
      fruit of the womb is his reward."        Psalms 127:3 


