Parsing HTML

Anders Eriksson anders.eriksson at
Mon Sep 27 01:48:35 EDT 2004

I would like to thank everyone who has helped with this!

The solution I settled on was to use BeautifulSoup and a script that Mr.
Leonard Richardson sent me.

Now on to the next part of the problem: how to manage Unicode...

// Anders
To promote the usage of BeautifulSoup, here is the script by Mr. Leonard
Richardson:

import urllib
import re
from BeautifulSoup import BeautifulSoup

# URL is not defined in this post; set it to the address of the page to parse.
text = urllib.urlopen(URL).read()
# remove all <b> and </b>
p = re.compile(r'</?b>')
text = p.sub('', text)

# soupify it
soup = BeautifulSoup(text)

def unmunge(value):
    """Use this method to turn, eg "74. <b>Help</b> menu" into "Help menu",
    probably using a regular expression."""
    return value[value.find('.')+2:]

d = []
cols = soup.fetch('td', {'width' : '33%'})
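for i in range(0, len(cols)):
    if i % 3 != 2: #Every third column is a note which we ignore.
        value = unmunge(cols[i].renderContents())
        if not d or len(d[-1]) == 2:
            #English term: start a new [English, Swedish] pair
            d.append([value])
        else:
            #Swedish term: complete the current pair
            d[-1].append(value)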
d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
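
The docstring of unmunge suggests a regular expression would probably be a
better fit than slicing at the first '.'. A minimal sketch of that idea
follows, written for Python 3 rather than the Python 2 of the script above and
using only the standard library. The names unmunge_re, pair_terms and the
sample cells are hypothetical and not part of the original script; the cell
layout (English term, Swedish term, note) is assumed from the comments in the
loop.

```python
import re

# Hypothetical helpers for illustration; not part of the original script.
NUMBER_PREFIX = re.compile(r'^\s*\d+\.\s*')       # e.g. "74. "
BOLD_TAGS = re.compile(r'</?b>', re.IGNORECASE)   # <b> and </b>


def unmunge_re(value):
    """Strip a leading "74. " style number and any <b>/</b> tags."""
    value = BOLD_TAGS.sub('', value)
    return NUMBER_PREFIX.sub('', value).strip()


def pair_terms(cells):
    """Map English term -> Swedish term, skipping every third cell (the note).

    Assumes the cells arrive in the order English, Swedish, note.
    A trailing cell without a partner is ignored.
    """
    pairs = {}
    for i in range(0, len(cells) - 1, 3):
        pairs[unmunge_re(cells[i])] = unmunge_re(cells[i + 1])
    return pairs


# Made-up sample cells, only to show the shape of the input.
sample = ['75. <b>File</b> menu', '75. <b>Arkiv</b>-menyn', 'a note']
for key, val in pair_terms(sample).items():
    print("%s = %s" % (key, val))
```

Unlike the slicing version, this leaves a value untouched when it has no
leading number, instead of cutting off its first character.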

