Parsing HTML

Mon Sep 27 01:48:35 EDT 2004

I would like to thank everyone who has helped with this!

The solution I settled on was to use BeautifulSoup and a script that Mr.
Leonard Richardson sent me.

Now on to the next part of the problem: how to manage Unicode.

// Anders
-- 
To promote the use of BeautifulSoup, here is the script by Mr. Leonard
Richardson:

import urllib
import re
from BeautifulSoup import BeautifulSoup

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
text = urllib.urlopen(URL).read()
# remove all <b> and </b>
p = re.compile(r'</?b>')
text = p.sub('', text)

# soupify it
soup = BeautifulSoup(text)

def unmunge(value):
    """Use this method to turn, eg "74. <b>Help</b> menu" into "Help menu",
    probably using a regular expression."""
    return value[value.find('.')+2:]

d = []
cols = soup.fetch('td', {'width' : '33%'})
for i, col in enumerate(cols):
    if i % 3 != 2:  # Every third column is a note, which we ignore.
        value = unmunge(col.renderContents())
        if not d or len(d[-1]) == 2:
            #English term
            d.append([value])
        else:
            #Swedish term
            d[-1].append(value)
d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
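
The docstring of unmunge() suggests a regular expression, and the loop pairs
up cells that repeat as (English, Swedish, note). Below is a minimal sketch of
both ideas that runs without fetching the page. The names unmunge_re and
pair_terms are hypothetical and not part of the original script, and the
sample cells are made up for illustration rather than taken from the word
list.

```python
import re

# Hypothetical helpers; not part of the original script.
_TAGS = re.compile(r'</?b>')            # matches <b> and </b>
_NUMBER = re.compile(r'^\s*\d+\.\s*')   # matches a leading "74. "

def unmunge_re(value):
    """Turn e.g. '74. <b>Help</b> menu' into 'Help menu'."""
    return _NUMBER.sub('', _TAGS.sub('', value))

def pair_terms(cells):
    """Map English terms to Swedish terms.

    Assumes the cells repeat as (English, Swedish, note),
    which is the layout the loop in the script relies on.
    """
    english = cells[0::3]   # cells 0, 3, 6, ...
    swedish = cells[1::3]   # cells 1, 4, 7, ...
    return dict(zip(english, swedish))

# Made-up sample cells standing in for the <td> contents.
cells = [
    "1. <b>File</b>", "1. Arkiv", "a note",
    "2. <b>Edit</b>", "2. Redigera", "another note",
]
terms = pair_terms([unmunge_re(c) for c in cells])
```

Slicing with a step of 3 replaces the index arithmetic in the loop. An
incomplete final row is silently dropped by zip(), which may or may not be
what you want.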