Parsing HTML

Thu Sep 23 11:58:36 EDT 2004

Anders Eriksson wrote:
> Hello!
> 
> I want to extract some info from some specific HTML pages, Microsoft's
> International Word List (e.g.
> http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
> want to take all the words, both English and the other language, and create
> a dictionary, so that I can look up About and get Om as the answer.
> 
> What is the best way to do this?
> 
> Please help!
> 
> // Anders

hi,
try this:

###############################################
import re, urllib2

#get page
s = urllib2.urlopen('http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm').read()

# match cells like '<td ...>N. word</td>', with the word optionally in <b>...</b>
regex = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')
myresult = regex.findall(s)
#print myresult

# map consecutive pairs in the list to key:value in the dict
mydict = {}
for i in range(0, len(myresult) - 1, 2):
    mydict[myresult[i]] = myresult[i+1]

#print mydict

# try some words
print mydict['wizard']
print mydict['Web site']
print mydict['unavailable']

##############################

which outputs:
guide
webbplats
inte tillgänglig
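
On Python 3, where urllib2 no longer exists, the same regex-and-pair idea
could be sketched roughly as below. The markup in SAMPLE_HTML and the name
build_word_dict are assumptions made up for illustration; the real page's
table layout may well differ, so treat this as a sketch, not a tested scraper.

```python
import re

# Hypothetical sample markup: the real page's table layout is assumed here,
# not verified. The word pairs are the ones mentioned in this thread.
SAMPLE_HTML = """
<table>
<tr>
<td width="50%">1. <b>About</b></td>
<td width="50%">1. Om</td>
</tr>
<tr>
<td width="50%">2. <b>unavailable</b></td>
<td width="50%">2. inte tillgänglig</td>
</tr>
<tr>
<td width="50%">3. <b>wizard</b></td>
<td width="50%">3. guide</td>
</tr>
</table>
"""

# Same pattern as in the post above, written as a raw string.
CELL_RE = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')


def build_word_dict(html):
    """Pair up consecutive table cells: English word -> translation."""
    cells = CELL_RE.findall(html)
    # cells[0::2] are the English entries, cells[1::2] the translations;
    # zip() drops a trailing cell that has no partner.
    return dict(zip(cells[0::2], cells[1::2]))


words = build_word_dict(SAMPLE_HTML)
print(words["About"])
print(words["unavailable"])
```

For a live page the HTML would first have to be fetched with
urllib.request.urlopen(...).read() and decoded to text using whatever
encoding the page declares, before it is passed to build_word_dict.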


Chris