Parsing HTML
Chris McD
theratpack91 at yahoo.co.uk
Thu Sep 23 11:58:36 EDT 2004
Anders Eriksson wrote:
> Hello!
>
> I want to extract some info from some specific HTML pages, Microsoft's
> International Word list (e.g.
> http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
> want to take all the words, both English and the other language, and create
> a dictionary, so that I can look up About and get Om as the answer.
>
> What is the best way to do this?
>
> Please help!
>
> // Anders
hi,
try this:
###############################################
# Python 2 (urllib2 and the print statement do not exist in Python 3)
import re, urllib2

# get page
url = 'http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm'
s = urllib2.urlopen(url).read()

# the pattern expects cells like "<td ...>12. <b>word</b></td>" and
# captures the word, with or without the <b> tags
regex = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')
myresult = regex.findall(s)
#print myresult

# map pairs in list to key:value in dict
# (even index = English word, following odd index = translation)
mydict = {}
for i in range(0, len(myresult) - 1, 2):
    mydict[myresult[i]] = myresult[i+1]
#print mydict

# try some words
print mydict['wizard']
print mydict['Web site']
print mydict['unavailable']
##############################
which outputs:
guide
webbplats
inte tillgänglig
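
The pairing step can also be tried offline, without fetching the page. What follows is a minimal sketch for Python 3, not the code above verbatim. The sample HTML is made up to match what the regex expects (the real page's markup may differ), and the names SAMPLE_HTML, CELL and build_dict are chosen here for illustration only.

```python
import re

# Hypothetical sample that imitates the cell layout the regex expects.
# It is NOT copied from the real page; the actual markup is an assumption.
SAMPLE_HTML = """
<table>
<tr><td class="a">1. <b>About</b></td><td class="a">1. Om</td></tr>
<tr><td>2. wizard</td><td>2. guide</td></tr>
<tr><td>3. <b>Web site</b></td><td>3. webbplats</td></tr>
</table>
"""

# Same pattern as in the post, written as a raw string.
CELL = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')

def build_dict(html):
    """Return {english: translation} from alternating table cells."""
    cells = CELL.findall(html)
    # Even-indexed cells are keys, the odd-indexed cells after them are values.
    # zip() stops at the shorter slice, so a trailing unpaired cell is ignored.
    return dict(zip(cells[0::2], cells[1::2]))

words = build_dict(SAMPLE_HTML)
print(words['About'])
print(words['Web site'])
```

Slicing with [0::2] and [1::2] does the same job as the index loop, and avoids reading past the end of the list if the number of matched cells is odd.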
Chris