Parsing HTML
Anders Eriksson
anders.eriksson at morateknikutveckling.se
Mon Sep 27 01:48:35 EDT 2004
I would like to thank everyone who helped with this!
The solution I settled on uses BeautifulSoup and a script that Mr.
Leonard Richardson sent me.
Now on to the next part of the problem: how to manage Unicode...
// Anders
--
To promote the use of BeautifulSoup, here is the script by Mr. Leonard
Richardson:
import urllib
import re
from BeautifulSoup import BeautifulSoup
URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
text = urllib.urlopen(URL).read()
# remove all <b> and </b>
p = re.compile(r'</?b>')
text = p.sub('',text)
# soupify it
soup = BeautifulSoup(text)
def unmunge(value):
    """Use this method to turn, eg "74. <b>Help</b> menu" into "Help menu",
    probably using a regular expression."""
    return value[value.find('.')+2:]
d = []
cols = soup.fetch('td', {'width' : '33%'})
for i in range(0, len(cols)):
    if i % 3 != 2: #Every third column is a note which we ignore.
        value = unmunge(cols[i].renderContents())
        if not d or len(d[-1]) == 2:
            #English term
            d.append([value])
        else:
            #Swedish term
            d[-1].append(value)
d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
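
The docstring of unmunge() suggests a regular expression, while the script
itself slices on the first '.'. As a rough sketch of the regex variant, and
of the English/Swedish/note pairing, the following runs without the network
or BeautifulSoup. The names unmunge_re, pair_terms and sample_cells are
hypothetical and not part of the original script, and the sample cells are
invented for illustration, not taken from the MSDN page. It assumes the cells
arrive as plain strings in (English, Swedish, note) order.

```python
import re

# Leading item number such as "74. " at the start of a cell.
_NUMBER_PREFIX = re.compile(r'^\s*\d+\.\s*')
# Any remaining <b> or </b> tag.
_BOLD_TAG = re.compile(r'</?b>')

def unmunge_re(value):
    """Turn '74. <b>Help</b> menu' into 'Help menu'."""
    value = _BOLD_TAG.sub('', value)
    return _NUMBER_PREFIX.sub('', value).strip()

def pair_terms(cells):
    """Map English terms to Swedish terms, skipping every third cell (the note)."""
    pairs = {}
    for i in range(0, len(cells) - 1, 3):
        pairs[unmunge_re(cells[i])] = unmunge_re(cells[i + 1])
    return pairs

# Invented sample data standing in for the contents of the <td> cells.
sample_cells = [
    "1. <b>File</b>", "1. Arkiv", "a note",
    "2. <b>Edit</b>", "2. Redigera", "",
]

for english, swedish in sorted(pair_terms(sample_cells).items()):
    print("%s = %s" % (english, swedish))
```

Unlike the slicing version, this leaves a cell without a number prefix
unchanged instead of cutting off its first character.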