Parsing HTML
Fredrik Lundh
fredrik at pythonware.com
Fri Sep 24 04:35:04 EDT 2004
Thomas Guettler wrote:
> If you want to parse many HTML pages, you can use tidy to create
> xml and then use an xml parser. There are too many ways HTML can be
> broken.
including the page Anders pointed to, which is too broken for tidy's
default settings:
line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
you can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable Swedish text).
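for example, here's a minimal sketch of the second option, recoding the page
before tidy sees it (this is the same conversion alternative 1 below uses,
and it assumes the page really is UTF-8):

from urllib import urlopen

text = urlopen(url).read()  # url as in the usage example below
# decode as UTF-8 and recode as Latin 1, so that tidy's default
# input encoding matches the actual encoding
text = unicode(text, "utf-8").encode("iso-8859-1")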
I've attached a script based on my ElementTidy binding for tidy. see
alternative 1 below. usage:
URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
wordlist = parse_microsoft_wordlist(URL)
for item in wordlist:
    print item
the wordlist contains (English word, Swedish word) pairs, using Unicode
where appropriate.
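since the words come back as Unicode strings, you may need to encode them
before printing; a small sketch, assuming a terminal that can display Latin 1:

for english, swedish in wordlist:
    print english.encode("iso-8859-1"), "=", swedish.encode("iso-8859-1")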
you can get elementtree and elementtidy via
http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm
on the other hand, for this specific case, a regular expression-based approach
is probably easier. see alternative 2 below for one way to do it.
</F>
# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach
from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()

    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)

    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")

    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text) # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text) # <mshelp> stuff

    # now, let's process it
    tree = parse(StringIO(text))

    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist
# helpers

def XHTML(tag):
    # map a tag to its XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag
def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word
def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text
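here's a quick sketch of what flatten does; the sample markup is made up,
but has the same shape as the word-list cells:

from elementtree.ElementTree import XML

td = XML("<td>1. <b>abort</b> verb</td>")
print flatten(td)  # prints "1. abort verb"

(fixword then strips the leading "1. " with the \d+ pattern.)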
# --------------------------------------------------------------------
# alternative 2: using regular expressions
import re
from urllib import urlopen
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    text = unicode(text, "utf-8")

    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"

    def fixword(word):
        # get rid of leading "nnn. " numbering
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word

    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))
    return wordlist
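to see what the pattern picks out, here's a self-contained sketch run against
a fabricated table row (the attribute is made up, but the structure matches
the real page):

import re

pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
sample = u'<tr>\n<td width="30%">1. abort</td>\n<td>avbryt</td>\n</tr>'

print re.findall(pattern, sample)
# prints [(u'1. abort', u'avbryt')]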
# --------------------------------------------------------------------