Parsing HTML
Fredrik Lundh
fredrik at pythonware.com
Fri Sep 24 04:35:04 EDT 2004
Thomas Guettler wrote:
> If you want to parse many HTML pages, you can use tidy to create
> xml and then use an xml parser. There are too many ways HTML can be
> broken.
including the page Anders pointed to, which is too broken for tidy's
default settings:
line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
you can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable Swedish text).
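for example, here's a minimal sketch of the second option, recoding the page
before tidy sees it (this is the same conversion alternative 1 below uses,
and it assumes the page really is UTF-8):

from urllib import urlopen

text = urlopen(url).read()  # url as in the usage example below
# decode as UTF-8 and recode as Latin 1, so that tidy's default
# input encoding matches the actual encoding
text = unicode(text, "utf-8").encode("iso-8859-1")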
I've attached a script based on my ElementTidy binding for tidy. see
alternative 1 below. usage:
URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
wordlist = parse_microsoft_wordlist(URL)
for item in wordlist:
    print item
the wordlist contains (English word, Swedish word) pairs, using Unicode
where appropriate.
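since the words come back as Unicode strings, you may need to encode them
before printing; a small sketch, assuming a terminal that can display Latin 1:

for english, swedish in wordlist:
    print english.encode("iso-8859-1"), "=", swedish.encode("iso-8859-1")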
you can get elementtree and elementtidy via
http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm
on the other hand, for this specific case, a regular expression-based approach
is probably easier. see alternative 2 below for one way to do it.
</F>
# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach
from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()

    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)

    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")

    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text) # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text) # <mshelp> stuff

    # now, let's process it
    tree = parse(StringIO(text))

    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist
# helpers

def XHTML(tag):
    # map a tag to its XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag
def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word
def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text
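here's a quick sketch of what flatten does; the sample markup is made up,
but has the same shape as the word-list cells:

from elementtree.ElementTree import XML

td = XML("<td>1. <b>abort</b> verb</td>")
print flatten(td)  # prints "1. abort verb"

(fixword then strips the leading "1. " with the \d+ pattern.)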
# --------------------------------------------------------------------
# alternative 2: using regular expressions
import re
from urllib import urlopen
def parse_microsoft_wordlist(url):
    text = urlopen(url).read()
    text = unicode(text, "utf-8")

    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"

    def fixword(word):
        # get rid of leading "nnn. " numbering
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word

    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))
    return wordlist
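to see what the pattern picks out, here's a self-contained sketch run against
a fabricated table row (the attribute is made up, but the structure matches
the real page):

import re

pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"
sample = u'<tr>\n<td width="30%">1. abort</td>\n<td>avbryt</td>\n</tr>'

print re.findall(pattern, sample)
# prints [(u'1. abort', u'avbryt')]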
# --------------------------------------------------------------------