[Tutor] parsing html.

Wed Jan 16 14:39:31 CET 2008

Here is a pyparsing approach to your question.  I've added some comments to
walk you through the various steps.  With pyparsing's makeHTMLTags helper
function, it is easy to write short programs that pull selected tags and
their data out of an HTML page.

-- Paul

from pyparsing import makeHTMLTags, SkipTo

html = """
<A name=4></a><b>Table of Contents</b>
.........
<A name=5></a><b>Preface</b>
"""

# define the pattern to search for, using pyparsing makeHTMLTags helper
# makeHTMLTags constructs a very tolerant mini-pattern, to match HTML
# tags with the given tag name:
# - caseless matching on the tag name
# - embedded whitespace is handled
# - detection of empty tags (opening tags that end in "/")
# - detection of tag attributes
# - returning parsed data using results names for attribute values
# makeHTMLTags actually returns two patterns, one for the opening tag
# and one for the closing tag
aStart,aEnd = makeHTMLTags("A")
bStart,bEnd = makeHTMLTags("B")
pattern = aStart + aEnd + bStart + SkipTo(bEnd)("text") + bEnd

# search the input string - dump matched structure for each match
for pp in pattern.searchString(html):
    print(pp.dump())
    print(pp.startA.name, pp.text)

# parse input and build a dict using the results
nameDict = dict((pp.startA.name, pp.text)
                for pp in pattern.searchString(html))
print(nameDict)

The last line of the output is the dict that is created (the order of the
keys may differ depending on your Python version):

{'5': 'Preface', '4': 'Table of Contents'}
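
To illustrate the tolerance described in the comments on makeHTMLTags, here
is a minimal sketch that runs the same pattern over a messier input.  The
messy_html string and the names dict are made up for this example; it
assumes Python 3 and an installed pyparsing, and it uses the same camelCase
names as the program above.

```python
from pyparsing import makeHTMLTags, SkipTo

# Hypothetical sample input (not from the original question): same layout,
# but with mixed-case tag names, a quoted attribute value, an upper-case
# attribute name, and extra whitespace inside the opening tags.
messy_html = """
<a NAME="4" ></A><B>Table of Contents</b>
<A   name = 5 ></a><b>Preface</B>
"""

# Same pattern as in the post: an empty <A> anchor followed by a <B> title.
aStart, aEnd = makeHTMLTags("A")
bStart, bEnd = makeHTMLTags("B")
pattern = aStart + aEnd + bStart + SkipTo(bEnd)("text") + bEnd

# Attribute names are lower-cased for HTML tags, so the anchor attribute is
# read back as "name" even when the source says NAME.
names = {pp.startA.name: pp.text for pp in pattern.searchString(messy_html)}
print(names)
```

If the tags are matched as described, names should hold the same two
entries as nameDict above, which shows that the pattern does not depend on
the exact spelling or spacing of the tags.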