[Tutor] parsing html.

Paul McGuire ptmcg at austin.rr.com
Wed Jan 16 14:39:31 CET 2008


Here is a pyparsing approach to your question.  I've added some comments to
walk you through the various steps.  With pyparsing's makeHTMLTags helper
function, it is easy to write short programs that pull selected tags and
their data out of an HTML page.

-- Paul


from pyparsing import makeHTMLTags, SkipTo

html = """
<A name=4></a><b>Table of Contents</b>
.........
<A name=5></a><b>Preface</b>
"""

# define the pattern to search for, using pyparsing's makeHTMLTags helper.
# makeHTMLTags constructs a very tolerant mini-pattern that matches HTML
# tags with the given tag name:
# - caseless matching on the tag name
# - embedded whitespace is handled
# - detection of empty tags (opening tags that end in "/")
# - detection of tag attributes
# - returning parsed data using results names for attribute values
# makeHTMLTags actually returns two patterns, one for the opening tag
# and one for the closing tag
aStart, aEnd = makeHTMLTags("A")
bStart, bEnd = makeHTMLTags("B")
# match <A ...></A><B>...</B>, saving everything between <B> and </B>
# under the results name "text"
pattern = aStart + aEnd + bStart + SkipTo(bEnd)("text") + bEnd

# search the input string - dump matched structure for each match
for pp in pattern.searchString(html):
    print(pp.dump())
    print(pp.startA.name, pp.text)

# parse input and build a dict using the results
nameDict = dict((pp.startA.name, pp.text)
                for pp in pattern.searchString(html))
print(nameDict)


The last line of the output is the dict that is created (the key order may
differ depending on your Python version):

{'5': 'Preface', '4': 'Table of Contents'}
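
If you want to see the tolerance described in the comments for yourself,
here is a minimal sketch.  The three sample tags are made up for
illustration (they are not from the original page), and the sketch uses
dict-style access (["name"]) in place of the attribute-style access
(.name) used above; both read the same parsed attribute value.

```python
from pyparsing import makeHTMLTags

# opening/closing patterns for anchor tags, as in the program above
aStart, aEnd = makeHTMLTags("A")

# hypothetical variations of the same opening tag
variants = [
    '<A name=4>',         # upper-case tag name, unquoted attribute value
    '<a name="4">',       # lower-case tag name, quoted attribute value
    '<a  NAME = "4" >',   # extra whitespace, upper-case attribute name
]

# each variant parses, and the attribute is reported under the
# lower-cased key "name" with any surrounding quotes removed
names = [aStart.parseString(v)["name"] for v in variants]
print(names)
```

All three spellings should give the same attribute value, which is why the
single pattern above can cope with HTML written in inconsistent styles.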





