Help Parsing an HTML File

Sat Feb 16 23:06:52 EST 2008

On Feb 15, 3:28 pm, egonslo... at gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has  descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>

Pyparsing was mentioned earlier, so here is a sample with some
annotating comments.

I'm a little worried when you say the file "fairly resembles
HTML-EXAMPLE."  With parsers, the devil is in the details, and if you
have scrambled this format (the HTML attributes are especially
suspicious), then the parser will need to be adjusted to match the real
input. If the file being parsed really has proper HTML attributes (of
the form <tag attrname="attrvalue">), then you could simplify the code
by using the pyparsing method makeHTMLTags.  But the example I wrote
matches the example you posted.
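
For comparison, here is a minimal sketch of what the makeHTMLTags
variant might look like. It assumes the real file uses a proper
attribute such as name="segment1" on its <div> tags; the attribute name
"name", the sample string, and the variable names are hypothetical and
not taken from the posted example. It is written for Python 3.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical input: one <div> with a proper name="..." attribute and
# one bare <DIV> (the tag match is case-insensitive).
sample = '<div name="segment1">first body</div><DIV>second body</DIV>'

# makeHTMLTags returns a pair of expressions for the opening and
# closing tag; attributes of the opening tag become named results.
div_start, div_end = makeHTMLTags("div")
div_entry = div_start + SkipTo(div_end)("body") + div_end

entries = []
for tokens, start, end in div_entry.scanString(sample):
    # Fall back to "DIV" when the tag carries no name attribute,
    # mirroring the parse action in the hand-written grammar.
    label = tokens["name"].title() if "name" in tokens else "DIV"
    entries.append((label, tokens["body"]))

print(entries)
```

With this approach the hand-written div expression and its parse
action are no longer needed, but it only works if the attributes in the
real file are well formed.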

-- Paul

# encoding=utf-8

from pyparsing import *

data = """
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
... snip ...
"""
# define <XXX> and </XXX> tags
CL = CaselessLiteral
h1,h2,cmnt,br = \
    map(Suppress,
        map(CL,["<%s>" % s for s in "h1 h2 comment br".split()]))
h1end,h2end,cmntEnd,divEnd = \
    map(Suppress,
        map(CL,["</%s>" % s for s in "h1 h2 comment div".split()]))
# h1,h1end = makeHTMLTags("h1")

# define special format for <div>, incl. optional quoted string "attribute"
div = "<" + CL("div") + Optional(QuotedString('"'))("name") + ">"
div.setParseAction(
    lambda toks: "name" in toks and toks.name.title() or "DIV")

# define <xxx>body</xxx> entries
h1Entry = h1 + SkipTo(h1end) + h1end
h2Entry = h2 + SkipTo(h2end) + h2end
comment = cmnt + SkipTo(cmntEnd) + cmntEnd
divEntry = div + SkipTo(divEnd) + divEnd

# just return nested tokens
grammar = (OneOrMore(Group(h1Entry +
            (Group(h2Entry +
                (OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)

results = grammar.parseString(data)
from pprint import pprint
pprint(results.asList())
print

# return nested tokens, with dict
grammar = Dict(OneOrMore(Group( h1Entry +
            Dict(Group(h2Entry +
                Dict(OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)
results = grammar.parseString(data)
print results.dump()

Prints:

[['Ros\xe9H1-1',
  ['Ros\xe9H2-1',
   ['DIV', 'Ros\xe9DIV-1'],
   ['Segment1', 'Ros\xe9SegmentDIV1-1'],
   ['Segment2', 'Ros\xe9SegmentDIV2-1'],
   ['Segment3', 'Ros\xe9SegmentDIV3-1']]],
 ['PinkH1-2',
  ['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
 ['BlackH1-3',
  ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]],
 ['YellowH1-4',
  ['YellowH2-4',
   ['DIV', 'YellowDIV2-4'],
   ['Segment1', 'YellowSegmentDIV1-4'],
   ['Segment2', 'YellowSegmentDIV2-4']]]]

[['Ros\xe9H1-1', ['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]], ['PinkH1-2', ['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]], ['BlackH1-3', ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]], ['YellowH1-4', ['YellowH2-4', ['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]]
- BlackH1-3: [['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]]
  - BlackH2-3: [['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]
    - DIV: BlackDIV2-3
    - Segment1: BlackSegmentDIV1-3
- PinkH1-2: [['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]]
  - PinkH2-2: [['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]
    - DIV: PinkDIV2-2
    - Segment1: PinkSegmentDIV1-2
- RoséH1-1: [['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]]
  - RoséH2-1: [['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]
    - DIV: RoséDIV-1
    - Segment1: RoséSegmentDIV1-1
    - Segment2: RoséSegmentDIV2-1
    - Segment3: RoséSegmentDIV3-1
- YellowH1-4: [['YellowH2-4', ['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]
  - YellowH2-4: [['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]
    - DIV: YellowDIV2-4
    - Segment1: YellowSegmentDIV1-4
    - Segment2: YellowSegmentDIV2-4
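
If you only have the plain nested list from the first grammar (the
asList() output), a rough sketch like the following could turn it into
ordinary Python dicts without pyparsing's Dict class. The function name
to_dict and the variable names are made up for illustration, the data is
a two-entry excerpt of the list printed above, and the code assumes
Python 3 and exactly one h2 group per h1, as in that output.

```python
# Excerpt of the nested list printed by pprint(results.asList()).
nested = [
    ['PinkH1-2',
     ['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
    ['BlackH1-3',
     ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]],
]

def to_dict(h1_groups):
    """Map each h1 title to {h2 title: {div label: div body}}."""
    tree = {}
    for h1_title, (h2_title, *div_pairs) in h1_groups:
        # Each div entry is a [label, body] pair, so dict() accepts
        # the list of pairs directly.
        tree[h1_title] = {h2_title: dict(div_pairs)}
    return tree

tree = to_dict(nested)
print(tree['PinkH1-2']['PinkH2-2']['Segment1'])  # PinkSegmentDIV1-2
```

Note that a plain dict keeps only the last value if two divs under the
same h2 share a label.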