Regular Expression question

Paul McGuire ptmcg at austin.rr._bogus_.com
Mon Aug 21 09:57:10 EDT 2006


<stevebread at yahoo.com> wrote in message
news:1156153916.849933.178790 at 75g2000cwc.googlegroups.com...
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/>  <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>

A pyparsing solution may not be a speed demon at run time, but it doesn't take
long to write.  Some short explanatory comments:
- makeHTMLTags returns a tuple of opening- and closing-tag expressions.  This
example does not use any closing tags, so it is simpler to discard them and
keep only the zeroth return value.
- Your example includes not only <tag1> and <tag2> tags but also a <br> tag,
which is presumably ignorable.
- Each match produced by searchString includes named fields for the tag
attributes, making it easy to access the name and value attributes.
- The expression generated by makeHTMLTags will also handle tags with other
attributes that we didn't anticipate (such as "<br clear='all'/>" or
"<tag2 value='adj__short__' modifier='adv__very__'/>").
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want.

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


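from pyparsing import makeHTMLTags

# sample input, copied from the original question
data = """
<tag1 name="john"/>  <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

# keep only the opening-tag expression from each (open, close) pair
tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

# slice off the leading "adj__" and the trailing "__" from each value
for tokens in patt.searchString(data):
    print("%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2]))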


Prints:
john, tall
jack, short
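
The [5:-2] slice assumes every value starts with a five-character prefix such
as "adj__" and ends with "__".  If the prefix length might vary, something
like the following may be a little more forgiving.  This is only a sketch
using the standard library; strip_markers is a made-up helper name, not part
of pyparsing, and the "adverb__very__" and "plain" inputs are invented for
illustration.

```python
def strip_markers(value):
    """Return the text between the first '__' and a trailing '__'.

    Hypothetical helper for illustration only.  Values that do not
    follow the 'prefix__word__' shape are returned unchanged.
    """
    prefix, sep, rest = value.partition("__")
    if sep and rest.endswith("__"):
        return rest[:-2]
    return value


for raw in ("adj__tall__", "adj__short__", "adverb__very__", "plain"):
    print("%s -> %s" % (raw, strip_markers(raw)))
```

For the two values in the original data this gives the same words as the
fixed slice; it only differs when the prefix is not exactly three letters.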


Printing tokens.dump() for the last match gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__
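
To try out the point about attributes we didn't anticipate, a small sketch
along these lines should do.  It assumes pyparsing is installed and that
makeHTMLTags and searchString are available under those names in your
version; the sample string is invented here, built from the two tag examples
mentioned in the comments above.

```python
from pyparsing import makeHTMLTags

# keep only the opening-tag expressions, as in the main example
tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

patt = tag1 + tag2
patt.ignore(br)

# invented sample: <br> carries an attribute, <tag2> carries an extra one
sample = """<tag1 name="jack"/> <br clear='all'/>
<tag2 value='adj__short__' modifier='adv__very__'/>"""

found = []
for tokens in patt.searchString(sample):
    # the extra attribute shows up as another named field on startTag2
    found.append((tokens.startTag1.name,
                  tokens.startTag2.value,
                  tokens.startTag2.modifier))

print(found)
```

The extra "modifier" attribute should not stop the match, and it can be read
the same way as "value".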




