How can I exclude a word by using re?

Paul McGuire ptmcg at austin.rr.com
Tue Aug 16 09:18:28 EDT 2005


Just as you used "?P<xxx>" in your re to assign the matching text to
the name "xxx", pyparsing lets you associate a name with an element of
your grammar using setResultsName.

Here is your original re:
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
 ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
 ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
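
For reference, here is a small sketch of how the named groups in that re
are read back. It is written in Python 3 syntax (so the ur'' prefix and
re.UNICODE are dropped), and the sample HTML row is made up purely for
illustration; it is not from the original page.

```python
import re

# Python 3 form of the pattern above; str patterns are Unicode by default.
pattern = re.compile(
    r'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
    r'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
    r'(?P<name>.+)</td>',
    re.IGNORECASE,
)

# Hypothetical sample row, invented for this example.
sample = ('<td valign=top>7</td><td class=x> '
          '<a href="song.mp3" target=_blank>My Song</a></td>')

match = pattern.search(sample)
# groupdict() returns only the named groups, keyed by the ?P<...> names.
fields = match.groupdict()
print(fields["number"], fields["url"], fields["name"])
```

Note that the "name" group runs up to the closing </td>, so it includes
the trailing </a> as well as the link text.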

Here is the pyparsing expression:
valign + number.setResultsName("number") + tdEnd + \
            tdStart + SkipTo(aStart) + aStart + \
            SkipTo(tdEnd) + tdEnd

Here are the re and pyparsing pieces side by side:
re  =>  pyparsing
-----------------------
valign=top>  =>  valign = CaselessLiteral("valign=top>")
(?P<number>\d{1,2})  =>  number = Word(nums), number.setResultsName("number")
</td>  =>  tdEnd
<td[^>]*>  =>  tdStart
\s{0,2}  =>  I don't know what this re does, so I just used SkipTo(aStart)
<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>  =>  aStart (which returns a
    value whose named attributes correspond to the HTML attributes, such as
    href)
(?P<name>.+)  =>  SkipTo(tdEnd)  *** here is where we'll make our change ***
</td>  =>  tdEnd

To capture the body of the second <td></td> tag pair, we'll add
setResultsName("name") to the pyparsing expression:
mp3entry = valign + number.setResultsName("number") + tdEnd + \
            tdStart + SkipTo(aStart) + aStart + \
            SkipTo(tdEnd).setResultsName("name") + tdEnd

Now you should be able to extract the data using:
for toks,s,e in mp3entry.scanString(targetHTML):
    print toks.number, toks.starta.href, toks.name
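
Putting the pieces together, here is a minimal self-contained sketch in
Python 3 syntax. It makes a few assumptions: that the installed pyparsing
release still accepts the camelCase names used in this message, that
tdStart/tdEnd and aStart/aEnd are built with makeHTMLTags (their
definitions are not shown here), and that the sample HTML row, which is
invented for illustration, resembles the real page. It reads the link
through toks.href, the attribute name that aStart exposes on the result.

```python
from pyparsing import CaselessLiteral, SkipTo, Word, makeHTMLTags, nums

valign = CaselessLiteral("valign=top>")
number = Word(nums)
# Assumption: the tag expressions come from makeHTMLTags.
tdStart, tdEnd = makeHTMLTags("td")
aStart, aEnd = makeHTMLTags("a")

mp3entry = (valign + number.setResultsName("number") + tdEnd
            + tdStart + SkipTo(aStart) + aStart
            + SkipTo(tdEnd).setResultsName("name") + tdEnd)

# Hypothetical sample row, invented for this example.
targetHTML = ('<tr><td valign=top>7</td><td class=x> '
              '<a href="song.mp3" target=_blank>My Song</a></td></tr>')

rows = []
for toks, start, end in mp3entry.scanString(targetHTML):
    # Named results are read as attributes of the returned tokens.
    rows.append((toks.number, toks.href, toks.name))

for row in rows:
    print(row)
```

As with the re version, SkipTo(tdEnd) captures everything before the
closing </td>, so the "name" value still carries the trailing </a>.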

Good luck!
-- Paul



