Saving search results in a dictionary

Paul McGuire ptmcg at austin.rr._bogus_.com
Thu Jun 17 13:56:45 EDT 2004


"Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
news:Pine.LNX.4.60.0406171557330.16166 at nymfe30.fi.muni.cz...
> Hi everyone!
>
> How can I simply search text for regexps (let's say <a
> href="(.*?)">(.*?)</a>) and save all URLs(1) and link contents(2) in a
> dictionary { name : URL}? In a single pass, if possible.
>
> Or how can I replace the html &entities; in a string
> "blablabla&blablabal&balbalbal" with the chars they mean using
> re.sub? I found out they are stored in a dict [from htmlentitydefs import
> entitydefs]. I thought about this functionality:
>
> regexp = re.compile("&[a-zA-Z];")
> regexp.sub(entitydefs[r'\1'], url)
>
> but it can't work, because the r'...' must be eaten directly by the sub, and
> cannot be used so independently (at least I think so). Any ideas? Thanks
> in advance.
>
> -i
>
> ---------------------------------------_.)--
> |  Lukas Holcik (xholcik1 at fi.muni.cz)  (\=)*
> ----------------------------------------''--
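[Editor's note: the single-pass regex approach the question asks about can also be sketched directly with the stdlib re module. This is a toy pattern that assumes well-formed, double-quoted hrefs with no extra attributes -- not a general HTML parser; the sample `html` string is invented for illustration.]

```python
import re

html = ('<a href="http://example.com">Example</a> and '
        '<a href="http://python.org">Python</a>')

# One pass: findall returns (url, name) pairs; swap them so the
# dictionary maps link text to URL, i.e. { name : URL }.
link_re = re.compile(r'<a href="(.*?)">(.*?)</a>')
links = dict((name, url) for url, name in link_re.findall(html))

print(links)
```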
Lukas -

Here is an example script from the upcoming 1.2 release of pyparsing.  It is
certainly not a one-liner, but it should be fairly easy to follow.  (This
example makes two passes over the input, but only to show two different
output styles - the dictionary creation is done in a single pass.)

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# URL extractor
# Copyright 2004, Paul McGuire
from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
        Word,dblQuotedString,alphanums
import urllib
import pprint

# Define the pyparsing grammar for a URL, that is:
#    URLlink ::= <a href= URL>linkText</a>
#    URL ::= doubleQuotedString | alphanumericWordPath
# Note that whitespace may appear just about anywhere in the link.  Note also
# that it is not necessary to explicitly show this in the pyparsing grammar; by
# default, pyparsing skips over whitespace between tokens.
linkOpenTag = (Literal("<") + "a" + "href" + "=").suppress() + \
                ( dblQuotedString | Word(alphanums+"/") ) + \
                Suppress(">")
linkCloseTag = Literal("<") + "/" + CaselessLiteral("a") + ">"
link = linkOpenTag + CharsNotIn("<") + linkCloseTag.suppress()

# Go get some HTML with some links in it.
serverListPage = urllib.urlopen( "http://www.yahoo.com" )
htmlText = serverListPage.read()
serverListPage.close()

# scanString is a generator that loops through the input htmlText, and for each
# match yields the tokens and start and end locations (for this application, we
# are not interested in the start and end values).
for toks,strt,end in link.scanString(htmlText):
    print toks.asList()

# Rerun scanString, but this time create a dict of text:URL key-value pairs.
# Need to reverse the tokens returned by link, using a parse action.
link.setParseAction( lambda st,loc,toks: [ toks[1], toks[0] ] )

# Create dictionary from list comprehension, assembled from each pair of
# tokens returned from a matched URL.
pprint.pprint(
    dict( [ toks for toks,strt,end in link.scanString(htmlText) ] )
    )
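[Editor's note: on the second question, re.sub also accepts a function as the replacement argument, which is the usual way to look each match up in a dict. The pattern quoted in the original post matches only single-character entity names and has no capture group; the sketch below adjusts both. The try/except import is an editorial addition so the example runs on both the Python 2 module named in the post and its Python 3 successor.]

```python
import re
try:
    from htmlentitydefs import entitydefs   # Python 2 name (as in the post)
except ImportError:
    from html.entities import entitydefs    # Python 3 name

def expand_entities(text):
    # re.sub calls replace() once per match, passing the match object,
    # so the captured entity name can be looked up in entitydefs.
    def replace(match):
        name = match.group(1)
        # Leave unknown entities untouched rather than raising KeyError.
        return entitydefs.get(name, match.group(0))
    return re.sub(r"&([a-zA-Z]+);", replace, text)

print(expand_entities("Tom &amp; Jerry"))
```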




