Regex help needed!

Paul McGuire ptmcg at austin.rr.com
Tue Dec 22 10:45:11 EST 2009


On Dec 21, 5:38 am, Oltmans <rolf.oltm... at gmail.com> wrote:
> Hello, everyone.
>
> I've a string that looks something like
> ----
> lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
> =   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
> ----
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>

The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, attributes that
appear out of order, weird or missing quotation marks, or tags and
attributes in mixed upper/lower case.  BeautifulSoup is one tool to
use for HTML scraping; here is a pyparsing example, with hopefully
descriptive comments:


from pyparsing import makeHTMLTags, ParseException

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534 """

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
          "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_id values (the parenthesized
# print works under both Python 2 and Python 3)
for divtag in div.searchString(src):
    print(divtag.amazon_id)


Prints:

345343
35343433
8898
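
For comparison, here is a minimal sketch of the same extraction using
only the standard library's html.parser module (Python 3).  It is not
part of the original reply: the class name AmazonIdCollector and the
helper find_amazon_ids are hypothetical names chosen for illustration,
and the isdigit() check is an added assumption that only purely numeric
suffixes are wanted.  Like the pyparsing version, it relies on a real
tag parser to cope with odd spacing, either quote style, and upper/lower
case, rather than encoding all of that in a regex.

```python
from html.parser import HTMLParser


class AmazonIdCollector(HTMLParser):
    """Collect the numeric suffix of every <div> id of the form amazon_NNN.

    Hypothetical helper class, written only for this illustration.
    """

    PREFIX = "amazon_"

    def __init__(self):
        super().__init__()
        self.amazon_ids = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lower-cased,
        # and attribute values with their surrounding quotes removed.
        if tag != "div":
            return
        for name, value in attrs:
            # value is None for attributes written without "=value"
            if name == "id" and value and value.startswith(self.PREFIX):
                suffix = value[len(self.PREFIX):]
                # assumption: keep only purely numeric suffixes
                if suffix.isdigit():
                    self.amazon_ids.append(suffix)


def find_amazon_ids(html):
    """Return the amazon_ id suffixes found in html, in document order."""
    parser = AmazonIdCollector()
    parser.feed(html)
    parser.close()
    return parser.amazon_ids


src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
"""

for amazon_id in find_amazon_ids(src):
    print(amazon_id)
```

On the sample string this should list the same three ids, in the same
order, as the pyparsing version.  The filtering lives in
handle_starttag, which plays roughly the role of the two parse actions
above.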



