Parsing HTML

Paul McGuire ptmcg at austin.rr.com
Sun Feb 11 17:34:28 EST 2007


On Feb 10, 5:03 pm, "mtuller" <mitul... at gmail.com> wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> <tr >
> <td headers="col1_1"  style="width:21%"   >
> <span  class="hpPageText" >LETTER</span></td>
> <td headers="col2_1"  style="width:13%; text-align:right"   >
> <span  class="hpPageText" >33,699</span></td>
> <td headers="col3_1"  style="width:13%; text-align:right"   >
> <span  class="hpPageText" >1.0</span></td>
> <td headers="col4_1"  style="width:13%; text-align:right"   >
> </tr>
>
> What is shown is only a small section.
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing
> the html with pyparsing, and the examples will get it to print all
> instances with span, of which there are a hundred or so when I use:
>
> for srvrtokens in printCount.searchString(printerListHTML):
>         print srvrtokens
>
> If I set the last line to srvrtokens[3] I get the values, but I don't
> know how to grab a single line and then set that as a variable.
>

So what you are saying is that you need to make your pattern more
specific.  So I suggest adding these items to your matching pattern:
- only match span if inside a <td> with attribute 'headers="col2_1"'
- only match if the span body is an integer (with optional comma
separator for thousands)

This grammar adds these more specific tests for matching the input
HTML (note also the use of results names to make it easy to extract
the integer number, and a parse action added to integer to convert the
'33,699' string to the integer 33699).

-- Paul


htmlSource = """<tr >
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
<td headers="col4_1"  style="width:13%; text-align:right"   >
</tr>"""

from pyparsing import makeHTMLTags, Word, nums, ParseException

tdStart, tdEnd     = makeHTMLTags('td')
spanStart, spanEnd = makeHTMLTags('span')

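def onlyAcceptWithTagAttr(attrname,attrval):
    # Build a parse action that rejects a tag unless it carries
    # the attribute attrname with exactly the value attrval.
    def action(tagAttrs):
        if not(attrname in tagAttrs and tagAttrs[attrname]==attrval):
            # Raising ParseException makes this tag fail to match.
            raise ParseException("",0,"")
    return action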

tdStart.setParseAction(onlyAcceptWithTagAttr("headers","col2_1"))
spanStart.setParseAction(onlyAcceptWithTagAttr("class","hpPageText"))

integer = Word(nums,nums+',')
integer.setParseAction(lambda t: int(t[0].replace(",", "")))

patt = (tdStart + spanStart + integer.setResultsName("intValue")
        + spanEnd + tdEnd)

for matches in patt.searchString(htmlSource):
    print(matches.intValue)

prints:
33699
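
For comparison, here is a rough sketch of the same two tests (td must
have headers="col2_1", span body must be an integer with optional
commas) using only the standard library's html.parser module instead
of pyparsing. It is a different technique from the grammar above, not
a replacement for it. The names Col2IntegerFinder and
find_col2_integers are made up for illustration, and the sketch
assumes each matching span holds nothing but the number.

```python
import re
from html.parser import HTMLParser

# Same shape as Word(nums, nums + ','): one digit, then digits or commas.
INTEGER_RE = re.compile(r"\d[\d,]*")


class Col2IntegerFinder(HTMLParser):
    """Collect integers found inside
    <td headers="col2_1"><span class="hpPageText">...</span></td>.
    (Hypothetical helper, not part of any library.)"""

    def __init__(self):
        super().__init__()
        self.in_td = False    # inside a <td> whose headers attribute matches
        self.in_span = False  # inside a matching <span> within that <td>
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_td = attrs.get("headers") == "col2_1"
        elif tag == "span" and self.in_td:
            self.in_span = attrs.get("class") == "hpPageText"

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        text = data.strip()
        # Only accept span bodies that are entirely digits and commas.
        if self.in_span and INTEGER_RE.fullmatch(text):
            self.values.append(int(text.replace(",", "")))


def find_col2_integers(html):
    """Return the list of integers found in matching cells."""
    finder = Col2IntegerFinder()
    finder.feed(html)
    finder.close()
    return finder.values


sample = ('<tr ><td headers="col1_1" ><span class="hpPageText" >LETTER</span></td>'
          '<td headers="col2_1" ><span  class="hpPageText" >33,699</span></td></tr>')
values = find_col2_integers(sample)
pageCount = values[0] if values else None  # keep the first match in a variable
```

Keeping the result in a variable such as pageCount, rather than
printing it, is what you would want before inserting it into a
database.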




