Parsing HTML

mtuller mituller at gmail.com
Thu Feb 8 14:38:14 EST 2007


I am trying to parse a web page and extract information from it using
pyparsing. Here is what I have:

from pyparsing import *
import urllib

# define basic text pattern
spanStart = Literal('<span class=\"hpPageText\">')

spanEnd = Literal('</span></td>')

printCount = spanStart + SkipTo(spanEnd) + spanEnd

# get printer addresses
printerURL = "http://printer.mydomain.com/hp/device/this.LCDispatcher?nav=hp.Usage"
printerListPage = urllib.urlopen(printerURL)
printerListHTML = printerListPage.read()
printerListPage.close()

for srvrtokens, startloc, endloc in printCount.scanString(printerListHTML):
    print srvrtokens

print printCount


I added the last print statement to check what pattern is being used,
because I am getting nothing back. It prints:
{"<span class="hpPageText">" SkipTo:("</span></td>") "</span></td>"}

If I pull out the "hpPageText" I get results back, but more than what
I want. I know it has something to do with escaping the quotation
marks, but I am puzzled as to how to do it.
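
A quick check that does not involve pyparsing suggests the backslashes
themselves may not be the culprit: inside a single-quoted Python string,
\" is just a double quote. The sketch below compares the two spellings
and then shows how an exact literal can miss markup that differs only in
quote style or whitespace. The two page fragments are made up for
illustration and are not taken from the real printer page, so this is
only one possible explanation.

```python
# Two ways of writing the same pattern string.
escaped = '<span class=\"hpPageText\">'
plain = '<span class="hpPageText">'

# Inside a single-quoted string, \" is simply a double quote,
# so both spellings produce the identical string.
same_pattern = (escaped == plain)
print(same_pattern)

# Hypothetical page fragments, invented for illustration only.
exact_markup = '<td><span class="hpPageText">1234</span></td>'
variant_markup = "<td><span class='hpPageText' >1234</span> </td>"

# An exact literal is found only when the markup matches character
# for character; different quotes or extra spaces defeat it.
found_in_exact = plain in exact_markup
found_in_variant = plain in variant_markup
print(found_in_exact)
print(found_in_variant)
```

If the real page looks more like the second fragment, comparing the
pattern against the raw HTML (for example by printing part of
printerListHTML) would show where the two differ.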


Thanks,

Mike



