Help with parsing web page

Tue Jun 15 10:07:50 EDT 2004

"RiGGa" <rigga at hasnomail.com> wrote in message
news:aFkzc.15232$NK4.2386151 at stones.force9.net...
> Hi,
>
> I want to parse a web page in Python and have it write certain values out
to
> a mysql database.  I really dont know where to start with parsing the html
> code ( I can work out the database part ).  I have had a look at htmllib
> but I need more info. Can anyone point me in the right direction , a
> tutorial or something would be great.
>
> Many thanks
>
> RiGga
>
RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields).  You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers.  The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
   <td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern.  If you convert to regexp's, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." +
integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern =  tdStart + ipAddress.setResultsName("ipAddr") + tdEnd +
\
        tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL =
"http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc