Python web client anyone?
Ng Pheng Siong
ngps at madcap.dyndns.org
Mon Oct 15 10:14:34 EDT 2001
According to Paul Rubin <phr-n2001d at nightsong.com>:
> What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start. By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>. The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.
Surely, you find the tags by parsing "<a href=blah.html>" (sic), not by
looking for "this is the anchor text"?
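For what it's worth, once the tags are parsed, looking a link up by its anchor text is a small step. Below is a minimal sketch, assuming a current Python 3 and its standard html.parser module rather than the htmllib discussed in this thread. The names AnchorCollector and find_href are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs, in document order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so lookups are not layout-sensitive.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def find_href(html, anchor_text):
    """Return the href of the first link whose text matches, else None."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text == anchor_text:
            return href
    return None
```

A client could then "click" a link by calling find_href(page, "this is the anchor text") and requesting whatever URL comes back.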
> I may not have read the htmllib docs carefully enough but it looks more
> intended for formatting/displaying HTML than parsing it. Are your
> DOM extensions available?
htmllib parses fine enough. Here's a demo from M2Crypto. It seems to work,
too. ;-)
"""M2Crypto.SSL.Session client demo: This program requests a URL from
a HTTPS server, saves the negotiated SSL session id, parses the HTML
returned by the server, then requests each HREF in a separate thread
using the saved SSL session id.
Copyright (c) 1999-2000 Ng Pheng Siong. All rights reserved."""
RCS_id='$Id: sess.py,v 1.2 2000/09/11 14:52:29 ngps Exp ngps $'
from M2Crypto import Err, Rand, SSL, X509, threading
m2_threading = threading; del threading
import formatter, getopt, htmllib, sys
from threading import Thread
from socket import gethostname
def handler(sslctx, host, port, href, recurs=0, sslsess=None):

    s = SSL.Connection(sslctx)
    if sslsess:
        s.set_session(sslsess)
        s.connect((host, port))
    else:
        s.connect((host, port))
        sslsess = s.get_session()
    #print sslsess.as_text()

    if recurs:
        p = htmllib.HTMLParser(formatter.NullFormatter())

    f = s.makefile("rw")
    f.write(href)
    f.flush()

    while 1:
        data = f.read()
        if not data:
            break
        if recurs:
            p.feed(data)

    if recurs:
        p.close()

    f.close()

    if recurs:
        for a in p.anchorlist:
            req = 'GET %s HTTP/1.0\r\n\r\n' % a
            thr = Thread(target=handler,
                         args=(sslctx, host, port, req, recurs-1, sslsess))
            print "Thread =", thr.getName()
            thr.start()
if __name__ == '__main__':

    m2_threading.init()
    Rand.load_file('../randpool.dat', -1)

    host = '127.0.0.1'
    port = 443
    req = '/'

    optlist, optarg = getopt.getopt(sys.argv[1:], 'h:p:r:')
    for opt in optlist:
        if '-h' in opt:
            host = opt[1]
        elif '-p' in opt:
            port = int(opt[1])
        elif '-r' in opt:
            req = opt[1]

    ctx = SSL.Context('sslv3')
    ctx.load_cert('client.pem')
    ctx.load_verify_info('ca.pem')
    ctx.load_client_ca('ca.pem')
    ctx.set_verify(SSL.verify_none, 10)

    req = 'GET %s HTTP/1.0\r\n\r\n' % req

    start = Thread(target=handler, args=(ctx, host, port, req, 1))
    print "Thread =", start.getName()
    start.start()
    start.join()

    m2_threading.cleanup()
    Rand.save_file('../randpool.dat')
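
The parsing step in the demo can be tried without M2Crypto or a server. The following is a rough sketch, assuming Python 3's html.parser in place of htmllib: it gathers every HREF into an anchorlist, much as the demo relies on htmllib to do, and builds the same HTTP/1.0 request strings that handler() passes to each thread. HrefCollector and build_requests are hypothetical names used only here.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Gather the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.anchorlist.append(value)


def build_requests(html):
    """Return one HTTP/1.0 GET request string per link found in html."""
    p = HrefCollector()
    p.feed(html)
    p.close()
    return ["GET %s HTTP/1.0\r\n\r\n" % a for a in p.anchorlist]
```

Like the demo, this sends each HREF exactly as it appears in the page, so relative or absolute URLs are not resolved against the request path.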
--
Ng Pheng Siong <ngps at post1.com> * http://www.post1.com/home/ngps