How do I enter/receive webpage information?

John J. Lee jjl at pobox.com
Sat Feb 5 17:58:52 EST 2005


Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
[...]
> I did it this way successfully once ... it's probably the wrong approach in 
> some ways, but It Works For Me.
> 
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
>   with headers and all, calling h.send() and h.getresponse() etc.
> 
> - created my own cookie container class (because there was a session
>   involved, and logging in and such things, and all of it used cookies)
> 
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
>   receive. This class knew how to pull the information from an HTML document,
>   provided it looked as I expected it to.  Very tedious work. It can be easier
>   and safer to just use module re in some cases.
> 
> Wrapped in classes this ended up as (fictive):
> 
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
> 
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.

I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug.  It's easy to see and modify
urllib2's headers if you need to get low level.
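
For example, a small sketch of looking at and overriding the headers
urllib2 sends (the User-Agent string and URLs are just placeholders):

import urllib2

# opener-wide headers (replaces the default 'Python-urllib' User-Agent)
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'my-scraper/0.1')]

# or per-request headers
req = urllib2.Request("http://example.com/")
req.add_header('Referer', 'http://example.com/login')

r = opener.open(req)
print req.headers   # headers set on the request
print r.info()      # headers the server sent back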

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Search Google Groups for urlencode.  Or use my module ClientForm, if you
prefer.  Experiment a little with an HTML form in a local file and
(e.g.) the 'ethereal' sniffer to see what happens when you click
submit.
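
A minimal sketch of submitting a form by hand with urlencode, assuming
made-up field names (take the real ones from the form's HTML):

import urllib, urllib2

data = urllib.urlencode({'user': 'me', 'password': 'secret'})
r = urllib2.urlopen("http://example.com/login", data)  # POST, because data is given
print r.read()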

The stdlib now has cookie support (in Python 2.4):

import cookielib, urllib2

# the CookieJar collects cookies from responses and sends them back
# on later requests made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

r = opener.open("http://example.com/")
print r.read()
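
If you want the cookies to survive between runs, cookielib's
LWPCookieJar can save and reload them; a small sketch (the filename is
arbitrary):

import cookielib, urllib2

cj = cookielib.LWPCookieJar("cookies.txt")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

r = opener.open("http://example.com/")
cj.save(ignore_discard=True)   # keep session cookies too

# on the next run, reload them before making requests:
# cj.load(ignore_discard=True)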

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be
necessary to scrape stuff.  A few useful tips:

http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging
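
Before reaching for a sniffer, it can also help to make urllib2 show
the raw traffic itself; a minimal sketch (the URL is a placeholder):

import urllib2

# debuglevel=1 makes httplib print the request and response headers
# on stdout, which is often enough to see what's going wrong
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
r = opener.open("http://example.com/")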


John


