Fundamental problem with urllib...

Steve Holden sholden at holdenweb.com
Sat Apr 27 12:31:15 EDT 2002


"Jeff Pitman" <bruthasj at yahoo.com> wrote in message
news:aaa5fu$2u4 at netnews.hinet.net...
> Steve Holden wrote:
>
> > "Jeremy Hylton" <jeremy at alum.mit.edu> wrote ...
> >> "A.M. Kuchling" <akuchlin at ute.mems-exchange.org> wrote ...
> >> > In article <yNUw8.74422$T%5.18813 at atlpnn01.usenetserver.com>,
> >> > Steve Holden wrote:
> >> > > Since urllib knows nothing of cookies, you will need to integrate
> >> > > some sort of a cookie jar into the library, with a new API for
> >> > > the clients to retrieve and store the cookies.
>
> Or do it transparently.
>

Well, the intention was that cookie operation should be transparent, and
simply non-existent when the default cookie jar was used. But you've
obviously divined my intent :-)

>
> >> > This is worthwhile, but I don't think it belongs in urllib.  It
> >> > belongs in a module or package of its own that provides general
> >> > Web-browser features such as cookies, remembering authentication
> >> > usernames and passwords, and a cache.  This package could then be
> >> > used for implementing HTML-scraping scripts, spiders, or a Web
> >> > browser.
>
> No, it belongs in urllib2, because urllib2 recursively re-opens sites
> that redirect: for example, from http://site/index.php to
> http://site/login_page.php to https://site/login_page.php.  All I can
> say is "good luck!" trying to intervene from outside the package
> without rewriting AbstractHTTPHandler.
>
> >> I'm not sure what the difference between an HTTP client, like urllib
> >> or urllib2, and a Web-browser is.  Other than urllib's monolithic
> >> design, why wouldn't you want these sorts of features in the module?
>
> Exactly!
>
> > Personally I imagined passing a dictionary as an optional cookie jar
> > argument, keyed by (domain, path) tuples. The library code would update
> > this as dictated by its interactions with web sources.
>
> This is what I did:
>
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=548197&group_id=5470&atid=305470
>
> And, right now, it is transparent to anyone using urllib2.  It uses a
> persistent dict (within any script that imports urllib2) that is keyed
> on the hostname and stores a "Cookie" object.  Cookies are scraped into
> this object every time the server sends them to the client.  The Cookie
> object is then consulted and its headers sent on each request to the
> server.
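
That's close to what I had pictured, except that I'd key the jar on
(domain, path) tuples rather than just the hostname, so two paths on the
same host can carry different cookies. A rough sketch of the sort of
thing I mean (untested, and the names are my own invention, not the code
in your patch):

    import urlparse

    class CookieJar:
        """Toy cookie jar keyed by (domain, path) tuples."""
        def __init__(self):
            self.jar = {}   # maps (domain, path) -> {name: value}

        def store(self, url, set_cookie_values):
            """Record cookies from a response's Set-Cookie header values."""
            key = self._key(url)
            for value in set_cookie_values:
                # The first ';'-separated piece is NAME=VALUE; attributes
                # such as expires, domain, path and secure are ignored in
                # this sketch.
                name, val = value.split(';')[0].split('=', 1)
                self.jar.setdefault(key, {})[name.strip()] = val.strip()

        def cookie_header(self, url):
            """Build a Cookie: header value for a request, or None."""
            domain, path = self._key(url)
            pairs = []
            for (d, p), cookies in self.jar.items():
                if d == domain and path.startswith(p):
                    pairs.extend(['%s=%s' % item for item in cookies.items()])
            if pairs:
                return '; '.join(pairs)
            return None

        def _key(self, url):
            parts = urlparse.urlparse(url)
            return parts[1], parts[2] or '/'
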
>
> Obviously this is v0.0.0.1, but I think it is the start of something
> usable.  I'm trying to create a library that "HTML-scrapes" websites,
> and while doing so I hit a brick wall with cookies.  This library is
> going to be similar to related Perl scripts, except much cleaner.
>
> Sample screen-scrape:
>
>     ua = HTMLAgent( "http://www.yahoo.com/" )
>     ua.report()
>
>     form = ua.getFormByIndex( 0 )
>     form.fill( 'q', 'python' )
>     ua.submit( form )
>
>     ua.report()
>     ua.clickByName( 'Python Language Website' )
>
>     print "Now at", ua.location.geturl()
>
> So far so good with the library I've written.  Except it can be slow,
> as it uses minidom to parse the HTML.  And I'm a newbie at this stuff,
> so I don't know how to limit the parse to only <form></form> tags in
> the screen-scrape process.
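
You shouldn't need a full DOM tree for that. The HTMLParser module (new
in 2.2) lets you react only to the tags you care about and ignore
everything else, which is also far cheaper than building a tree for the
whole page. A minimal sketch, untested and with invented names:

    from HTMLParser import HTMLParser

    class FormParser(HTMLParser):
        """Collect only <form> elements and the <input>s inside them."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.forms = []
            self.current = None

        def handle_starttag(self, tag, attrs):
            if tag == 'form':
                self.current = {'attrs': dict(attrs), 'inputs': []}
            elif tag == 'input' and self.current is not None:
                self.current['inputs'].append(dict(attrs))

        def handle_endtag(self, tag):
            if tag == 'form' and self.current is not None:
                self.forms.append(self.current)
                self.current = None

    p = FormParser()
    p.feed('<form action="/s"><input name="q"></form>')
    p.close()
    print p.forms

sgmllib (on which htmllib is built) is more forgiving of sloppy
real-world HTML, and the same trick works there too.
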
>
> I'll clean it up a little tonight and drop it somewhere if you want to
> look at it.

Jeff:

I think this is excellent progress, and would love to see the code (though I
might not have time until Monday, so don't bust a gut).

If you want some code as a start on the cookies, I have some lying around,
though it won't give you everything. It does, though, at least read & write
cookies. There's also an example using htmllib to extract table information
only, which might readily adapt to your needs (or at least give you food
for thought).

Mail me if you need either of those.
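
For what it's worth, here's roughly how I'd picture wiring a jar like the
one sketched above into urllib2 so that the whole thing stays transparent.
Since http_open is called for every hop, redirected requests get their
cookies too, which answers your point about AbstractHTTPHandler. Again
untested, and the handler name is my own:

    import urllib2

    class CookieHandler(urllib2.HTTPHandler):
        """Add a Cookie header on the way out; scrape Set-Cookie headers
        from the response on the way back."""
        def __init__(self, jar):
            self.jar = jar

        def http_open(self, req):
            header = self.jar.cookie_header(req.get_full_url())
            if header:
                req.add_header('Cookie', header)
            response = urllib2.HTTPHandler.http_open(self, req)
            self.jar.store(req.get_full_url(),
                           response.info().getheaders('Set-Cookie'))
            return response

    # Usage: every request made through this opener shares one jar.
    # opener = urllib2.build_opener(CookieHandler(CookieJar()))
    # print opener.open('http://www.example.com/').read()
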

--

home: http://www.holdenweb.com/
Python Web Programming:
http://pydish.holdenweb.com/pwp/