Problem with urllib.py

John J. Lee jjl at pobox.com
Sat Jul 24 08:09:51 EDT 2004


"Volker M." <spooky0815 at gmx.de> writes:

> I want to open a list of URLs automatically with Python's urllib and
> the function open(URL). It is important that the program opens ONLY
> normal http sites and no https sites with user/password requests.
> So is there a way to cancel all site requests that bring up
> user/password dialogues?

Assuming you mean you don't want to handle Basic HTTP Authentication
(and you don't care whether http or https), you can use
urllib2.urlopen() instead of urllib.urlopen(). You will then get a
urllib2.HTTPError with a .code of 401 when a site wants Basic
Authentication.
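
A minimal sketch of that 401 check, written for Python 3, where
urllib2's contents live in urllib.request and urllib.error. The helper
name fetch_skipping_auth and its opener parameter are made up for
illustration and are not part of either library:

```python
import urllib.error
import urllib.request


def fetch_skipping_auth(url, opener=None):
    """Return the response body, or None if the site asks for authentication.

    Hypothetical helper: `opener` defaults to a plain build_opener().
    """
    if opener is None:
        opener = urllib.request.build_opener()
    try:
        response = opener.open(url)
    except urllib.error.HTTPError as e:
        if e.code == 401:
            # The site wants authentication: skip it instead of prompting.
            return None
        # Any other HTTP error (404, 500, ...) is left to the caller.
        raise
    try:
        return response.read()
    finally:
        response.close()
```

Only the 401 case is swallowed here; whether other errors should also
be skipped depends on what the robot is meant to do.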

If you do mean https, though, again with urllib2:

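import urllib2

class NullHTTPSHandler(urllib2.HTTPSHandler):
    # Returning None tells the opener that this handler did not handle
    # the request, so https URLs are never actually fetched.
    def https_open(self, request):
        return None

o = urllib2.build_opener(NullHTTPSHandler())

response = o.open(url)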
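
For what it's worth, here is roughly how the same idea looks in
Python 3 (urllib.request in place of urllib2), including what the
caller sees: since no handler deals with the https request, the opener
falls back to its "unknown url type" handling and raises URLError. The
try_open() helper is a hypothetical name, and printing the reason is
just one way to report a skipped URL:

```python
import urllib.error
import urllib.request


class NullHTTPSHandler(urllib.request.HTTPSHandler):
    def https_open(self, request):
        # None means "not handled"; this subclass replaces the default
        # HTTPSHandler, so nothing else will open https URLs either.
        return None


opener = urllib.request.build_opener(NullHTTPSHandler())


def try_open(url):
    """Hypothetical helper: return the response, or None if opening failed."""
    try:
        return opener.open(url)
    except urllib.error.URLError as e:
        # URLError also covers HTTPError (e.g. the 401 case) and
        # connection failures, not only the refused https scheme.
        print("skipped %s: %s" % (url, e.reason))
        return None
```

No connection is attempted for https URLs, so try_open() on one of
them returns None straight away.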


In general, urllib2 splits up the job of opening URLs into handlers,
so it's more 'turn-off-and-on-able' than urllib.
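
To illustrate the on/off idea from the other direction, one could
build an opener by hand and add only the handlers that are wanted.
This is a sketch for Python 3's urllib.request, not the only way to do
it; build_http_only_opener is a made-up name, and the choice of
handlers is an assumption about what a plain-http robot needs:

```python
import urllib.error
import urllib.request


def build_http_only_opener():
    """Hypothetical helper: an opener that knows about plain http only."""
    opener = urllib.request.OpenerDirector()
    for handler_class in (
        urllib.request.HTTPHandler,              # opens http:// URLs
        urllib.request.HTTPRedirectHandler,      # follows 3xx redirects
        urllib.request.HTTPErrorProcessor,       # routes non-2xx responses to error handlers
        urllib.request.HTTPDefaultErrorHandler,  # raises HTTPError for unhandled ones
        urllib.request.UnknownHandler,           # raises URLError for other schemes
    ):
        opener.add_handler(handler_class())
    return opener
```

With no https, ftp or file handler installed, those schemes end in
URLError, and that includes an http page that redirects to https.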

Since you're writing a robot, one other thing: the alpha version of my
ClientCookie package (urllib2-replacement with addons) contains code
for obeying robots.txt files (albeit not yet well tested, IIRC):

import ClientCookie
o = ClientCookie.build_opener(ClientCookie.HTTPRobotRulesProcessor())

response = o.open(url)
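
If ClientCookie isn't an option, a rough equivalent of the robots.txt
check can be sketched with the standard library's robot parser
(urllib.robotparser in Python 3). This is a different technique from
HTTPRobotRulesProcessor: nothing is hooked into the opener, and the
robots.txt text is passed in by the caller. The function name allowed()
and the "mybot" user agent below are made up for illustration:

```python
import urllib.robotparser


def allowed(robots_txt, user_agent, url):
    """Hypothetical helper: may `user_agent` fetch `url` under these rules?"""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "mybot", "http://example.com/index.html"))
print(allowed(rules, "mybot", "http://example.com/private/page.html"))
```

A robot would call something like this before each o.open(url) and
skip the URL when it returns False.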


Some time soon I'll have to make a distribution of this stuff that
works properly with 2.4 (which includes changes to urllib2 from
ClientCookie)...


John


