blocking forever with urllib

Oleg Broytmann phd at phd.pp.ru
Fri Aug 31 03:56:41 EDT 2001


On Thu, 30 Aug 2001, Michael P. Soulier wrote:
>     I'm writing a web crawler as an exercise, using urllib and htmllib to
> recursively crawl through the pages. Whenever urllib.urlopen() throws an
> IOError exception the url gets flagged as a broken link.
>
>     Unfortunately, urllib.urlopen() is blocking for some time on one URL. When
> I do an nslookup on it, it times out within a few seconds, since it's a URL
> from our internal intranet at work and is not accessible from the internet.
> However, urllib.urlopen() takes forever to return.
>
>     Is there a way to specify a timeout for this library? I can't find a way
> in the documentation.

   There is no way. This is really a TCP/IP problem. There are several
solutions: signals (unfortunately, signal handling is synchronous in
Python), doing the work in a separate process or thread (forking or
threading), or non-blocking sockets (select/poll, asyncore).
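
   For illustration, here is a minimal sketch of the first approach: a
SIGALRM-based timeout wrapped around urllib.urlopen (Unix only, and only
in the main thread). The names urlopen_with_timeout and TimeoutError are
mine, not part of urllib:

    import signal, urllib

    class TimeoutError(Exception):
        pass

    def _raise_timeout(signum, frame):
        raise TimeoutError

    def urlopen_with_timeout(url, timeout=30):
        # SIGALRM interrupts the blocking connect/read inside urlopen;
        # the handler then raises our exception.
        old = signal.signal(signal.SIGALRM, _raise_timeout)
        signal.alarm(timeout)
        try:
            return urllib.urlopen(url)
        finally:
            signal.alarm(0)                     # cancel the pending alarm
            signal.signal(signal.SIGALRM, old)  # restore the old handler

   Your crawler can then catch TimeoutError next to IOError and flag the
URL as broken.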

   You may be interested in looking at my (similar) project. The
following is a detailed announcement:

                    Bookmarks Database and Internet Robot

WHAT IS IT
   This is a set of classes, libraries, programs and plugins I use to
manipulate my bookmarks.html. I like Netscape Navigator, but I need more
features, so I wrote and maintain these programs for my own needs. In
particular, I want to extend Navigator's "What's new" feature (Navigator 4
calls it "Update bookmarks").


WHAT'S NEW in version 3.2.1
   Minor improvement in the robot wrapper check_urls.py - it now sums up
all traffic and reports the total at the end of the run.


WHAT'S NEW in version 3.0
   Complete rewrite from scratch. Created a mechanism for pluggable
storage managers, writers (DB dumpers/exporters) and robots.


WHERE TO GET
   Master site: http://phd.pp.ru/Software/Python/#bookmarks_db

   Faster mirrors: http://phd2.chat.ru/Software/Python/#bookmarks_db
   http://www.crosswinds.net/~phd2/Software/Python/#bookmarks_db


AUTHOR
   Oleg Broytmann <phd2 at earthling.net>

COPYRIGHT
   Copyright (C) 1997-2001 PhiloSoft Design

LICENSE
   GPL

STATUS
   Storage managers: pickle, FLAD (Flat ASCII Database).
   Writers: HTML, text, FLAD (full database or only errors).
   Robots (URL checkers): simple, simple+timeoutsocket, forking (see the
sketch below).
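
   For the curious, here is a rough sketch of the forking idea - my own
illustration, not code from the package. The parent forks a child per
URL and kills it when it exceeds a deadline, so one hung lookup cannot
stall the whole run:

    import os, signal, time, urllib

    def check_url_forking(url, timeout=30):
        pid = os.fork()
        if pid == 0:                       # child: do the blocking fetch
            try:
                urllib.urlopen(url).close()
                os._exit(0)                # reachable
            except IOError:
                os._exit(1)                # broken link
        deadline = time.time() + timeout
        while time.time() < deadline:      # parent: poll for the child
            wpid, status = os.waitpid(pid, os.WNOHANG)
            if wpid == pid:
                return os.WEXITSTATUS(status) == 0
            time.sleep(0.5)
        os.kill(pid, signal.SIGKILL)       # child hung - kill and reap it
        os.waitpid(pid, 0)
        return 0                           # a timeout counts as broken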

TODO
   Documentation.

   Merge "writers" into storage managers.
   New storage managers: shelve, SQL, ZODB, MetaKit.
   Robots (URL checkers): threading, asyncore-based.
   Aliases in bookmarks.html.

   A configuration file for defaults - global defaults for the system
   and local defaults for subsystems.

   Parse the downloaded files and extract additional information from the
   headers and parsed data - the title, for example, or redirects given
   via <META HTTP-EQUIV> (see the sketch after this list).

   Ruleset-based mechanisms for filtering which URLs to check - filtering
   based on URL scheme, host, port, path, filename, extension, etc.

   Detailed reports on a robot run - what's old, what's new, what was
   moved, errors, etc.
   WWW-interface to the reports.

   Bigger database. Multiuser database. The robot should be able to
   operate on a part of the DB.
   WWW-interface to the database. Users will import/export/edit bookmarks,
   schedule robot runs, etc.
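
   The title/redirect extraction mentioned above could be done with the
standard sgmllib module; the following is only my sketch of the idea,
not code from the package:

    import sgmllib

    class PageInfoParser(sgmllib.SGMLParser):
        """Collect the <title> text and any META Refresh redirect."""
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.title = ''
            self.refresh = None
            self.in_title = 0
        def start_title(self, attrs):
            self.in_title = 1
        def end_title(self):
            self.in_title = 0
        def handle_data(self, data):
            if self.in_title:
                self.title = self.title + data
        def do_meta(self, attrs):
            # attrs is a list of (name, value) pairs, names lowercased
            d = {}
            for name, value in attrs:
                d[name] = value
            if d.get('http-equiv', '').lower() == 'refresh':
                self.refresh = d.get('content')

   Feed it the downloaded data with feed(), call close(), then read
parser.title and parser.refresh.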

Oleg.
----
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.
