blocking forever with urllib
Oleg Broytmann
phd at phd.pp.ru
Fri Aug 31 03:56:41 EDT 2001
On Thu, 30 Aug 2001, Michael P. Soulier wrote:
> I'm writing a web crawler as an exercise, using urllib and htmllib to
> recursively crawl through the pages. Whenever urllib.urlopen() throws an
> IOError exception the url gets flagged as a broken link.
>
> Unfortunately, urllib.urlopen() is blocking for some time on one URL. When
> I do an nslookup on it, it times out within a few seconds, since it's a URL
> from our internal intranet at work and is not accessible from the internet.
> However, urllib.urlopen() takes forever to return.
>
> Is there a way to specify a timeout for this library? I can't find a way
> in the documentation.
There is no way. This is really a TCP/IP problem. There are several solutions:
signals (unfortunately, signals are synchronous in Python), multiprocessing
(forking or threading), and non-blocking sockets (select/poll, asyncore).
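To illustrate the signal approach: a minimal sketch, written against modern Python (urllib.request rather than the old urllib module the post discusses). The names Timeout, with_timeout and the handler are invented for this example, and SIGALRM is Unix-only.

```python
import signal
import urllib.request  # urllib.urlopen in Python 2; urllib.request today

class Timeout(Exception):
    """Raised when the wrapped call exceeds the time limit."""

def _alarm_handler(signum, frame):
    raise Timeout

def with_timeout(seconds, func, *args):
    # SIGALRM exists only on Unix, and the handler runs only in the
    # main thread -- this is the "signals are synchronous" caveat.
    old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        return func(*args)
    finally:
        signal.alarm(0)                       # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

With this, something like with_timeout(10, urllib.request.urlopen, url) raises Timeout after 10 seconds instead of hanging forever on an unreachable host.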
You may be interested in looking at my (similar) project. The following
is a detailed announcement:
Bookmarks Database and Internet Robot
WHAT IS IT
This is a set of classes, libraries, programs and plugins I use to
manipulate my bookmarks.html. I like Netscape Navigator, but I need more
features, so I wrote and maintain these programs for my own needs. In
particular, I need to extend Navigator's "What's new" feature (Navigator 4
calls it "Update bookmarks").
WHAT'S NEW in version 3.2.1
Minor improvement in the robot wrapper check_urls.py - it now sums up all
traffic and reports it at the end of the run.
WHAT'S NEW in version 3.0
Complete rewrite from scratch. Created mechanism for pluggable storage
managers, writers (DB dumpers/exporters) and robots.
WHERE TO GET
Master site: http://phd.pp.ru/Software/Python/#bookmarks_db
Faster mirrors: http://phd2.chat.ru/Software/Python/#bookmarks_db
http://www.crosswinds.net/~phd2/Software/Python/#bookmarks_db
AUTHOR
Oleg Broytmann <phd2 at earthling.net>
COPYRIGHT
Copyright (C) 1997-2001 PhiloSoft Design
LICENSE
GPL
STATUS
Storage managers: pickle, FLAD (Flat ASCII Database).
Writers: HTML, text, FLAD (full database or only errors).
Robots (URL checkers): simple, simple+timeoutsocket, forking.
TODO
Documentation.
Merge "writers" to storage managers.
New storage managers: shelve, SQL, ZODB, MetaKit.
Robots (URL checkers): threading, asyncore-based.
Aliases in bookmarks.html.
Configuration file for configuring defaults - global defaults for the system
and local defaults for subsystems.
Parse downloaded file and get some additional information out of headers
and parsed data - title, for example. Or redirects using <META HTTP-Equiv>.
Ruleset-based mechanisms to filter out what types of URLs to check: checking
based on URL schema, host, port, path, filename, extension, etc.
Detailed reports on robot run - what's old, what's new, what was moved,
errors, etc.
WWW-interface to the report.
Bigger database. Multiuser database. Robot should operate on a part of
the DB.
WWW-interface to the database. User will import/export/edit bookmarks,
schedule robot run, etc.
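As an aside on the TODO item about parsing downloaded data for titles and <META HTTP-Equiv> redirects: a minimal sketch using the standard HTML parser (html.parser in modern Python; htmllib at the time of this post). TitleParser and its attribute names are invented for the example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Pull out <title> text and any <meta http-equiv="refresh"> target."""

    def __init__(self):
        super().__init__()
        self.title = None        # text of the first <title> element
        self.refresh = None      # URL from a meta refresh, if present
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)      # HTMLParser lowercases tag/attr names
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('http-equiv', '').lower() == 'refresh':
            # content typically looks like "0; url=http://example.com/"
            content = attrs.get('content', '')
            pos = content.lower().find('url=')
            if pos != -1:
                self.refresh = content[pos + 4:].strip()

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

parser = TitleParser()
parser.feed('<html><head><title>Example</title>'
            '<meta http-equiv="refresh" content="0; url=http://example.com/">'
            '</head></html>')
```

After feeding a page, parser.title and parser.refresh give the robot the extra information the TODO describes without fetching anything twice.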
Oleg.
----
Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.