[ann] CGI Link Checker 0.1

Christopher T King squirrel at WPI.EDU
Tue Jul 13 10:59:28 EDT 2004


On Tue, 13 Jul 2004, Adayapalam Appaiah Kumaraswamy wrote:

> Dear Python users,
> I am new to Python. As I learnt a bit more about coding in Python, I
> decided to try out a simple project: to write a CGI script in Python to
> check links on a single HTML page on the web. Although I am just a hobby
> programmer, I thought I could show it to others and ask for their
> comments and suggestions. It is my first CGI script as well as my first
> Python application, so you might find the coding immature. Please do
> correct me wherever necessary.

First off, good job for a first script! The interface looks very 
professional, and your code is very clean.

> I had to face the following problems:
> 
> 1. Delayed responses for large pages: I worked around this by flushing
> sys.stdout after every three links checked; that might lead to
> inefficiency, but it does throw the results three at a time to the
> impatient user. Otherwise, the Python interpreter would wait until the
> output buffer was full before sending it to the web server's output.

You could probably flush the buffer after each link is checked; this
shouldn't cause any noticeable overhead (the time spent checking the links
will greatly overshadow the time spent flushing the buffer), assuming the
web server doesn't do any per-flush processing (which it might, if you are
using server-side includes).
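
For example, a minimal sketch, assuming the script prints one result per
link (check_link and format_result are hypothetical stand-ins for your
own routines):

    import sys

    for url in links:
        # check_link/format_result: stand-ins for the script's own logic
        print format_result(url, check_link(url))
        sys.stdout.flush()  # push this result to the web server right away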

> 2. Slow: I don't know how to make the script perform better. I've tried
> looking through the code for ways to make it run faster, but I couldn't
> find any.

For the same reason as above (the time is mostly spent checking the links),
I don't think tweaking the code will help much in this case. I was going to
suggest checking whether urllib2 does read-ahead buffering, but a quick
check reveals it doesn't... perhaps the culprit is the HTML parsing?
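
To find out, you could time the fetch and the parse separately; a rough
sketch (parse_links is a hypothetical stand-in for your parsing routine):

    import time
    import urllib2

    start = time.time()
    data = urllib2.urlopen(url).read()  # fetch the whole page
    fetched = time.time()
    parse_links(data)                   # stand-in for the script's parser
    done = time.time()
    print "fetch: %.2fs  parse: %.2fs" % (fetched - start, done - fetched)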

> 3. HTML parsing: I have made no attempt to (and I do not propose to)
> check pages with incorrect HTML/XHTML. This means that if the Python
> HTMLParser fails, my script exits gracefully. An example of invalid HTML
> is http://www.yahoo.com/.

I've seen the BeautifulSoup module recommended before as a parser that
gracefully handles malformed HTML. It may even be faster than
HTMLParser (but that's just a guess). The homepage is
http://www.crummy.com/software/BeautifulSoup/, but it doesn't seem to be
up right now.
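
For reference, a minimal link-extraction sketch using the feed()/fetch()
style of the early BeautifulSoup API (the exact calls may differ between
versions, so treat this as a rough guide rather than gospel):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup()
    soup.feed(html)            # parses even malformed HTML without raising
    for tag in soup.fetch('a'):
        print tag['href']      # tag attributes are dict-style; exact API
                               # may vary between BeautifulSoup versions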

> Finally, since this is my first Python program, I might not have
> properly adapted to the style of programming that experienced Python
> users are accustomed to. So please correct me in this regard as well.

No corrections needed :)



