[ann] CGI Link Checker 0.1

Tue Jul 13 06:52:28 EDT 2004

Dear Python users,
I am new to Python. As I learnt a bit more about coding in Python, I
decided to try out a simple project: to write a CGI script in Python to
check links on a single HTML page on the web. Although I am just a hobby
programmer, I thought I could show it to others and ask for their 
comments and suggestions. It is my first CGI script as well as my first 
Python application, so you might find the coding immature. Please do 
correct me wherever necessary.

I looked around the net, but found very little on link checking in
Python. So, I thought I could write a no-frills checker myself.

BTW the W3C Link Checker is written in Perl. I don't know Perl, so I
couldn't look at it for ideas.

I had to face the following problems:

1. Delayed responses for large pages: I worked around this by flushing
sys.stdout after every three links checked. That might be inefficient,
but it does deliver the results three at a time to the impatient user.
Otherwise, the Python interpreter would wait until the output buffer is
full before passing it on to the web server.

2. Slow: I don't know how to make the script perform better. I've tried
to look into the code to make it run faster, but I couldn't do so. Also,
I think the hosting server's bandwidth may contribute to this. Still,
it takes only about 5 to 10 seconds more than the W3C validator for very
large pages, and 2 to 3 seconds more for smaller ones. Your results may
vary; I'd love to know what you see.

3. HTML parsing: I have made no attempt to (and I do not propose to)
check pages with incorrect HTML/XHTML. This means that if the Python
HTMLParser fails, my script exits gracefully. An example of invalid HTML
is http://www.yahoo.com/.
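
For readers who would like to see points 1 and 3 in code, here is a
minimal sketch. It is not taken from the cgilink source: the names
LinkCollector, extract_links, report and the check callback are all
hypothetical, and it is written for present-day Python 3 (html.parser)
rather than the Python 2 HTMLParser module of 2004. The current parser is
lenient and rarely raises, so the catch-all except clause is only an
assumption about how a graceful exit could look.

```python
import sys
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the links found in html_text, or None if parsing fails."""
    collector = LinkCollector()
    try:
        collector.feed(html_text)
        collector.close()
    except Exception:
        # Assumption: treat any parser error as "invalid HTML" so the
        # caller can print a message and exit instead of crashing.
        return None
    return collector.links


def report(links, check, out=sys.stdout, batch=3):
    """Write one result line per link, flushing after every `batch` links."""
    for count, link in enumerate(links, start=1):
        out.write("%s - %s\n" % (link, check(link)))
        if count % batch == 0:
            out.flush()  # hand the results so far to the web server
    out.flush()  # send whatever is left over at the end
```

Here check stands for whatever function actually fetches a link and
returns a status string; a caller would first test whether extract_links
returned None and print an "invalid HTML" message in that case.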

Finally, since this is my first Python program, I might not have
properly adapted to the style of programming experienced Python users
may be accustomed to. So, I request you to please correct me in this
regard as well.

In all, it was a good experience, and gave me more than a glimpse of
the power offered by Python.

Please read the instructions on the page before entering your URL to
test the script. Remember to enter the link starting with http:// and
don't forget to add the trailing slash (/) for links which end in a
directory, like

http://myserver/my/dir/
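
One likely reason the trailing slash matters (this is an assumption; the
script's own handling is not described here) is the way relative links
on a page are resolved against the page URL. A small sketch using the
standard urljoin function and a hypothetical relative link page.html:

```python
from urllib.parse import urljoin

# Hypothetical relative link resolved against the page URL,
# first without and then with the trailing slash.
without_slash = urljoin("http://myserver/my/dir", "page.html")
with_slash = urljoin("http://myserver/my/dir/", "page.html")

print(without_slash)  # http://myserver/my/page.html
print(with_slash)     # http://myserver/my/dir/page.html
```

Without the slash, "dir" is treated as a file name and dropped, so every
relative link on the page would point one level too high.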

You can try the script at:

http://kumar.travisbsd.org/pyprogs/example.html

Personally, I have tried the following sites with this script:
http://www.w3.org/ - Works perfectly.
http://www.yahoo.com/ - Invalid HTML. Exits gracefully.

Source code only (meaning without the fancy images and CSS I have used):
http://kumar.travisbsd.org/pyprogs/cgilink.txt

If you want to try hosting the script on your own server, get this and
see the README (This includes all the images and fancy CSS):
http://kumar.travisbsd.org/pyprogs/cgilink-0.1.tar.gz

Thank you.
Kumar

-- 
Adayapalam Appaiah Kumaraswamy
(Kumar Appaiah)

Web: http://www.ee.iitm.ac.in/~ee03b091/