fetching webpage

yookyung ykjo at cs.cornell.edu
Thu Dec 29 21:20:04 EST 2005


I am trying to crawl web pages in the CiteSeer domain (a collection of research 
papers, mostly in computer science).

I have used the following code snippet.

#####
import urllib

sock = urllib.urlopen("http://citeseer.ist.psu.edu")
webcontent = sock.read().split('\n')
sock.close()
print webcontent
########

Instead of the page content, I get the following server error page:


['<!--#set var="TITLE" value="Server error!"', '--><!--#include 
virtual="include/top.html" -->', '', '  <!--#if 
expr="$REDIRECT_ERROR_NOTES" -->', '', '    The server encountered an 
internal error and was ', '    unable to complete your request.', '', ' 
<!--#include virtual="include/spacer.html" -->', '', '    Error message:', ' 
<br /><!--#echo encoding="none" var="REDIRECT_ERROR_NOTES" -->', '', ' 
<!--#else -->', '', '    The server encountered an internal error and was ', 
'    unable to complete your request. Either the server is', '    overloaded 
or there was an error in a CGI script.', '', '  <!--#endif -->', '', 
'<!--#include virtual="include/bottom.html" -->', '']

However, the URL is valid, and it works fine if I open it in my web 
browser.
If I use a different URL (http://www.google.com instead of 
http://citeseer.ist.psu.edu),
the same code works.

What is wrong?
Could it be that the CiteSeer web server inspects the HTTP request, sees 
something it doesn't like, and rejects the request?
What should I do?
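
One way to test that guess is sketched below. It assumes, without any 
confirmation, that the server reacts to the User-Agent header, and it sends a 
browser-like value instead of the library default. The sketch uses 
urllib.request from Python 3 (the successor to the urllib module in the 
snippet above). The names build_request, fetch, and BROWSER_LIKE_AGENT are 
made up for illustration, and so is the User-Agent string.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
BROWSER_LIKE_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def build_request(url, user_agent=BROWSER_LIKE_AGENT):
    """Return a Request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_LIKE_AGENT, timeout=10):
    """Return (status_code, list_of_lines) for the given URL.

    HTTP error responses (4xx/5xx) are returned rather than raised, so the
    status code and the error page can be inspected together.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status, body = response.status, response.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the error object still carries the body.
        status, body = err.code, err.read()
    return status, body.decode("utf-8", errors="replace").split("\n")
```

Calling fetch("http://citeseer.ist.psu.edu") once with the default argument 
and once with a different user_agent value would show whether the status code 
changes with the header. If it does not, the cause is probably something else 
in the request or on the server side.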

Thank you.

Best regards,
Yookyung








More information about the Python-list mailing list