A problem while using urllib

Steve Holden steve at holdenweb.com
Wed Oct 12 04:25:05 EDT 2005


Johnny Lee wrote:
> Steve Holden wrote:
> 
>>Johnny Lee wrote:
>>
>>>Alex Martelli wrote:
>>>
>>>
>>>>Johnny Lee <johnnyandfiona at hotmail.com> wrote:
>>>>  ...
>>>>
>>>>
>>>>>  try:
>>>>>     webPage = urllib2.urlopen(url)
>>>>>  except urllib2.URLError:
>>>>
>>>>  ...
>>>>
>>>>
>>>>>  webPage.close()
>>>>>  return True
>>>>>----------------------------------------------------
>>>>>
>>>>>  But every time, once 70 to 75 URLs have been tested this way,
>>>>>the program breaks and every URL left raises urllib2.URLError
>>>>>until the program exits. I tried many ways to work it out: using
>>>>>urllib, putting a sleep(1) in the filter (I thought the sheer
>>>>>number of URLs was crashing the program). But none of them works.
>>>>>BTW, if I set the URL at which the program crashed as the base
>>>>>URL, the program still crashes around the 70th-75th URL. How can
>>>>>I solve this problem? Thanks for your help.
>>>>
>>>>Sure looks like a resource leak somewhere (probably leaving a file open
>>>>until your program hits some wall of maximum simultaneously open files),
>>>>but I can't reproduce it here (MacOSX, tried both Python 2.3.5 and
>>>>2.4.1).  What version of Python are you using, and on what platform?
>>>>Maybe a simple Python upgrade might fix your problem...
>>>>
>>>>
>>>>Alex
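
To make Alex's point concrete: the object urlopen() returns holds an
open socket until it is explicitly closed, so an early return or an
exception between urlopen() and close() can leak one descriptor per
URL. A minimal sketch of the safe shape (checkUrl and the True/False
convention are my guesses at the code under discussion, not the real
thing):

     import urllib2

     def checkUrl(url):
         try:
             webPage = urllib2.urlopen(url)
         except urllib2.URLError:
             return False
         try:
             webPage.read()     # consume the body before closing
         finally:
             webPage.close()    # release the socket even if read() fails
         return True
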
>>>
>>>
>>>Thanks for the info you provided. I'm using 2.4.1 on Cygwin under
>>>WinXP. If you want to reproduce the problem, I can send you the
>>>source.
>>>
>>>This morning I found that this is caused by urllib2. When I use
>>>urllib instead of urllib2, it doesn't crash any more. But the
>>>trouble is that I want to catch the HTTP 404 error, which
>>>FancyURLopener handles internally in urllib.urlopen(), so I can't
>>>catch it.
>>>
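
Incidentally, urllib2 should let you catch the 404 directly: for
error status codes urlopen() raises urllib2.HTTPError, which is a
subclass of urllib2.URLError, so you can test its code attribute. A
rough sketch (the URL is made up):

     import urllib2

     url = 'http://www.example.com/no-such-page'   # made-up example
     try:
         webPage = urllib2.urlopen(url)
     except urllib2.HTTPError, e:
         print 'HTTP error %d fetching %s' % (e.code, url)
     except urllib2.URLError, e:
         print 'Failed to reach server:', e.reason
     else:
         webPage.close()
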
>>
>>I'm using exactly that configuration, so if you let me have that source
>>I could take a look at it for you.
>>
[...]
> 
> I've sent the source, thanks for your help.
> 
[...]
Preliminary result, in case this rings bells with people who use urllib2 
quite a lot. I modified the error case to report the actual message 
carried by the exception, and I'm seeing things like:

http://www.holdenweb.com/./Python/webframeworks.html
    Message: <urlopen error (120, 'Operation already in progress')>
Start process 
http://www.amazon.com/exec/obidos/ASIN/0596001886/steveholden-20
Error: IOError while parsing 
http://www.amazon.com/exec/obidos/ASIN/0596001886/steveholden-20
    Message: <urlopen error (120, 'Operation already in progress')>
    .
    .
    .
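
(The change was nothing clever -- roughly the following, with the
function name approximated, not the checker's actual code:)

     import urllib2

     def checkUrl(url):
         try:
             webPage = urllib2.urlopen(url)
         except urllib2.URLError, e:
             # str(e) includes the underlying socket error tuple
             print 'Error: IOError while parsing', url
             print '    Message:', e
             return False
         webPage.close()
         return True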

So at least we now know what the error is, and it looks like some sort 
of resource limit (though why only on Cygwin beats me) ... any ideas, 
before I start some serious debugging?

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/



