begin to parse a web page not entirely downloaded

Thu Feb 8 14:06:34 EST 2007

On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:

> On Feb 8, 6:54 pm, Leif K-Brooks <eurl... at ecritters.biz> wrote:
>> k0mp wrote:
>> > Is there a way to retrieve a web page and before it is entirely
>> > downloaded, begin to test if a specific string is present and if yes
>> > stop the download ?
>> > I believe that urllib.openurl(url) will retrieve the whole page before
>> > the program goes to the next statement.
>>
>> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>
>>  >>> foo = urllib.urlopen('http://google.com')
>>  >>> foo.read(512)
>> '<html><head> ...
>>
>> foo.read(512) will return as soon as 512 bytes have been received. You
>> can keep calling it until it returns an empty string, indicating that
>> there's no more data to be read.
> 
> Thanks for your answer :)
> 
> I'm not sure that read() works as you say.
> Here is a test I've done :
> 
> import urllib2
> import re
> import time
> 
> CHUNKSIZE = 1024
> 
> print 'f.read(CHUNK)'
> print time.clock()
> 
> for i in range(30) :
>     f = urllib2.urlopen('http://google.com')
>     while True:               # read the page using a loop
>         chunk = f.read(CHUNKSIZE)
>         if not chunk: break
>         m = re.search('<html>', chunk )
>         if m != None :
>             break
> 
> print time.clock()
> 
> print
> 
> print 'f.read()'
> print time.clock()
> for i in range(30) :
>     f = urllib2.urlopen('http://google.com')
>     m = re.search('<html>', f.read() )
>     if m != None :
>         break

A fair comparison would use "pass" here, or a while loop as in the
other case. As written, this break leaves the for loop after the first
iteration, so the test compares 30 runs of read(CHUNKSIZE) against a
single read().
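
For illustration, here is a minimal sketch of a comparison in which both
variants run the same number of times. It is written for Python 3 and
makes several assumptions that are not in the original test: an in-memory
io.BytesIO stands in for the object returned by urlopen(), so no network
is involved, and the names contains_chunked, contains_full, time_runs and
open_fake_page are hypothetical helpers made up for this example. The
chunked search also keeps a short tail of the previous chunk, since a
string split across two chunks would otherwise be missed.

```python
import io
import time


def contains_chunked(stream, needle, chunk_size=1024):
    """Read stream in chunks; return True as soon as needle is seen."""
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:              # empty result: no more data to read
            return False
        window = tail + chunk
        if needle in window:
            return True            # stop early; the rest is never read
        # Keep the last len(needle) - 1 bytes so that a match split
        # across two chunks is still found on the next pass.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""


def contains_full(stream, needle):
    """Read the whole stream, then search it."""
    return needle in stream.read()


def time_runs(check, open_stream, runs=30):
    """Time `runs` calls of check() on a fresh stream each time."""
    start = time.perf_counter()
    for _ in range(runs):
        check(open_stream())       # no break here: every run executes
    return time.perf_counter() - start


# Assumption: a fake page held in memory replaces the real download.
FAKE_PAGE = b"<html><head>" + b"x" * 200000


def open_fake_page():
    return io.BytesIO(FAKE_PAGE)


if __name__ == "__main__":
    needle = b"<html>"
    t_chunked = time_runs(lambda s: contains_chunked(s, needle), open_fake_page)
    t_full = time_runs(lambda s: contains_full(s, needle), open_fake_page)
    print("read(CHUNKSIZE):", t_chunked)
    print("read():         ", t_full)
```

To try this against a real server, open_fake_page could be replaced by a
function that returns urllib.request.urlopen(url). The timings would then
mostly reflect network latency and whatever buffering happens below
read(), so the numbers from the in-memory version say little about that
case.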

Björn