begin to parse a web page not entirely downloaded
Björn Steinbrink
B.Steinbrink at gmx.de
Thu Feb 8 14:06:34 EST 2007
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
> On Feb 8, 6:54 pm, Leif K-Brooks <eurl... at ecritters.biz> wrote:
>> k0mp wrote:
>> > Is there a way to retrieve a web page and before it is entirely
>> > downloaded, begin to test if a specific string is present and if yes
>> > stop the download ?
>> > I believe that urllib.openurl(url) will retrieve the whole page before
>> > the program goes to the next statement.
>>
>> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>
>> >>> foo = urllib.urlopen('http://google.com')
>> >>> foo.read(512)
>> '<html><head> ...
>>
>> foo.read(512) will return as soon as 512 bytes have been received. You
>> can keep calling it until it returns an empty string, indicating that
>> there's no more data to be read.
>
> Thanks for your answer :)
>
> I'm not sure that read() works as you say.
> Here is a test I've done :
>
> import urllib2
> import re
> import time
>
> CHUNKSIZE = 1024
>
> print 'f.read(CHUNK)'
> print time.clock()
>
> for i in range(30) :
>     f = urllib2.urlopen('http://google.com')
>     while True:   # read the page using a loop
>         chunk = f.read(CHUNKSIZE)
>         if not chunk: break
>         m = re.search('<html>', chunk )
>         if m != None :
>             break
>
> print time.clock()
>
> print
>
> print 'f.read()'
> print time.clock()
> for i in range(30) :
>     f = urllib2.urlopen('http://google.com')
>     m = re.search('<html>', f.read() )
>     if m != None :
>         break
A fair comparison would use "pass" here, or a while loop as in the
other case. The way it is, the break exits the for loop on the first
match, so it compares 30 runs of the read(CHUNKSIZE) loop against a
single read().
Björn
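
For illustration, here is a minimal sketch of a fairer version of the test, written for Python 3 rather than the Python 2 used in the thread. The names find_in_stream and time_runs, and the in-memory page, are assumptions made for this example and do not come from the original posts. Both timings run all 30 iterations, and the chunked search keeps a short tail of the previous chunk so that a match split across two chunks is not missed (the per-chunk re.search in the quoted test could miss one). time.perf_counter() stands in for time.clock(), which no longer exists in current Python 3.

```python
import io
import time


def find_in_stream(stream, needle, chunk_size=1024):
    """Read stream in chunks and return True as soon as needle is seen.

    Reading stops at the first match. The last len(needle) - 1 bytes of
    the data seen so far are carried over, so a match that straddles two
    chunks is still found.
    """
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object: no more data
            return False
        window = tail + chunk
        if needle in window:
            return True
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""


def time_runs(open_stream, search, runs=30):
    """Time `runs` full iterations, with no early exit from the outer loop."""
    start = time.perf_counter()
    for _ in range(runs):
        search(open_stream())
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical stand-in for a downloaded page. To test against a real
    # server, open_page could return urllib.request.urlopen(url) instead.
    page = b"<html><head>" + b"x" * 100000 + b"</head></html>"

    def open_page():
        return io.BytesIO(page)

    chunked = time_runs(open_page, lambda s: find_in_stream(s, b"<html>"))
    whole = time_runs(open_page, lambda s: b"<html>" in s.read())
    print("f.read(CHUNK):", chunked)
    print("f.read():     ", whole)
```

With an in-memory page the two timings mostly measure copying, not network latency, so any real conclusion about early termination would need the urlopen() variant against an actual server.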