begin to parse a web page not entirely downloaded
k0mp
Michel.Al1 at gmail.com
Thu Feb 8 14:26:39 EST 2007
On Feb 8, 8:06 pm, Björn Steinbrink <B.Steinbr... at gmx.de> wrote:
> On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
> > On Feb 8, 6:54 pm, Leif K-Brooks <eurl... at ecritters.biz> wrote:
> >> k0mp wrote:
> >> > Is there a way to retrieve a web page and before it is entirely
> >> > downloaded, begin to test if a specific string is present and if yes
> >> > stop the download ?
> >> > I believe that urllib.openurl(url) will retrieve the whole page before
> >> > the program goes to the next statement.
>
> >> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>
> >> >>> foo = urllib.urlopen('http://google.com')
> >> >>> foo.read(512)
> >> '<html><head> ...
>
> >> foo.read(512) will return as soon as 512 bytes have been received. You
> >> can keep calling it until it returns an empty string, indicating that
> >> there's no more data to be read.
>
> > Thanks for your answer :)
>
> > I'm not sure that read() works as you say.
> > Here is a test I've done :
>
> > import urllib2
> > import re
> > import time
>
> > CHUNKSIZE = 1024
>
> > print 'f.read(CHUNK)'
> > print time.clock()
>
> > for i in range(30) :
> >     f = urllib2.urlopen('http://google.com')
> >     while True: # read the page using a loop
> >         chunk = f.read(CHUNKSIZE)
> >         if not chunk: break
> >         m = re.search('<html>', chunk )
> >         if m != None :
> >             break
>
> > print time.clock()
>
> > print
>
> > print 'f.read()'
> > print time.clock()
> > for i in range(30) :
> >     f = urllib2.urlopen('http://google.com')
> >     m = re.search('<html>', f.read() )
> >     if m != None :
> >         break
>
> A fair comparison would use "pass" here. Or a while loop as in the
> other case. The way it is, it compares 30 times read(CHUNKSIZE)
> against one time read().
>
> Björn
That's right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because when the string is found, I don't want the rest of the page to
be parsed). I obtain this:
f.read(CHUNK)
0.1
0.17
f.read()
0.17
0.23
f.read() is still faster than f.read(CHUNK).
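One subtlety in the chunked loop: a marker that straddles a chunk boundary (say, the first chunk ends with '<ht' and the next begins with 'ml>') is never seen, because each chunk is searched in isolation. A minimal sketch of chunked reading with a small overlap to catch such matches; the helper name is made up, and an in-memory stream stands in for urlopen so it runs without a network connection:

```python
import io
import re

def read_until_match(stream, pattern, chunk_size=1024):
    """Read a file-like object in chunks and stop as soon as `pattern`
    is found, keeping a tail of the previous chunk so a match that
    straddles a chunk boundary is not missed."""
    regex = re.compile(pattern)
    overlap = len(pattern)  # enough leftover bytes to catch a split match
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # end of stream, pattern never seen
        buf = buf[-overlap:] + chunk  # tail of previous chunk + new data
        m = regex.search(buf)
        if m:
            return m.group(0)

# Simulate a page arriving in pieces; the marker sits deep in the stream.
page = b"x" * 3000 + b"<html><head>" + b"y" * 3000
print(read_until_match(io.BytesIO(page), b"<html>"))  # b'<html>'
```

With real network data you would pass the object returned by urlopen instead of the BytesIO; the loop stops reading as soon as the marker appears, which is the early-exit behavior discussed above.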