begin to parse a web page not entirely downloaded

MRAB google at mrabarnett.plus.com
Thu Feb 8 18:28:20 EST 2007


On Feb 8, 6:20 pm, "k0mp" <Michel.... at gmail.com> wrote:
> On Feb 8, 6:54 pm, Leif K-Brooks <eurl... at ecritters.biz> wrote:
>
>
>
> > k0mp wrote:
> > > Is there a way to retrieve a web page and, before it is entirely
> > > downloaded, test whether a specific string is present, and if so
> > > stop the download?
> > > I believe that urllib.urlopen(url) will retrieve the whole page
> > > before the program goes to the next statement.
>
> > Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>
> >  >>> foo = urllib.urlopen('http://google.com')
> >  >>> foo.read(512)
> > '<html><head> ...
>
> > foo.read(512) will return as soon as 512 bytes have been received. You
> > can keep calling it until it returns an empty string, indicating that
> > there's no more data to be read.
>
> Thanks for your answer :)
>
> I'm not sure that read() works as you say.
> Here is a test I've done :
>
> import urllib2
> import re
> import time
>
> CHUNKSIZE = 1024
>
> print 'f.read(CHUNK)'
> print time.clock()
>
> for i in range(30) :
>     f = urllib2.urlopen('http://google.com')
>     while True:               # read the page using a loop
>         chunk = f.read(CHUNKSIZE)
>         if not chunk: break
>         m = re.search('<html>', chunk )
>         if m != None :
>             break
>
[snip]
I'd just like to point out that the above code assumes that the
'<html>' is entirely within one chunk; it could in fact be split
across chunks.
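One way around that is to keep the last len(target) - 1 bytes of the
previous chunk and prepend them to the next one, so a match straddling a
chunk boundary is still seen. A minimal sketch (the contains_in_stream
helper name is mine, and io.BytesIO stands in for the network response;
written for Python 3, so the target is a bytes object):

```python
import io

def contains_in_stream(stream, target, chunksize=1024):
    """Return True as soon as `target` appears in the stream, without
    reading the whole stream.  A tail of the previous chunk is carried
    over so matches split across chunk boundaries are still found."""
    overlap = len(target) - 1
    tail = b''
    while True:
        chunk = stream.read(chunksize)
        if not chunk:          # end of stream, no match
            return False
        window = tail + chunk  # rejoin the boundary
        if target in window:
            return True
        tail = window[-overlap:] if overlap else b''

# The marker here deliberately straddles the 1024-byte chunk boundary:
data = b'x' * 1021 + b'<html>' + b'y' * 2000
print(contains_in_stream(io.BytesIO(data), b'<html>'))  # True
```

With a plain per-chunk search, the example above would miss the marker,
since only '<ht' lands in the first chunk and 'ml>' in the second.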




More information about the Python-list mailing list