begin to parse a web page before it is entirely downloaded

k0mp Michel.Al1 at gmail.com
Thu Feb 8 13:20:56 EST 2007


On Feb 8, 6:54 pm, Leif K-Brooks <eurl... at ecritters.biz> wrote:
> k0mp wrote:
> > Is there a way to retrieve a web page and before it is entirely
> > downloaded, begin to test if a specific string is present and if yes
> > stop the download ?
> > I believe that urllib.urlopen(url) will retrieve the whole page before
> > the program goes to the next statement.
>
> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>
>  >>> foo = urllib.urlopen('http://google.com')
>  >>> foo.read(512)
> '<html><head> ...
>
> foo.read(512) will return as soon as 512 bytes have been received. You
> can keep calling it until it returns an empty string, indicating that
> there's no more data to be read.

Thanks for your answer :)
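
If I understand the suggestion, the full loop would look something like this (a minimal sketch; the overlap guard is my own addition, since '<html>' could straddle two chunks):

import urllib

PATTERN = '<html>'
CHUNKSIZE = 512

f = urllib.urlopen('http://google.com')
tail = ''                         # end of the previous chunk
found = False
while True:
    chunk = f.read(CHUNKSIZE)     # read at most CHUNKSIZE bytes at a time
    if not chunk:                 # empty string: the page is finished
        break
    # search the old tail plus the new chunk, so a match that
    # spans a chunk boundary is still seen
    if PATTERN in tail + chunk:
        found = True
        break
    tail = chunk[-(len(PATTERN) - 1):]
f.close()
print found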

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNKSIZE)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:               # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        # nb: a match that spans two chunks would be missed here
        m = re.search('<html>', chunk)
        if m is not None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m is not None:   # nb: this break leaves the for loop itself,
        break           # so only the first download gets timed

print time.clock()


It prints this:
f.read(CHUNKSIZE)
0.1
0.31

f.read()
0.31
0.32


It seems to take more time when I use read(size) than plain read().
I think in both cases urllib2.urlopen retrieves the whole page.
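
One thing I am not sure about in my own measurement: on Linux, time.clock() returns CPU time, not elapsed time, so the time spent waiting on the network may not be counted at all. Here is a wall-clock version of the comparison, without the early break noted above (a sketch; the times will of course vary with the network):

import time
import urllib2

CHUNKSIZE = 1024

start = time.time()                  # wall-clock, unlike time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:
        chunk = f.read(CHUNKSIZE)
        if not chunk:
            break
        if '<html>' in chunk:        # stop this download early
            break
    f.close()
print 'f.read(CHUNKSIZE):', time.time() - start

start = time.time()
for i in range(30):                  # no break out of the for loop here
    f = urllib2.urlopen('http://google.com')
    found = '<html>' in f.read()     # read everything, then search
    f.close()
print 'f.read():', time.time() - start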



