urllib2 - iteration over non-sequence

Sun Jun 10 02:47:25 EDT 2007

On Sun, 10 Jun 2007 02:54:47 -0300, Erik Max Francis <max at alcyone.com>
wrote:

> Gary Herron wrote:
>
>> Certainly there are cases where xreadlines or read(bytecount) are
>> reasonable, but only if the total page size is *very* large.  But for
>> most web pages, you guys are just nit-picking (or showing off) to
>> suggest that the full read implemented by readlines is wasteful.
>> Moreover, the original problem was with sockets -- which don't have
>> xreadlines.  That seems to be a method on regular file objects.
>>
> There is absolutely no reason to read the entire file into memory (which
> is what you're doing) before processing it.  This is a good example of
> the principle of there is one obvious right way to do it -- and it isn't
> to read the whole thing in first for no reason whatsoever other than to
> avoid an `x`.

The problem (which you appear not to have noticed) is that the object
returned by urlopen does NOT have an xreadlines() method; and even if it
did, a lot of pages don't contain any '\n', so using xreadlines would read
the whole page into memory anyway.
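
To illustrate the second point, here is a minimal sketch in current Python 3 syntax (not the OP's Python 2.2). The io.BytesIO object is only a stand-in I am assuming for a page body with no newline in it; no real URL is fetched. Line-based iteration has nothing to split on, so the whole body comes back as a single "line":

```python
import io

# Hypothetical stand-in for a page body that contains no '\n' at all.
page = io.BytesIO(b"x" * 10000)

# Iterating a file-like object yields one item per line; with no newline
# present, the only "line" is the entire content.
lines = list(page)
line_count = len(lines)
longest = max(len(line) for line in lines)
```

Here line_count is 1 and longest is 10000, so reading line by line saved no memory.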

Python 2.2 (the version that the OP is using) did include an xreadlines
module (now defunct), but in this case it is painfully slow;
perhaps it tries to read the source one character at a time.

So the best way would be to use (as Paul Rubin already said):

# f.read(4096) returns blocks of up to 4096 bytes, not lines
for chunk in iter(lambda: f.read(4096), ''):
    print chunk
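
For readers on a newer Python, the same two-argument iter() idiom can be sketched as below. This is only an illustration under stated assumptions: iter_chunks is a hypothetical helper name, io.BytesIO stands in for the object returned by urlopen, and the sentinel is b'' rather than '' because a binary stream in Python 3 returns bytes.

```python
import io
from functools import partial

def iter_chunks(f, size=4096):
    """Yield successive blocks of at most `size` bytes from file-like f."""
    # iter(callable, sentinel) keeps calling f.read(size) and stops as soon
    # as a call returns the sentinel b'' (end of stream).
    return iter(partial(f.read, size), b"")

# Hypothetical stand-in for a 10000-byte response body.
source = io.BytesIO(b"a" * 10000)
chunk_sizes = [len(chunk) for chunk in iter_chunks(source)]
```

With this in-memory source, chunk_sizes is [4096, 4096, 1808]. A real network response may hand back shorter blocks, but at most one block of `size` bytes is held at a time, whether or not the page contains any '\n'.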

-- 
Gabriel Genellina