[ python-Bugs-1682241 ] Problems with urllib2 read()

SourceForge.net noreply at sourceforge.net
Fri Apr 27 05:36:58 CEST 2007


Bugs item #1682241 was opened at 2007-03-16 12:00
Message generated for change (Comment added) made by maenpaa
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1682241&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Lucas Malor (lucas_malor)
Assigned to: Nobody/Anonymous (nobody)
Summary: Problems with urllib2 read()

Initial Comment:
urllib2 objects opened with urlopen() do not have a seek() method the way file objects do, so reading only some bytes from an opened URL is practically impossible.

An example: I tried to open a URL and check whether it is a gzip file, falling back to a plain read() if IOError is raised (to do this I applied the #1675951 patch: https://sourceforge.net/tracker/index.php?func=detail&aid=1675951&group_id=5470&atid=305470 )

But after trying to open the file as gzip, if it is not a gzip file the current position in the urllib object is already past the two-byte gzip magic number, so read() returns the data from the third byte to the last. You can't check the header of the file before storing it on disk. Well, so what is urlopen() for? If I must store the file on disk by URL and reload it, I can use urlretrieve() ...
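
A minimal sketch of the failure mode (assuming Python 2.5 with the
#1675951 patch applied so GzipFile accepts the non-seekable response;
the URL is hypothetical):

import gzip
import urllib2

response = urllib2.urlopen('http://example.com/maybe-gzipped')

try:
    # GzipFile's header check consumes the first two bytes (the gzip
    # magic number) from the response stream.
    data = gzip.GzipFile(fileobj=response).read()
except IOError:
    # Not gzipped -- but the two magic bytes are already gone, and the
    # response has no seek() to rewind, so this read() starts at byte 3.
    data = response.read()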

----------------------------------------------------------------------

Comment By: Zacherates (maenpaa)
Date: 2007-04-26 23:36

Message:
Logged In: YES 
user_id=1421845
Originator: NO

> In my opinion it's not complicated, it's convoluted. I must use two
> objects to handle one data stream.

seek() is not a stream operation. It is a random access operation
(file-like != stream). If you were only trying to use stream operations
then you wouldn't have these problems.   

Each class provides a separate piece of functionality: urllib gets the
file while StringIO stores it.  The fact that these responsibilities are
given to different classes should not be surprising, since they represent
separately useful concepts that abstract different things.  It's not
convoluted, it's good design.  If every class tried to do everything,
pretty soon you'd be adding solve_my_business_problem_using_SOA() to
__builtins__, and nobody wants that.


> Furthermore it's a waste of resources. I must copy data to another
> object. Luckily in my script I download and handle only little files.
> But what if a python program must handle big files?

This is exactly why urllib *doesn't* provide seek().  Deep down in the
networking library there's a socket with an 8 KiB buffer talking to the
HTTP server.  No matter how big the file you're getting with urllib, once
that buffer fills, TCP flow control makes the server stop sending until
you read more.

To provide seek(), urllib would need to keep an entire copy of the file
that was retrieved (or provide mark()/reset(), but those have wildly
different semantics from the seek() we're used to in Python, and besides,
they're too Java).  This works fine if you're only working with small
files, but you raise a good point: "But what if a python program must
handle big files?"  What about really big files (say a Knoppix DVD ISO)?
Sure, you could use urlretrieve, but what if urlretrieve is implemented
in terms of urlopen?

Sure, urllib could implement seek() (with the same semantics as
file.seek()), but that would mean breaking urllib for any resource big
enough that you don't want the whole thing in memory.


>> You can check the type of the response content before you try
>> to uncompress it via the Content-Encoding header of the
>> response

> It's not a generic solution.

The point of this suggestion is not that it is the be-all and end-all
solution, but that code that *needs* seek() can probably be rewritten so
that it does not.  Either that, or you could implement a BufferedReader
with mark() and reset() methods and wrap the result of urlopen().
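
For illustration only, a minimal sketch of such a wrapper (hypothetical
code, not part of urllib; it makes the trade-off explicit by buffering
every byte it has read so far):

class SeekableResponse(object):
    """Adapter that gives seek()/tell() to a read()-only stream by
    buffering everything pulled from it.  Memory use grows with the
    amount read, which is exactly the cost described above."""

    def __init__(self, fp):
        self._fp = fp      # underlying non-seekable stream
        self._data = ''    # everything fetched so far
        self._pos = 0      # current logical position

    def _fetch(self, n):
        # Pull more bytes into the buffer; n < 0 means read to EOF.
        if n < 0:
            self._data += self._fp.read()
        else:
            self._data += self._fp.read(n)

    def read(self, size=-1):
        if size < 0:
            self._fetch(-1)
            end = len(self._data)
        else:
            shortfall = self._pos + size - len(self._data)
            if shortfall > 0:
                self._fetch(shortfall)
            end = min(self._pos + size, len(self._data))
        data = self._data[self._pos:end]
        self._pos += len(data)
        return data

    def seek(self, offset, whence=0):
        if whence == 0:
            self._pos = offset
        elif whence == 1:
            self._pos += offset
        else:  # whence == 2: must read to EOF to know where the end is
            self._fetch(-1)
            self._pos = len(self._data) + offset

    def tell(self):
        return self._pos

# Usage (hypothetical URL):
#   f = SeekableResponse(urllib2.urlopen('http://example.com/big.iso'))
#   f.seek(0, 2)   # forces the entire resource into memory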


----------------------------------------------------------------------

Comment By: Lucas Malor (lucas_malor)
Date: 2007-04-26 16:41

Message:
Logged In: YES 
user_id=1403274
Originator: YES

In my opinion it's not complicated, it's convoluted. I must use two
objects to handle one data stream.

Furthermore it's a waste of resources: I must copy data to another
object. Luckily in my script I download and handle only little files.
But what if a python program must handle big files?

If seek() can't be used (an exception is raised), urllib could fall back
to a sequential access method.

----------------------------------------------------------------------

Comment By: Calvin Spealman (ironfroggy)
Date: 2007-04-26 09:55

Message:
Logged In: YES 
user_id=112166
Originator: NO

I have to agree that this is not a bug. HTTP responses are streams, not
random access files. Adding a seek would have disastrous performance
penalties. If you think the workaround is too complicated, I can't
understand why.

----------------------------------------------------------------------

Comment By: Zacherates (maenpaa)
Date: 2007-03-20 21:39

Message:
Logged In: YES 
user_id=1421845
Originator: NO

> I use the method you wrote, but this must be done manually,
> and I don't know why.

read() is a stream-processing method, whereas seek() is a random-access
method.  HTTP resources are in essence streams, so they implement
read() but not seek().  Trying to shoehorn a stream to act like a random
access file has some rather important technical implications.  For example:
what happens when an HTTP resource is larger than available memory and we
try to maintain a full featured seek() implementation?

> so what is urlopen() for?
Fetching a webpage or RSS feed and feeding it to a parser, for example.

StringIO is a class that was designed to provide feature-complete,
random access, file-like behavior that can be wrapped around a stream.
StringIO can and should be used as an adapter when you have a stream
that you need random access to.  This allows designers the freedom to
simply provide a good read() implementation and let clients wrap the
output in a StringIO if needed.

If in your application you always want random access and you don't have
to deal with large files:

import StringIO
import urllib2

def my_urlopen(*args, **kwargs):
    return StringIO.StringIO(urllib2.urlopen(*args, **kwargs).read())
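
For instance (a hedged usage sketch; the URL is hypothetical):

f = my_urlopen('http://example.com/small-file')
magic = f.read(2)   # peek at the first two bytes
f.seek(0)           # rewinding works: StringIO is random access
body = f.read()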

Python makes delegation trivially easy.

In essence, urlfiles (the result of urllib2.urlopen()) and regular files
(the result of open()) behave differently because they implement
different interfaces.  If you use the common interface (read), you can
treat them equally.  If you use the specialized interface (seek, tell,
etc.), you'll have trouble.  The solution is to wrap the general object
in a specialized object that implements the desired interface: StringIO.
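
As a small illustration of the common-interface point (an assumed helper,
not from the thread), the following works unchanged on open() files,
urllib2 responses, and StringIO objects because it touches only read():

def copy_to(fp, out, blocksize=8192):
    # Relies only on read(), the one method all three kinds of
    # file-like object share.
    while True:
        block = fp.read(blocksize)
        if not block:
            break
        out.write(block)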

----------------------------------------------------------------------

Comment By: Lucas Malor (lucas_malor)
Date: 2007-03-20 04:59

Message:
Logged In: YES 
user_id=1403274
Originator: YES

> If you need to seek, you can wrap the file-like object in a
> StringIO (which is what urllib would have to do internally
> [...] )

I think it's really a bug, or at least non-pythonic behavior.
I use the method you wrote, but this must be done manually,
and I don't see why. Without this "trick" you can't handle
URL and file objects together, as they don't work in the same
manner. It would not be too complicated for the urllib classes
to use an internal StringIO object when I must seek() or use
other file-like methods.

> You can check the type of the response content before you try
> to uncompress it via the Content-Encoding header of the
> response

It's not a generic solution.

(thanks anyway for suggested solutions :) )

----------------------------------------------------------------------

Comment By: Zacherates (maenpaa)
Date: 2007-03-19 22:43

Message:
Logged In: YES 
user_id=1421845
Originator: NO

I'd contend that this is not a bug:
 * If you need to seek, you can wrap the file-like object in a StringIO
(which is what urllib would have to do internally, thus incurring the
StringIO overhead for all clients, even those that don't need the
functionality).
 * You can check the type of the response content before you try to
uncompress it via the Content-Encoding header of the response.  The
meta-data is there for a reason.

Check
http://www.diveintopython.org/http_web_services/gzip_compression.html
for a rather complete treatment of your use case.
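
A minimal sketch of that header-based check (assuming Python 2.x urllib2,
following the pattern on the page above; the URL is hypothetical):

import gzip
import StringIO
import urllib2

request = urllib2.Request('http://example.com/feed')
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)

# Decide how to decode from the response metadata instead of probing
# the payload bytes (which would consume them).
if response.headers.get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO.StringIO(response.read())).read()
else:
    body = response.read()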

----------------------------------------------------------------------
