urllib2 performance on windows, usb connection

Fri Feb 6 21:28:09 EST 2009

dq wrote:
> MRAB wrote:
>> dq wrote:
>>> dq wrote:
>>>> MRAB wrote:
>>>>> dq wrote:
>>>>>> Martin v. Löwis wrote:
>>>>>>>> So does anyone know what the deal is with this?  Why is the same 
>>>>>>>> code so much slower on Windows?  Hope someone can tell me before 
>>>>>>>> a holy war erupts :-)
>>>>>>>
>>>>>>> Only the holy war can give an answer here. It certainly has
>>>>>>>  *nothing* to do with Python; Python calls the operating system 
>>>>>>> functions to read from the network and write to the disk almost 
>>>>>>> directly. So it must be the operating system itself that slows it 
>>>>>>> down.
>>>>>>>
>>>>>>> To investigate further, you might drop the write operation
>>>>>>> and measure only source.read(). If that is slower, then, for 
>>>>>>> some reason, the network speed is bad on Windows. Maybe
>>>>>>>  you have the network interfaces misconfigured? Maybe you are 
>>>>>>> using wireless on Windows, but cable on Linux? Maybe you have 
>>>>>>> some network filtering software running on Windows? Maybe it's 
>>>>>>> just that Windows sucks?-)
>>>>>>>
>>>>>>> If the network read speed is fine, but writing slows down,
>>>>>>>  I ask the same questions. Perhaps you have some virus scanner 
>>>>>>> installed that filters all write operations? Maybe
>>>>>>>  Windows sucks?
>>>>>>>
>>>>>>> Regards, Martin
>>>>>>>
>>>>>>
>>>>>> Thanks for the ideas, Martin.  I ran a couple of experiments
>>>>>>  to find the culprit, by downloading the same 20 MB file from
>>>>>>  the same fast server. I compared:
>>>>>>
>>>>>> 1.  DL to HD vs. USB iPod.
>>>>>> 2.  AV on-access protection on vs. off.
>>>>>> 3.  "source.read()" only vs. "file.write( source.read() )".
>>>>>>
>>>>>> The culprit is definitely the write speed on the iPod.  That is, 
>>>>>> everything runs plenty fast (~1 MB/s down) as long as I'm
>>>>>> not writing directly to the iPod.  This is kind of odd, because if 
>>>>>> I copy the file over from the HD to the iPod using
>>>>>>  windows (drag-n-drop), it takes about a second or two, so about 
>>>>>> 10 MB/s.
>>>>>>
>>>>>> So the problem is definitely partially Windows, but it also seems 
>>>>>> that Python's file.write() function is not without blame. It's the 
>>>>>> combination of Windows, iPod and Python's data stream that is 
>>>>>> slowing me down.
>>>>>>
>>>>>> I'm not really sure what I can do about this.  I'll experiment a 
>>>>>> little more and see if there's any way around this bottleneck.  If 
>>>>>> anyone has run into a problem like this,
>>>>>>  I'd love to hear about it...
>>>>>>
>>>>> You could try copying the file to the iPod using the command line, 
>>>>> or copying data from disk to iPod in, say, C, anything but Python. 
>>>>> This would allow you to identify whether Python itself has anything 
>>>>> to do with it.
>>>>
>>>> Well, I think I've partially identified the problem.
>>>> target.write( source.read() ) runs perfectly fast, copies 20 megs
>>>> in about a second, from HD to iPod.  However, if I run the same
>>>> code in a while loop, using a certain block size, say
>>>> target.write( source.read(4096) ), it takes forever (or at least
>>>> I'm still timing it while I write this post).
>>>>
>>>> The mismatch seems to be between urllib2's block size and the write 
>>>> speed of the iPod.  I might try to tweak this a little in the code 
>>>> and see if it has any effect.
>>>>
>>>> Oh, there we go:   20 megs in 135.8 seconds.  Yeah... I might want 
>>>> to try to improve that...
>>>
>>> After some tweaking of the block size, I managed to get the DL speed 
>>> up to about 900 KB/s.  It's still not quite Ubuntu, but it's a good
>>> order of magnitude better.  The new DL code is pretty much this:
>>>
>>> """
>>> blocksize = 2 ** 16    # plus or minus a power of 2
>>> source = urllib2.urlopen( 'url://string' )
>>> target = open( pathname, 'wb')
>>> fullsize = float( source.info()['Content-Length'] )
>>> DLd = 0
>>> while DLd < fullsize:
>>>     DLd = DLd + blocksize
>>>     # optional:  write some DL progress info
>>>     # somewhere, e.g. stdout
>>> target.close()
>>> source.close()
>>> """
>>>
>> I'd like to suggest that the block size you add to 'DLd' be the actual 
>> size of the returned block, just in case the read() doesn't return all 
>> you asked for (it might not be guaranteed, and the chances
>>  are that the final block will be shorter, unless 'fullsize' happens
>>  to be a multiple of 'blocksize').
>>
>> If less is returned by read() then the while-loop might finish before
>> all the data has been downloaded, and if you just add 'blocksize' 
>> each time it might end up > 'fullsize', i.e. apparently >100% downloaded!
> 
> Interesting.  I'll check to see if any of the downloaded files end 
> prematurely :)
> 
> btw, I forgot the most important line of the code!
> 
> """
> blocksize = 2 ** 16    # plus or minus a power of 2
> source = urllib2.urlopen( 'url://string' )
> target = open( pathname, 'wb')
> fullsize = float( source.info()['Content-Length'] )
> DLd = 0
> while DLd < fullsize:
>     #  +++
>     target.write( source.read( blocksize ) )  # +++
>     #  +++
>     DLd = DLd + blocksize
>     # optional:  write some DL progress info
>     # somewhere, e.g. stdout
> target.close()
> source.close()
> """
> 
> Using that, I'm not quite sure where I can grab onto the value of how 
> much was actually read from the block.  I suppose I could use an 
> intermediate variable, read the data into it, measure the size, and then 
> write it to the file stream, but I'm not sure it would be worth the 
> overhead.  Or is there some other magic I should know about?
> 
> If I start to get that problem, at least I'll know where to look...
> 
It's just:

     data = source.read(blocksize)
     target.write(data)
     DLd = DLd + len(data)

The overhead is tiny because you're not copying the data.
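
Put together, the whole loop might look something like the sketch below.
The helper name copy_in_blocks and the in-memory streams are only there
for illustration; with urllib2 the source would be whatever urlopen()
returns and the target your opened file. This version also stops when
read() returns nothing, rather than trusting Content-Length, which is one
possible choice and not the only one.

```python
import io

def copy_in_blocks(source, target, blocksize=2 ** 16):
    """Copy source to target block by block; return the bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)
        if not data:          # an empty read means end of stream
            break
        target.write(data)
        copied += len(data)   # count what was actually returned
    return copied

# Stand-ins for the urlopen() response and the opened file
# (hypothetical data, just so the sketch runs on its own).
source = io.BytesIO(b"x" * 150000)
target = io.BytesIO()
total = copy_in_blocks(source, target)
print(total)
```

The progress figure can then be computed from the running count, so it
never goes past 100% even when the last block is short.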

If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're 
not actually copying that bytestring; you're just making 'y' also refer 
to it or passing the reference to it into 'foo'. It's a bit like passing 
pointers around, but without the nasty bits! :-)
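
If you want to convince yourself, here is a small sketch using the same
names as above ('x', 'y' and 'foo'). The 'is' operator tests whether two
names refer to the very same object, so it shows that no copy was made:

```python
x = b"a" * (1024 * 1024)   # a 1 MB bytestring

y = x                      # binds a second name; no copy is made

def foo(s):
    # 's' is the very object the caller passed in
    return s is x

print(y is x)
print(foo(x))
```

Both checks report that it is the same object.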