urllib2 performance on windows, usb connection

dq dq at gmail.com
Sat Feb 7 04:36:14 EST 2009


MRAB wrote:
> dq wrote:
>> MRAB wrote:
>>> dq wrote:
>>>> dq wrote:
>>>>> MRAB wrote:
>>>>>> dq wrote:
>>>>>>> Martin v. Löwis wrote:
>>>>>>>>> So does anyone know what the deal is with this?  Why is the 
>>>>>>>>> same code so much slower on Windows?  Hope someone can tell me 
>>>>>>>>> before a holy war erupts :-)
>>>>>>>>
>>>>>>>> Only the holy war can give an answer here. It certainly has
>>>>>>>>  *nothing* to do with Python; Python calls the operating system 
>>>>>>>> functions to read from the network and write to the disk almost 
>>>>>>>> directly. So it must be the operating system itself that slows 
>>>>>>>> it down.
>>>>>>>>
>>>>>>>> To investigate further, you might drop the write operation,
>>>>>>>>  and measure only source.read(). If that is slower, then, for 
>>>>>>>> some reason, the network speed is bad on Windows. Maybe
>>>>>>>>  you have the network interfaces misconfigured? Maybe you are 
>>>>>>>> using wireless on Windows, but cable on Linux? Maybe you have 
>>>>>>>> some network filtering software running on Windows? Maybe it's 
>>>>>>>> just that Windows sucks?-)
>>>>>>>>
>>>>>>>> If the network read speed is fine, but writing slows down,
>>>>>>>>  I ask the same questions. Perhaps you have some virus scanner 
>>>>>>>> installed that filters all write operations? Maybe
>>>>>>>>  Windows sucks?
>>>>>>>>
>>>>>>>> Regards, Martin
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for the ideas, Martin.  I ran a couple of experiments
>>>>>>>  to find the culprit, by downloading the same 20 MB file from
>>>>>>>  the same fast server. I compared:
>>>>>>>
>>>>>>> 1.  DL to HD vs. USB iPod
>>>>>>> 2.  AV on-access protection on vs. off
>>>>>>> 3.  "source.read()" only vs. "file.write( source.read() )"
>>>>>>>
>>>>>>> The culprit is definitely the write speed on the iPod.  That is, 
>>>>>>> everything runs plenty fast (~1 MB/s down) as long as I'm
>>>>>>> not writing directly to the iPod.  This is kind of odd, because 
>>>>>>> if I copy the file over from the HD to the iPod using
>>>>>>>  windows (drag-n-drop), it takes about a second or two, so about 
>>>>>>> 10 MB/s.
>>>>>>>
>>>>>>> So the problem is definitely partially Windows, but it also seems 
>>>>>>> that Python's file.write() function is not without blame. It's 
>>>>>>> the combination of Windows, iPod and Python's data stream that is 
>>>>>>> slowing me down.
>>>>>>>
>>>>>>> I'm not really sure what I can do about this.  I'll experiment a 
>>>>>>> little more and see if there's any way around this bottleneck.  
>>>>>>> If anyone has run into a problem like this,
>>>>>>>  I'd love to hear about it...
>>>>>>>
>>>>>> You could try copying the file to the iPod using the command line, 
>>>>>> or copying data from disk to iPod in, say, C, anything but Python. 
>>>>>> This would allow you to identify whether Python itself has 
>>>>>> anything to do with it.
>>>>>
>>>>> Well, I think I've partially identified the problem. target.write( 
>>>>> source.read() ) runs perfectly fast, copies 20 megs
>>>>>  in about a second, from HD to iPod.  However, if I run the same
>>>>>  code in a while loop, using a certain block size, say 
>>>>> target.write( source.read(4096) ), it takes forever (or at least
>>>>>  I'm still timing it while I write this post).
>>>>>
>>>>> The mismatch seems to be between urllib2's block size and the write 
>>>>> speed of the iPod. I might try to tweak this a little in the code 
>>>>> and see if it has any effect.
>>>>>
>>>>> Oh, there we go:   20 megs in 135.8 seconds.  Yeah... I might want 
>>>>> to try to improve that...
>>>>
>>>> After some tweaking of the block size, I managed to get the DL speed 
>>>> up to about 900 KB/s.  It's still not quite Ubuntu, but it's
>>>>  a good order of magnitude better.  The new DL code is pretty much
>>>>  this:
>>>>
>>>> """
>>>> blocksize = 2 ** 16    # plus or minus a power of 2
>>>> source = urllib2.urlopen( 'url://string' )
>>>> target = open( pathname, 'wb')
>>>> fullsize = float( source.info()['Content-Length'] )
>>>> DLd = 0
>>>> while DLd < fullsize:
>>>>     DLd = DLd + blocksize
>>>>     # optional:  write some DL progress info
>>>>     # somewhere, e.g. stdout
>>>> target.close()
>>>> source.close()
>>>> """
>>>>
>>> I'd like to suggest that the block size you add to 'DLd' be the 
>>> actual size of the returned block, just in case the read() doesn't 
>>> return all you asked for (it might not be guaranteed, and the chances
>>>  are that the final block will be shorter, unless 'fullsize' happens
>>>  to be a multiple of 'blocksize').
>>>
>>> If less is returned by read() then the while-loop might finish before
>>>  all the data has been downloaded, and if you just add 'blocksize' 
>>> each time it might end up > 'fullsize', i.e. apparently >100% downloaded!
>>
>> Interesting.  I'll check to see if any of the downloaded files end 
>> prematurely :)
>>
>> btw, I forgot the most important line of the code!
>>
>> """
>> blocksize = 2 ** 16    # plus or minus a power of 2
>> source = urllib2.urlopen( 'url://string' )
>> target = open( pathname, 'wb')
>> fullsize = float( source.info()['Content-Length'] )
>> DLd = 0
>> while DLd < fullsize:
>>     #  +++
>>     target.write( source.read( blocksize ) )  # +++
>>     #  +++
>>     DLd = DLd + blocksize
>>     # optional:  write some DL progress info
>>     # somewhere, e.g. stdout
>> target.close()
>> source.close()
>> """
>>
>> Using that, I'm not quite sure where I can grab onto the value of how 
>> much was actually read from the block.  I suppose I could use an 
>> intermediate variable, read the data into it, measure the size, and 
>> then write it to the file stream, but I'm not sure it would be worth 
>> the overhead.  Or is there some other magic I should know about?
>>
>> If I start to get that problem, at least I'll know where to look...
>>
> It's just:
> 
>     data = source.read(blocksize)
>     target.write(data)
>     DLd = DLd + len(data)
> 
> The overhead is tiny because you're not copying the data.
> 
> If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're 
> not actually copying that bytestring; you're just making 'y' also refer 
> to it or passing the reference to it into 'foo'. It's a bit like passing 
> pointers around, but without the nasty bits! :-)
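
For anyone who wants to check the point about references, here is a minimal sketch. The names 'x', 'y' and 'foo' are taken from the example above; the 1 MB of zero bytes is just an assumed stand-in for a downloaded block.

```python
# A 1 MB bytestring standing in for one block of downloaded data.
x = b"\0" * (1024 * 1024)


def foo(data):
    # 'data' is just another name for the caller's object; nothing is copied.
    return data


y = x
same_after_assignment = y is x      # identity check, not a byte comparison
same_after_call = foo(x) is x

print(same_after_assignment, same_after_call)
```

Both checks compare object identity, so they only pass if no copy of the bytestring was made along the way.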

Yeah, that's about what I was thinking, although not quite as 
succinctly.  Thanks for the help!
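
Putting the suggestion together, the loop might look roughly like the sketch below. It is not the exact code I'm running: the helper name 'copy_in_blocks' is made up for illustration, and it stops on an empty read instead of comparing against Content-Length, which is a different termination test from the one in my earlier snippet. It only assumes 'source' and 'target' are file-like objects, so in-memory streams stand in here for the urllib2 response and the file on the iPod.

```python
import io


def copy_in_blocks(source, target, blocksize=2 ** 16):
    """Copy source to target block by block; return the number of bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)
        if not data:            # an empty read signals the end of the stream
            break
        target.write(data)
        copied += len(data)     # count what actually arrived, not blocksize
        # optional: write some progress info somewhere, e.g. stdout
    return copied


# In-memory stand-ins for the download and the target file (assumed for the demo).
source = io.BytesIO(b"x" * 150000)
target = io.BytesIO()
copied = copy_in_blocks(source, target)
```

Because the count comes from len(data), a short final block (or any short read) cannot push the total past the real size.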


