urllib2 performance on windows, usb connection
MRAB
google at mrabarnett.plus.com
Fri Feb 6 21:28:09 EST 2009
dq wrote:
> MRAB wrote:
>> dq wrote:
>>> dq wrote:
>>>> MRAB wrote:
>>>>> dq wrote:
>>>>>> Martin v. Löwis wrote:
>>>>>>>> So does anyone know what the deal is with this? Why is the same
>>>>>>>> code so much slower on Windows? Hope someone can tell me before
>>>>>>>> a holy war erupts :-)
>>>>>>>
>>>>>>> Only the holy war can give an answer here. It certainly has
>>>>>>> *nothing* to do with Python; Python calls the operating system
>>>>>>> functions to read from the network and write to the disk almost
>>>>>>> directly. So it must be the operating system itself that slows it
>>>>>>> down.
>>>>>>>
>>>>>>> To investigate further, you might drop the write operation,
>>>>>>> and measure only source.read(). If that is slower, then, for
>>>>>>> some reason, the network speed is bad on Windows. Maybe
>>>>>>> you have the network interfaces misconfigured? Maybe you are
>>>>>>> using wireless on Windows, but cable on Linux? Maybe you have
>>>>>>> some network filtering software running on Windows? Maybe it's
>>>>>>> just that Windows sucks?-)
>>>>>>>
>>>>>>> If the network read speed is fine, but writing slows down,
>>>>>>> I ask the same questions. Perhaps you have some virus scanner
>>>>>>> installed that filters all write operations? Maybe
>>>>>>> Windows sucks?
>>>>>>>
>>>>>>> Regards, Martin
>>>>>>>
>>>>>>
>>>>>> Thanks for the ideas, Martin. I ran a couple of experiments
>>>>>> to find the culprit, by downloading the same 20 MB file from
>>>>>> the same fast server. I compared:
>>>>>>
>>>>>> 1. DL to HD vs. USB iPod
>>>>>> 2. AV on-access protection on vs. off
>>>>>> 3. "source.read()" only vs. "file.write( source.read() )"
>>>>>>
>>>>>> The culprit is definitely the write speed on the iPod. That is,
>>>>>> everything runs plenty fast (~1 MB/s down) as long as I'm
>>>>>> not writing directly to the iPod. This is kind of odd, because if
>>>>>> I copy the file over from the HD to the iPod using
>>>>>> windows (drag-n-drop), it takes about a second or two, so about
>>>>>> 10 MB/s.
>>>>>>
>>>>>> So the problem is definitely partially Windows, but it also seems
>>>>>> that Python's file.write() function is not without blame. It's the
>>>>>> combination of Windows, iPod and Python's data stream that is
>>>>>> slowing me down.
>>>>>>
>>>>>> I'm not really sure what I can do about this. I'll experiment a
>>>>>> little more and see if there's any way around this bottleneck. If
>>>>>> anyone has run into a problem like this,
>>>>>> I'd love to hear about it...
>>>>>>
>>>>> You could try copying the file to the iPod using the command line,
>>>>> or copying data from disk to iPod in, say, C, anything but Python.
>>>>> This would allow you to identify whether Python itself has anything
>>>>> to do with it.
>>>>
>>>> Well, I think I've partially identified the problem. target.write(
>>>> source.read() ) runs perfectly fast, copies 20 megs
>>>> in about a second, from HD to iPod. However, if I run the same
>>>> code in a while loop, using a certain block size, say target.write(
>>>> source.read(4096) ), it takes forever (or at least
>>>> I'm still timing it while I write this post).
>>>>
>>>> The mismatch seems to be between urllib2's block size and the write
>>>> speed of the iPod. I might try to tweak this a little in the code
>>>> and see if it has any effect.
>>>>
>>>> Oh, there we go: 20 megs in 135.8 seconds. Yeah... I might want
>>>> to try to improve that...
>>>
>>> After some tweaking of the block size, I managed to get the DL speed
>>> up to about 900 Mb/s. It's still not quite Ubuntu, but it's
>>> a good order of magnitude better. The new DL code is pretty much
>>> this:
>>>
>>> """
>>> blocksize = 2 ** 16 # plus or minus a power of 2
>>> source = urllib2.urlopen( 'url://string' )
>>> target = open( pathname, 'wb')
>>> fullsize = float( source.info()['Content-Length'] )
>>> DLd = 0
>>> while DLd < fullsize:
>>>     DLd = DLd + blocksize
>>>     # optional: write some DL progress info
>>>     # somewhere, e.g. stdout
>>> target.close()
>>> source.close()
>>> """
>>>
>> I'd like to suggest that the block size you add to 'DLd' be the actual
>> size of the returned block, just in case the read() doesn't return all
>> you asked for (it might not be guaranteed, and the chances
>> are that the final block will be shorter, unless 'fullsize' happens
>> to be a multiple of 'blocksize').
>>
>> If less is returned by read() then the while-loop might finish before
>> all the data has been downloaded, and if you just add 'blocksize'
>> each time it might end up > 'fullsize', ie apparently >100% downloaded!
>
> Interesting. I'll check to see if any of the downloaded files end
> prematurely :)
>
> btw, I forgot the most important line of the code!
>
> """
> blocksize = 2 ** 16 # plus or minus a power of 2
> source = urllib2.urlopen( 'url://string' )
> target = open( pathname, 'wb')
> fullsize = float( source.info()['Content-Length'] )
> DLd = 0
> while DLd < fullsize:
>     # +++
>     target.write( source.read( blocksize ) ) # +++
>     # +++
>     DLd = DLd + blocksize
>     # optional: write some DL progress info
>     # somewhere, e.g. stdout
> target.close()
> source.close()
> """
>
> Using that, I'm not quite sure where I can grab onto the value of how
> much was actually read from the block. I suppose I could use an
> intermediate variable, read the data into it, measure the size, and then
> write it to the file stream, but I'm not sure it would be worth the
> overhead. Or is there some other magic I should know about?
>
> If I start to get that problem, at least I'll know where to look...
>
It's just:

    data = source.read(blocksize)
    target.write(data)
    DLd = DLd + len(data)

The overhead is tiny because you're not copying the data.
If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're
not actually copying that bytestring; you're just making 'y' also refer
to it or passing the reference to it into 'foo'. It's a bit like passing
pointers around, but without the nasty bits! :-)
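
Put together, the whole loop might look something like the sketch below.
The function name copy_in_blocks and the in-memory io.BytesIO objects are
assumptions for illustration only and do not appear in the thread; in the
real script 'source' would be whatever urllib2.urlopen() returned and
'target' the file opened on the iPod. The sketch stops when read() returns
an empty string instead of trusting Content-Length, which is one way to
avoid both the early exit and the ">100%" count.

```python
import io


def copy_in_blocks(source, target, blocksize=2 ** 16):
    """Copy source to target block by block; return the bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)
        if not data:  # an empty result means end of stream
            break
        target.write(data)
        copied += len(data)  # count what was actually read
    return copied


# Stand-ins for the urlopen() response and the open file (assumed).
payload = b"x" * 150000  # deliberately not a multiple of blocksize
source = io.BytesIO(payload)
target = io.BytesIO()
total = copy_in_blocks(source, target)

# Binding a second name does not copy the bytestring.
y = payload
same_object = y is payload
```

Here 'total' is the number of bytes really written, so any progress
display based on it cannot overshoot, and 'same_object' shows that 'y'
and 'payload' are two names for one object.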