urllib2 performance on windows, usb connection
dq
dq at gmail.com
Sat Feb 7 04:36:14 EST 2009
MRAB wrote:
> dq wrote:
>> MRAB wrote:
>>> dq wrote:
>>>> dq wrote:
>>>>> MRAB wrote:
>>>>>> dq wrote:
>>>>>>> Martin v. Löwis wrote:
>>>>>>>>> So does anyone know what the deal is with this? Why is the
>>>>>>>>> same code so much slower on Windows? Hope someone can tell me
>>>>>>>>> before a holy war erupts :-)
>>>>>>>>
>>>>>>>> Only the holy war can give an answer here. It certainly has
>>>>>>>> *nothing* to do with Python; Python calls the operating system
>>>>>>>> functions to read from the network and write to the disk almost
>>>>>>>> directly. So it must be the operating system itself that slows
>>>>>>>> it down.
>>>>>>>>
>>>>>>>> To investigate further, you might drop the write operation,
>>>>>>>> and measure only source.read(). If that is slower, then, for
>>>>>>>> some reason, the network speed is bad on Windows. Maybe
>>>>>>>> you have the network interfaces misconfigured? Maybe you are
>>>>>>>> using wireless on Windows, but cable on Linux? Maybe you have
>>>>>>>> some network filtering software running on Windows? Maybe it's
>>>>>>>> just that Windows sucks?-)
>>>>>>>>
>>>>>>>> If the network read speed is fine, but writing slows down,
>>>>>>>> I ask the same questions. Perhaps you have some virus scanner
>>>>>>>> installed that filters all write operations? Maybe
>>>>>>>> Windows sucks?
>>>>>>>>
>>>>>>>> Regards, Martin
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for the ideas, Martin. I ran a couple of experiments
>>>>>>> to find the culprit, by downloading the same 20 MB file from
>>>>>>> the same fast server. I compared:
>>>>>>>
>>>>>>> 1. DL to HD vs. USB iPod
>>>>>>> 2. AV on-access protection on vs. off
>>>>>>> 3. "source.read()" only vs. "file.write( source.read() )"
>>>>>>>
>>>>>>> The culprit is definitely the write speed on the iPod. That is,
>>>>>>> everything runs plenty fast (~1 MB/s down) as long as I'm
>>>>>>> not writing directly to the iPod. This is kind of odd, because
>>>>>>> if I copy the file over from the HD to the iPod using
>>>>>>> Windows (drag-n-drop), it takes about a second or two, so about
>>>>>>> 10 MB/s.
>>>>>>>
>>>>>>> So the problem is definitely partially Windows, but it also seems
>>>>>>> that Python's file.write() function is not without blame. It's
>>>>>>> the combination of Windows, iPod and Python's data stream that is
>>>>>>> slowing me down.
>>>>>>>
>>>>>>> I'm not really sure what I can do about this. I'll experiment a
>>>>>>> little more and see if there's any way around this bottleneck.
>>>>>>> If anyone has run into a problem like this,
>>>>>>> I'd love to hear about it...
>>>>>>>
>>>>>> You could try copying the file to the iPod using the command line,
>>>>>> or copying data from disk to iPod in, say, C, anything but Python.
>>>>>> This would allow you to identify whether Python itself has
>>>>>> anything to do with it.
>>>>>
>>>>> Well, I think I've partially identified the problem. target.write(
>>>>> source.read() ) runs perfectly fast, copies 20 megs
>>>>> in about a second, from HD to iPod. However, if I run the same
>>>>> code in a while loop, using a certain block size, say
>>>>> target.write( source.read(4096) ), it takes forever (or at least
>>>>> I'm still timing it while I write this post).
>>>>>
>>>>> The mismatch seems to be between urllib2's block size and the write
>>>>> speed of the iPod. I might try to tweak this a little in the code
>>>>> and see if it has any effect.
>>>>>
>>>>> Oh, there we go: 20 megs in 135.8 seconds. Yeah... I might want
>>>>> to try to improve that...
>>>>
>>>> After some tweaking of the block size, I managed to get the DL speed
>>>> up to about 900 KB/s. It's still not quite Ubuntu, but it's
>>>> a good order of magnitude better. The new DL code is pretty much
>>>> this:
>>>>
>>>> """
>>>> blocksize = 2 ** 16  # plus or minus a power of 2
>>>> source = urllib2.urlopen( 'url://string' )
>>>> target = open( pathname, 'wb')
>>>> fullsize = float( source.info()['Content-Length'] )
>>>> DLd = 0
>>>> while DLd < fullsize:
>>>>     DLd = DLd + blocksize
>>>>     # optional: write some DL progress info
>>>>     # somewhere, e.g. stdout
>>>> target.close()
>>>> source.close()
>>>> """
>>>>
>>> I'd like to suggest that the block size you add to 'DLd' be the
>>> actual size of the returned block, just in case the read() doesn't
>>> return all you asked for (it might not be guaranteed, and the chances
>>> are that the final block will be shorter, unless 'fullsize' happens
>>> to be a multiple of 'blocksize').
>>>
>>> If less is returned by read() then the while-loop might finish before
>>> all the data has been downloaded, and if you just add 'blocksize'
>>> each time it might end up > 'fullsize', ie apparently >100% downloaded!
>>
>> Interesting. I'll check to see if any of the downloaded files end
>> prematurely :)
>>
>> btw, I forgot the most important line of the code!
>>
>> """
>> blocksize = 2 ** 16 # plus or minus a power of 2
>> source = urllib2.urlopen( 'url://string' )
>> target = open( pathname, 'wb')
>> fullsize = float( source.info()['Content-Length'] )
>> DLd = 0
>> while DLd < fullsize:
>>     # +++
>>     target.write( source.read( blocksize ) )  # +++
>>     # +++
>>     DLd = DLd + blocksize
>>     # optional: write some DL progress info
>>     # somewhere, e.g. stdout
>> target.close()
>> source.close()
>> """
>>
>> Using that, I'm not quite sure where I can grab onto the value of how
>> much was actually read from the block. I suppose I could use an
>> intermediate variable, read the data into it, measure the size, and
>> then write it to the file stream, but I'm not sure it would be worth
>> the overhead. Or is there some other magic I should know about?
>>
>> If I start to get that problem, at least I'll know where to look...
>>
> It's just:
>
> data = source.read(blocksize)
> target.write(data)
> DLd = DLd + len(data)
>
> The overhead is tiny because you're not copying the data.
>
> If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're
> not actually copying that bytestring; you're just making 'y' also refer
> to it or passing the reference to it into 'foo'. It's a bit like passing
> pointers around, but without the nasty bits! :-)
Yeah, that's about what I was thinking, although not quite as
succinctly. Thanks for the help!
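
For reference, here is a rough sketch of the whole loop with MRAB's
fix folded in. It is only an illustration under a few assumptions:
'copy_stream' is a made-up name, and io.BytesIO objects stand in for
the urllib2 response and the file on the iPod so that the snippet runs
without a network. Unlike the code quoted above, it stops when read()
returns nothing instead of comparing against Content-Length, which
should also cope with short reads and a missing header.

```python
import io


def copy_stream(source, target, blocksize=2 ** 16):
    """Copy source to target block by block; return the bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)  # may return fewer bytes than requested
        if not data:                   # an empty result means end of stream
            break
        target.write(data)             # hands over a reference, no extra copy
        copied += len(data)            # count what was actually read
    return copied


# Stand-ins for urllib2.urlopen(...) and open(pathname, 'wb')
source = io.BytesIO(b"x" * 150000)
target = io.BytesIO()
total = copy_stream(source, target)
```

The blocksize argument is the knob discussed in the thread; timing a
few different values against the real device is probably the only way
to find out which one suits it.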