Fastest way to retrieve and write html contents to file

DFS nospam at dfs.com
Mon May 2 03:37:09 EDT 2016


On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>>     r = urllib2.urlopen(webpage)
>>     f = open(webfile, "w")
>>     f.write(r.read())
>>     f.close()   # was f.close without (), which never actually closes the file
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" % (endTime - startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.

You get 1.8 seconds total for the 10 loops?  That's less than half the 
speed I'm seeing here.  Surprising.
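
One thing worth flagging when comparing numbers across machines: in 
Python 2, time.clock() is wall-clock time on Windows but CPU time on 
Unix, so two systems can report very different figures for the same 
I/O-bound job.  time.time() is the safer cross-platform choice.  A 
minimal sketch, reusing the webpage/webfile/loops values from the 
snippet above (their actual values aren't shown in the thread):

    import time
    import urllib2

    startTime = time.time()            # wall-clock on every platform
    for i in range(loops):
        r = urllib2.urlopen(webpage)
        with open(webfile, "w") as f:  # closes the file even on error
            f.write(r.read())
    endTime = time.time()
    print "Finished urllib2 in %.2g seconds" % (endTime - startTime)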


> I'm again going back to the point of: its fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know Python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.
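
For instance, the whole fetch-and-save above collapses to one line with 
urllib's urlretrieve, which streams the response straight to a file 
(again reusing the webpage/webfile names from the snippet):

    import urllib
    # one call: download webpage and write it to webfile
    urllib.urlretrieve(webpage, webfile)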


> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.

True.  And that has been my assumption - though not with a 10 MB file.


> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.

Good point.  Test below.


> If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854-byte file when saved to disk (the smaller file was 3546 
bytes), so this one is 16.6x larger.  If the time scaled linearly, I'd 
expect Python to take 16.6 * 0.88 = 14.6 seconds.
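
The arithmetic behind that estimate, as a quick sanity check (the 0.88 
being my earlier small-file timing):

    # naive linear-scaling estimate from the two page sizes (bytes)
    small, large = 3546, 58854
    ratio = large / float(small)                     # ~16.6x larger
    print "expected: %.1f seconds" % (ratio * 0.88)  # ~14.6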

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

The actual times are a little more than 1/3 of my linear estimate - so 
good news.

(While I was running these tests, some of the Python results were 0.75 
seconds - way too fast.  I checked: no data had been written to the 
file, and I couldn't even open the webpage in a browser.  It looks like 
the site had temporarily blocked me.  After a couple of minutes, I was 
able to access it again.)
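
To make a block like that show up as an error instead of a suspiciously 
fast run, it's worth catching HTTP errors and pausing between hits.  A 
minimal sketch with urllib2 (the 1-second delay is an arbitrary choice):

    import time
    import urllib2

    for i in range(loops):
        try:
            r = urllib2.urlopen(webpage)
        except urllib2.HTTPError as e:   # 4xx/5xx raise rather than return
            print "request failed with status", e.code
            break
        with open(webfile, "w") as f:
            f.write(r.read())
        time.sleep(1)                    # be gentler on the site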

I noticed urllib and curl returned the HTML as-is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed, functionality, and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small 
amount of web scraping).
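
A minimal fetch-and-save with requests, for reference - r.content is 
the raw bytes as received, which is the safest thing to write to disk 
(r.text would be the decoded text):

    import requests

    r = requests.get(webpage)
    r.raise_for_status()           # error out on a 4xx/5xx response
    with open(webfile, "wb") as f:
        f.write(r.content)         # raw bytes, exactly as received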


VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So Python matches or beats VBScript on this much larger file.  Kewl.




