Fastest way to retrieve and write html contents to file

Chris Angelico rosuav at gmail.com
Mon May 2 03:19:01 EDT 2016


On Mon, May 2, 2016 at 4:47 PM, DFS <nospam at dfs.com> wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look).  If you have Windows, you can try it.
> -------------------------------------------------------------------
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut    = Nothing
>  Set fso     = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) & " seconds"
> -------------------------------------------------------------------

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).
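
The linearity check above can be sketched in a few lines of Python. Here `fake_fetch` is a stand-in for a real network request (a deliberate `time.sleep` instead of an HTTP call), so the names and delay are illustrative, not part of the original benchmark:

```python
import time

def time_iterations(fetch, n):
    """Run fetch() n times and return the total elapsed seconds."""
    start = time.time()
    for _ in range(n):
        fetch()
    return time.time() - start

def fake_fetch():
    # Stand-in for a real network request. A cached result would return
    # almost instantly instead of paying this cost on every iteration.
    time.sleep(0.01)

t10 = time_iterations(fake_fetch, 10)
t100 = time_iterations(fake_fetch, 100)
# With no caching, the ratio should come out close to 10.
print("ratio: %.1f" % (t100 / t10))
```

If the ratio comes out far below 10 with a real fetch function, something (a cache, a keep-alive connection, the OS) is amortizing the cost across iterations.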

Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:

--------

import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

-------

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.
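
A minimal client-side timer for that test might look like the following sketch. To keep it self-contained it embeds its own copy of the slow server on a background thread, with the delay shortened to 0.2 seconds and only three iterations so the demo finishes quickly; against the real server you would just point `urlopen` at http://127.0.0.1:1234/ instead:

```python
import threading
import time
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
    from urllib.request import urlopen
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2
    from urllib2 import urlopen

DELAY = 0.2  # shortened from the 2 s above so the demo runs fast

class SlowHTTP(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(DELAY)  # simulate a distant or overloaded server
        self.wfile.write(b"world!")
    def log_message(self, *args):
        pass  # suppress per-request logging to keep the demo output clean

server = HTTPServer(("127.0.0.1", 0), SlowHTTP)  # port 0 = pick a free port
port = server.server_address[1]
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

start = time.time()
bodies = [urlopen("http://127.0.0.1:%d/" % port).read() for _ in range(3)]
elapsed = time.time() - start
server.shutdown()

# Every request has to wait out the artificial delay, so with no caching
# the total should be roughly iterations * DELAY.
print(bodies[0], "%.1f seconds" % elapsed)
```

If any client finishes in much less than iterations times the delay, it isn't really hitting the server every time.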

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.
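
One way to take DNS out of the picture is to resolve the name once, up front, and build the URL from the raw IP. A sketch (shown with localhost so it runs anywhere; with the real benchmark you would pass the actual hostname):

```python
import socket

def url_with_ip(host, path="/", port=80):
    """Resolve host once and build a URL that skips per-request DNS lookups."""
    ip = socket.gethostbyname(host)  # one IPv4 lookup, done up front
    return "http://%s:%d%s" % (ip, port, path)

# localhost keeps the demo self-contained; substitute the benchmark's
# real hostname to take DNS out of those timings.
print(url_with_ip("localhost", "/ex/001.html", 1234))
```

Note this only removes the lookup from the timed loop; the OS resolver may have been caching it anyway, which is exactly why it can skew a naive benchmark.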

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA
