Fetching a clean copy of a changing web page

Diez B. Roggisch deets at nospam.web.de
Mon Jul 16 02:05:32 EDT 2007


John Nagle wrote:
>    I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/".  This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
> 
>    Any good way to deal with this, short of reading it twice
> and comparing?

Making them fix the obvious bug they have would be best, of course.

Apart from that, the only thing you could try is to run a SAX parser
on the input stream immediately, so that if the XML is malformed
because of the way they serve it, you at least find out as soon as
possible. But that will only shave off a few moments.
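
A minimal sketch of that idea, assuming the standard-library xml.sax
incremental parser is acceptable. The function name fetch_checked and
the chunk size are made up for illustration. Note that this only checks
well-formedness: a spliced file that happens to remain well-formed XML
would still get through.

```python
import io
import xml.sax


def fetch_checked(stream, chunk_size=64 * 1024):
    """Read a binary stream chunk by chunk, SAX-parsing as data arrives.

    Returns the raw bytes if the document is well-formed. Raises
    xml.sax.SAXParseException at the first chunk that breaks the XML,
    so a bad download is abandoned early instead of read to the end.
    (fetch_checked is a hypothetical helper, not part of any library.)
    """
    parser = xml.sax.make_parser()
    # The default ContentHandler ignores all events; only errors matter here.
    parser.setContentHandler(xml.sax.ContentHandler())
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)   # raises as soon as the XML stops being well-formed
        chunks.append(chunk)
    parser.close()           # raises if the document ended prematurely
    return b"".join(chunks)


if __name__ == "__main__":
    # Stand-in for the HTTP response; any object with read() works.
    sample = io.BytesIO(b"<sites><entry>example</entry></sites>")
    print(len(fetch_checked(sample)), "bytes, well-formed")
```

With a real download, the object returned by urllib.request.urlopen(url)
could be passed as the stream, and the caller would retry on
SAXParseException.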

Diez


