Fetching a clean copy of a changing web page
Diez B. Roggisch
deets at nospam.web.de
Mon Jul 16 02:05:32 EDT 2007
John Nagle wrote:
> I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/". This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
> Any good way to deal with this, short of reading it twice
> and comparing?
Getting them to fix the obvious bug on their end would be best, of course.
Apart from that, the only thing you could try is to run a SAX parser
on the input stream immediately, so that if the XML is malformed
because of the way they serve it, you find out as soon as possible. But it
will only shave off a few moments.
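
A rough sketch of what I mean, using the incremental interface of the
standard library's xml.sax parser. The function name and the chunk size
are made up for illustration, and it only detects XML that is not
well-formed (including a truncated document); a mixed-up file that
happens to stay well-formed would slip through.

```python
import io
import xml.sax


def stream_is_well_formed(stream, chunk_size=64 * 1024):
    """Feed a binary stream to a SAX parser chunk by chunk.

    Returns False as soon as the parser reports an error, so a broken
    download is noticed without first reading the whole file.
    `stream` is any object with a read(n) method returning bytes.
    """
    parser = xml.sax.make_parser()
    # The base ContentHandler ignores every event; swap in your own
    # handler here to actually process the data while it is checked.
    parser.setContentHandler(xml.sax.ContentHandler())
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)  # raises on the first malformed construct
        parser.close()  # a document that ends too early is reported here
    except xml.sax.SAXParseException:
        return False
    return True
```

The object returned by urllib.request.urlopen() has a suitable read()
method, so it should be usable as `stream` directly; on a False result
you would simply fetch the file again.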
Diez