Fetching a clean copy of a changing web page

Mon Jul 16 04:50:40 EDT 2007

Diez B. Roggisch wrote:
> John Nagle wrote:
>>    I'm reading the PhishTank XML file of active phishing sites,
>> at "http://data.phishtank.com/data/online-valid/". This changes
>> frequently, and it's big (about 10MB right now) and on a busy server.
>> So once in a while I get a bogus copy of the file because the file
>> was rewritten while being sent by the server.
>>
>>    Any good way to deal with this, short of reading it twice
>> and comparing?
> 
> Apart from that - the only thing you could try is to apply a SAX parser
> to the input stream immediately, so that if the XML is not well-formed
> because of the way they serve it, you find that out ASAP.
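
For reference, the incremental SAX check suggested above could look roughly
like the sketch below, using only the standard library's xml.sax. The function
name stream_is_well_formed and the chunk size are made up for illustration;
the stream can be any object with a read() method, such as the response
returned by urlopen().

```python
import io
import xml.sax


def stream_is_well_formed(stream, chunk_size=64 * 1024):
    """Feed a byte stream to a SAX parser chunk by chunk.

    Returns False as soon as the parser hits a well-formedness error,
    so a damaged download is detected without reading it to the end.
    (Hypothetical helper, not part of any library.)
    """
    parser = xml.sax.make_parser()
    # A do-nothing handler: we only care whether parsing succeeds.
    parser.setContentHandler(xml.sax.ContentHandler())
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
        # close() reports documents that end too early (truncated file).
        parser.close()
    except xml.sax.SAXParseException:
        return False
    return True


if __name__ == "__main__":
    print(stream_is_well_formed(io.BytesIO(b"<sites><site/></sites>")))
    print(stream_is_well_formed(io.BytesIO(b"<sites><site>")))
```

Note that this only catches copies that are broken as XML. A file that was
swapped mid-transfer but still happens to be well-formed would pass.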

Sure, if you want to use lxml.etree, you can pass the URL right into
etree.parse() and it will throw an exception if parsing from the URL fails to
yield a well-formed document.

http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html
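
A minimal sketch of what I mean, assuming lxml is installed. The helper name
fetch_parsed is made up for this example. etree.parse() accepts a file name,
a URL or a file-like object; whether a URL can be loaded directly depends on
how the underlying libxml2 was built, so passing the object returned by
urlopen() is a reasonable fallback.

```python
import io

from lxml import etree


def fetch_parsed(source):
    """Parse `source` (URL, file name or file-like object).

    Returns the parsed tree, or None if the document could not be
    loaded or is not well-formed. (Hypothetical helper.)
    """
    try:
        return etree.parse(source)
    except etree.XMLSyntaxError:
        # The document was damaged or cut off during transfer.
        return None
    except OSError:
        # The source could not be read at all.
        return None


if __name__ == "__main__":
    good = fetch_parsed(io.BytesIO(b"<sites><site/></sites>"))
    bad = fetch_parsed(io.BytesIO(b"<sites><site>"))
    print(good is not None, bad is None)
```

A caller could simply retry the download whenever None comes back.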

BTW, the time spent parsing the document and serialising it back to a string
is most likely dominated by the time it takes to transfer the document over
the network, so this will not be much slower than reading it using urlopen()
and the like.
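
Parsing and serialising back could be sketched like this, again assuming lxml
is installed. The name clean_copy is only illustrative. The function raises
etree.XMLSyntaxError if the input is not well-formed, so a byte string is
only returned for a document that parsed completely.

```python
import io

from lxml import etree


def clean_copy(source):
    """Parse `source` and return the document re-serialised as bytes.

    (Hypothetical helper.) etree.parse() raises etree.XMLSyntaxError
    for a damaged document, so no partial copy is ever returned.
    """
    tree = etree.parse(source)
    return etree.tostring(tree, xml_declaration=True, encoding="UTF-8")


if __name__ == "__main__":
    data = clean_copy(io.BytesIO(b"<sites><site/></sites>"))
    print(data.decode("utf-8"))
```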

Stefan