Fetching a clean copy of a changing web page

John Nagle nagle at animats.com
Mon Jul 16 13:14:42 EDT 2007


Miles wrote:
> On Jul 16, 1:00 am, John Nagle <na... at animats.com> wrote:
> 
>>    I'm reading the PhishTank XML file of active phishing sites,
>>at "http://data.phishtank.com/data/online-valid/"  This changes
>>frequently, and it's big (about 10MB right now) and on a busy server.
>>So once in a while I get a bogus copy of the file because the file
>>was rewritten while being sent by the server.
>>
>>    Any good way to deal with this, short of reading it twice
>>and comparing?
>>
>>                                John Nagle
> 
> 
> Sounds like that's the host's problem--they should be using atomic
> writes, which is usually done by renaming the new file on top of the
> old one.  How "bogus" are the bad files?  If it's just incomplete,
> then since it's XML, it'll be missing the "</output>" and you should
> get a parse error if you're using a suitable strict parser.  If it's
> mixed old data and new data, but still manages to be well-formed XML,
> then yes, you'll probably have to read it twice.
> 
> -Miles
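
      (A quick well-formedness check along the lines Miles suggests
can use the standard xml.sax parser; this is a sketch, and
looks_complete is an illustrative name:)

import xml.sax

def looks_complete(data):
    # Run the document through a strict parser.  A copy truncated
    # mid-transfer will be missing its closing tag, and the parser
    # raises SAXParseException.
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False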

      Yes, they're updating it non-atomically.
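
      What the host should be doing, per Miles's suggestion, is
roughly this on their end (a minimal sketch, not their actual code;
write_atomically and the ".tmp" suffix are illustrative):

import os

def write_atomically(path, data):
    # Write the new contents to a temporary file in the same
    # directory, then rename it over the old file.  On POSIX,
    # rename() replaces the target atomically, so a reader sees
    # either the complete old file or the complete new one.
    tmppath = path + ".tmp"   # a unique name (tempfile.mkstemp) is safer
    f = open(tmppath, "wb")
    try:
        f.write(data)
    finally:
        f.close()
    os.rename(tmppath, path)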

      I'm now reading it twice and comparing, which works.
Actually, the file is fetched up to five times, until the same
contents appear twice in a row.  Two tries usually suffice, but
if the server is mid-update, it may take more.

      Ugly, and doubles the load on the server, but necessary to
get a consistent copy of the data.
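
      The retry loop is roughly the following (a sketch, not the
actual code; urllib2 and the names here are illustrative):

import urllib2

MAX_TRIES = 5

def fetch_stable(url):
    # Fetch the URL repeatedly until two consecutive fetches return
    # identical contents; give up after MAX_TRIES attempts.
    previous = None
    for attempt in range(MAX_TRIES):
        current = urllib2.urlopen(url).read()
        if current == previous:
            return current        # same contents twice in a row
        previous = current
    raise IOError("no two consecutive fetches of %s matched" % url)

data = fetch_stable("http://data.phishtank.com/data/online-valid/")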

					John Nagle


