Fetching a clean copy of a changing web page

Steve Holden steve at holdenweb.com
Mon Jul 16 21:26:37 EDT 2007


John Nagle wrote:
> Miles wrote:
>> On Jul 16, 1:00 am, John Nagle <na... at animats.com> wrote:
>>
>>>    I'm reading the PhishTank XML file of active phishing sites,
>>> at "http://data.phishtank.com/data/online-valid/"  This changes
>>> frequently, and it's big (about 10MB right now) and on a busy server.
>>> So once in a while I get a bogus copy of the file because the file
>>> was rewritten while being sent by the server.
>>>
>>>    Any good way to deal with this, short of reading it twice
>>> and comparing?
>>>
>>>                                John Nagle
>>
>> Sounds like that's the host's problem--they should be using atomic
>> writes, which is usually done by renaming the new file on top of
>> the old one.  How "bogus" are the bad files?  If it's just
>> incomplete, then since it's XML it'll be missing the closing
>> "</output>" tag, and you should get a parse error if you're using a
>> suitably strict parser.  If it's mixed old data and new data, but
>> still manages to be well-formed XML, then yes, you'll probably have
>> to read it twice.
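
Right, and that strict-parser check is only a few lines.  A sketch
(looks_complete is my name for it, and I haven't run this against the
actual PhishTank feed; ElementTree's parse-error class varies between
versions, hence the broad except):

import xml.etree.ElementTree as ET

def looks_complete(data):
    # A truncated download won't be well-formed XML, so a parse
    # failure means the caller should re-fetch.
    try:
        ET.fromstring(data)
    except Exception:
        return False
    return True

As you note, though, that only catches truncation, not a well-formed
mix of old and new data.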
> 
>     The files don't change much from update to update; typically they
> contain about 10,000 entries, and about 5-10 change every hour.  So
> the odds of getting a seemingly valid XML file with incorrect data
> are reasonably good.
> 
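Given that, the only robust client-side answer really is your
read-it-twice idea: keep fetching until two consecutive copies are
byte-identical.  A sketch (untested; it at least doubles the load on
an already busy server, so the pause between tries matters):

import hashlib
import time
import urllib2

def fetch_stable(url, max_tries=5, pause=30):
    """Re-fetch until two consecutive copies are byte-identical."""
    last = None
    for _ in range(max_tries):
        data = urllib2.urlopen(url).read()
        digest = hashlib.sha1(data).hexdigest()
        if digest == last:
            return data        # two matching copies in a row
        last = digest
        time.sleep(pause)      # be polite to a busy server
    raise IOError("no two consecutive fetches of %s matched" % url)
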
I'm still left wondering what the hell kind of server process will start 
serving one copy of a file and complete the request from another. Oh, well.
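The fix on the publishing side is the rename trick Miles describes
above: write the new copy next to the old one, then rename it into
place.  Something like this (a sketch; os.rename over an existing
file is atomic on POSIX but fails on Windows):

import os

def publish(path, data):
    # Write the new copy alongside the old one, then rename() it into
    # place.  A reader mid-download keeps the old file; nobody ever
    # sees a half-written mix.
    tmp = path + ".tmp"
    f = open(tmp, "wb")
    try:
        f.write(data)
    finally:
        f.close()
    os.rename(tmp, path)
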

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden



