Fetching a clean copy of a changing web page

Miles semanticist at gmail.com
Mon Jul 16 02:09:50 EDT 2007


On Jul 16, 1:00 am, John Nagle <na... at animats.com> wrote:
>     I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/"  This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
>     Any good way to deal with this, short of reading it twice
> and comparing?
>
>                                 John Nagle

Sounds like that's the host's problem--they should be using atomic
writes, which is usually done by renaming the new file on top of the
old one (a sketch of that pattern is below).  How "bogus" are the bad
files?  If it's just incomplete, then since it's XML, it'll be missing
the closing "</output>" tag and you should get a parse error if you're
using a suitably strict parser.  If it's mixed old data and new data,
but still manages to be well-formed XML, then yes, you'll probably
have to read it twice and compare.
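
Untested, but the usual atomic-write pattern on the server side looks
something like this (POSIX only, since rename() isn't atomic over an
existing file on Windows; the function name and file handling are just
made up for illustration):

import os
import tempfile

def atomic_write(path, data):
    # Create the temp file in the same directory as the target:
    # rename is only atomic within a single filesystem.
    dirname = os.path.dirname(path) or '.'
    fd, tmppath = tempfile.mkstemp(dir=dirname)
    try:
        f = os.fdopen(fd, 'wb')
        try:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to disk first
        finally:
            f.close()
        os.rename(tmppath, path)   # readers see old or new, never a mix
    except:
        os.remove(tmppath)
        raise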
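
On your end, a strict parser plus a retry loop should be enough to
reject truncated copies.  Something like this (also untested; the URL
is the one from your post, and the retry count and delay are
arbitrary):

import time
import urllib2
import xml.etree.ElementTree as ET
from xml.parsers import expat

URL = 'http://data.phishtank.com/data/online-valid/'

def fetch_well_formed(url, retries=3, delay=60):
    # Keep fetching until the body parses as well-formed XML.
    for attempt in range(retries):
        data = urllib2.urlopen(url).read()
        try:
            return ET.fromstring(data)    # truncated XML raises here
        except (expat.ExpatError, SyntaxError):
            time.sleep(delay)             # probably caught mid-rewrite
    raise IOError("no well-formed copy after %d tries" % retries)

For the mixed-old-and-new case that still parses, the same loop could
instead fetch twice and only accept the data once two consecutive
downloads are byte-identical.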

-Miles