Fetching a clean copy of a changing web page
Carsten Haese
carsten at uniqsys.com
Tue Jul 17 00:05:59 EDT 2007
On Tue, 2007-07-17 at 00:47 +0000, John Nagle wrote:
> Miles wrote:
> > On Jul 16, 1:00 am, John Nagle <na... at animats.com> wrote:
> >
> >> I'm reading the PhishTank XML file of active phishing sites,
> >>at "http://data.phishtank.com/data/online-valid/" This changes
> >>frequently, and it's big (about 10MB right now) and on a busy server.
> >>So once in a while I get a bogus copy of the file because the file
> >>was rewritten while being sent by the server.
> >>
> >> Any good way to deal with this, short of reading it twice
> >>and comparing?
> >>
> >> John Nagle
> >
> >
> > Sounds like that's the host's problem--they should be using atomic
> > writes, which is usually done by renaming the new file on top of the
> > old one. How "bogus" are the bad files? If it's just incomplete,
> > then since it's XML, it'll be missing the "</output>" and you should
> > get a parse error if you're using a suitable strict parser. If it's
> > mixed old data and new data, but still manages to be well-formed XML,
> > then yes, you'll probably have to read it twice.
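[As an illustration of the parse-error check suggested above (an editorial sketch, not part of the original thread), a strict well-formedness test using Python's standard xml.etree.ElementTree module might look like this:]

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # A truncated file (e.g. missing the closing </output>) lands here.
        return False

print(is_well_formed("<output><entry/></output>"))  # complete document
print(is_well_formed("<output><entry/>"))           # truncated, no </output>
```

[This only catches truncation; as noted above, a mix of old and new data that still happens to be well-formed XML would pass this check.]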
>
> The files don't change much from update to update; typically they
> contain about 10,000 entries, and about 5-10 change every hour. So
> the odds of getting a seemingly valid XML file with incorrect data
> are reasonably good.
Does the server return a reliable last-modified timestamp? If yes, you
can do something like this:
import urllib

prev_last_mod = None
while True:
    u = urllib.urlopen(theUrl)
    if prev_last_mod == u.headers['last-modified']:
        # Timestamp unchanged since the previous read, so the copy
        # already in `contents` is consistent.
        u.close()
        break
    prev_last_mod = u.headers['last-modified']
    contents = u.read()
    u.close()
That way, you only re-read the file when the timestamp says it has
actually changed, instead of always reading it twice just to check
whether it changed.
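[Editorial aside, not part of the original reply: the same idea can be pushed to the server with an HTTP conditional GET, so an unchanged file is never downloaded at all. A sketch, using the modern urllib.request API (in 2007 this lived in urllib2); the helper name fetch_if_modified and the injectable `opener` parameter are illustrative assumptions:]

```python
import urllib.request
import urllib.error

def fetch_if_modified(url, last_mod=None, opener=urllib.request.urlopen):
    """Conditional GET sketch: return (contents, last_modified), or
    (None, last_mod) if the server answers 304 Not Modified.
    `opener` is injectable so the logic can be exercised without a network."""
    request = urllib.request.Request(url)
    if last_mod is not None:
        # Ask the server to send the body only if it changed since last_mod.
        request.add_header('If-Modified-Since', last_mod)
    try:
        u = opener(request)
    except urllib.error.HTTPError as e:
        if e.code == 304:  # unchanged: keep the copy we already have
            return None, last_mod
        raise
    contents = u.read()
    last_modified = u.headers.get('Last-Modified')
    u.close()
    return contents, last_modified
```

[This relies on the server honoring If-Modified-Since, which not every server does, so the timestamp-comparison loop above remains the safe fallback.]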
HTH,
--
Carsten Haese
http://informixdb.sourceforge.net