Fetching a clean copy of a changing web page
Amit Khemka
khemkaamit at gmail.com
Mon Jul 16 04:34:26 EDT 2007
On 7/16/07, John Nagle <nagle at animats.com> wrote:
> I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/". This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
> Any good way to deal with this, short of reading it twice
> and comparing?
>
If you have:
1. A ballpark estimate of the size of the XML
2. Some footer or "last tag" in the XML
maybe you can use these to check each downloaded copy and catch the "bogus" ones!
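The two checks above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name looks_complete, the min_size threshold and the closing tag you pass in are all hypothetical (I have not checked what the real PhishTank root element is called), and the final well-formedness parse is an extra step I am adding on top of the size and footer checks.

```python
import xml.etree.ElementTree as ET

def looks_complete(data, min_size, closing_tag):
    """Cheap sanity checks on a downloaded XML document held as bytes.

    min_size and closing_tag are values the caller has to supply;
    they are assumptions about the feed, not known properties of it.
    """
    # 1. Size check: a truncated copy is usually much smaller than expected.
    if len(data) < min_size:
        return False
    # 2. Footer check: the data should end with the final closing tag.
    if not data.rstrip().endswith(closing_tag):
        return False
    # 3. Extra check (not part of the two points above): the whole
    #    document must parse as well-formed XML.
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True
```

If the function returns False you would simply fetch the file again after a short pause. Note that none of these checks can prove the copy is consistent; they only catch the obviously truncated or spliced ones.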
cheers,
--
----
Amit Khemka
website: www.onyomo.com
wap-site: www.owap.in
Home Page: www.cse.iitd.ernet.in/~csd00377
Endless the world's turn, endless the sun's Spinning, Endless the quest;
I turn again, back to my own beginning, And here, find rest.