Fetching a clean copy of a changing web page

Amit Khemka khemkaamit at gmail.com
Mon Jul 16 04:34:26 EDT 2007


On 7/16/07, John Nagle <nagle at animats.com> wrote:
>     I'm reading the PhishTank XML file of active phishing sites,
> at "http://data.phishtank.com/data/online-valid/".  This changes
> frequently, and it's big (about 10MB right now) and on a busy server.
> So once in a while I get a bogus copy of the file because the file
> was rewritten while being sent by the server.
>
>     Any good way to deal with this, short of reading it twice
> and comparing?
>
If you have:
1. A ballpark estimate of the size of the XML file
2. Some footer or known "last tag" in the XML

then maybe you can use those to check each download and catch the "bogus" copies!
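
Something along these lines might work as a rough sketch. The names looks_complete, download and fetch_clean are made up for illustration, and the minimum size and closing tag are things you would have to fill in yourself from looking at a known-good copy of the feed (I have not checked what the real last tag is). It only catches truncated copies, not every possible kind of corruption:

```python
import time
import urllib.request


def looks_complete(data, min_size, closing_tag):
    """Return True if data is at least min_size bytes and ends with closing_tag."""
    if len(data) < min_size:
        return False
    # Ignore trailing whitespace/newlines after the last tag.
    return data.rstrip().endswith(closing_tag)


def download(url):
    """Read the whole response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_clean(url, min_size, closing_tag, attempts=3, delay=5.0, fetch=download):
    """Retry until a copy passes looks_complete(); return None if none does.

    min_size and closing_tag are assumptions supplied by the caller;
    fetch can be swapped out for testing.
    """
    for attempt in range(attempts):
        data = fetch(url)
        if looks_complete(data, min_size, closing_tag):
            return data
        if attempt < attempts - 1:
            # Give the server a moment to finish rewriting the file.
            time.sleep(delay)
    return None
```

You would call it with something like fetch_clean(url, 5 * 1024 * 1024, b"</sometag>"), where both the size and the tag are placeholders, and treat a None result as "no clean copy this time".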

cheers,

-- 
----
Amit Khemka
website: www.onyomo.com
wap-site: www.owap.in
Home Page: www.cse.iitd.ernet.in/~csd00377

Endless the world's turn, endless the sun's Spinning, Endless the quest;
I turn again, back to my own beginning, And here, find rest.


