Fetching a clean copy of a changing web page

star.public at gmail.com
Tue Jul 17 13:16:08 EDT 2007


On Jul 16, 4:50 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> Diez B. Roggisch wrote:
> > John Nagle wrote:
> >>    I'm reading the PhishTank XML file of active phishing sites,
> >> at "http://data.phishtank.com/data/online-valid/"  This changes
> >> frequently, and it's big (about 10MB right now) and on a busy server.
> >> So once in a while I get a bogus copy of the file because the file
> >> was rewritten while being sent by the server.
>
> >>    Any good way to deal with this, short of reading it twice
> >> and comparing?
>
> > Apart from that - the only thing you could try is to apply a SAX parser
> > to the input stream immediately, so that if the XML is invalid because
> > of the way they serve it, you find out ASAP.
>
> Sure, if you want to use lxml.etree, you can pass the URL right into
> etree.parse() and it will throw an exception if parsing from the URL fails to
> yield a well-formed document.
>
> http://codespeak.net/lxml/
> http://codespeak.net/lxml/dev/parsing.html
>
> BTW, parsing and serialising it back to a string is most likely dominated by
> the time it takes to transfer the document over the network, so it will not be
> much slower than reading it using urlopen() and the like.
>
> Stefan
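
A minimal sketch of that fail-fast approach using only the stdlib parser
(the truncated sample document and the retry message are made up for
illustration; for the real feed you would pass a urllib urlopen() response
object instead of the BytesIO):

```python
import io
import xml.etree.ElementTree as ET

def fetch_tree(source):
    """Parse XML from a file-like object.  Raises ET.ParseError if the
    document is truncated or otherwise not well-formed -- e.g. because
    the server rewrote the file mid-transfer."""
    return ET.parse(source)

# Simulate a copy that was rewritten while being sent (truncated XML):
broken = io.BytesIO(b"<output><entry><url>http://x</url></entry")
try:
    fetch_tree(broken)
except ET.ParseError:
    print("got a bogus copy -- retry the download")
```

For the live feed you'd wrap the download in a retry loop, catching
ET.ParseError and re-fetching a limited number of times.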

xml.etree.ElementTree is in the standard lib now, too. Also
xml.etree.cElementTree, which has the same interface but is blindingly
fast. (I'm working on a program which needs to read/recreate the
(badly designed, horrible, evil) iTunes Library XML; mine is about
10 MB, and cElementTree parses it in under a second using 60 MB of RAM,
whereas minidom takes around two minutes and 600+ MB to do the same
thing.)

(I mean really -- the playlists are stored as five megs of lists whose
elements are one-entry dictionaries, every one looking exactly like
this: <dict>\n<key>Track ID</key><integer>4521</integer>\n</dict>
\n --- </rant>)
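
Those one-entry <dict> blocks can still be pulled apart with ElementTree
by pairing each <key> with the element that follows it. A sketch (the
sample playlist fragment below is invented, modeled on the structure
quoted above; note the stdlib also ships plistlib for Apple's plist
format):

```python
import io
import xml.etree.ElementTree as ET

# Invented fragment mimicking the iTunes playlist structure above.
PLAYLIST = b"""<array>
<dict><key>Track ID</key><integer>4521</integer></dict>
<dict><key>Track ID</key><integer>4523</integer></dict>
</array>"""

def read_dict(dict_elem):
    """Pair each <key> child with the value element that follows it."""
    children = list(dict_elem)
    return {k.text: v.text for k, v in zip(children[::2], children[1::2])}

root = ET.parse(io.BytesIO(PLAYLIST)).getroot()
track_ids = [int(read_dict(d)["Track ID"]) for d in root]
print(track_ids)  # -> [4521, 4523]
```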

--
<weaver>star</weaver>



