[Tutor] MemoryError !!! Help Required

Andreas Kostyrka andreas at kostyrka.org
Mon Apr 7 08:25:11 CEST 2008


On Monday, 07.04.2008, at 00:32 -0500, Luke Paireepinart wrote:
> devj wrote:
> > Hi,
> > I am making a web crawler using Python. To avoid duplicate URLs, I have to
> > maintain lists of downloaded URLs and to-be-downloaded URLs, of which the
> > latter grows exponentially, resulting in a MemoryError exception. What are
> > the possible ways to avoid this?
> >   
> get more RAM, store the list on your hard drive, etc. etc. 
> Why are you trying to do this?  Are you sure you can't use existing 
> tools for this such as wget?
> -Luke

Also, traditional solutions involve e.g. remembering only a hash value of
each URL instead of the full URL.
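
A minimal sketch of the hash idea, assuming Python 3.6+ (for
hashlib.blake2b). The class name SeenUrls and the 8-byte digest size are
made up for illustration; they are not from any library. Note that a
truncated digest can collide, in which case a new URL would wrongly be
treated as already seen, and that the memory saving only matters when the
URLs are long compared to the digest.

```python
import hashlib


class SeenUrls:
    """Remember URLs by a fixed-size digest instead of the full string.

    Hypothetical helper for illustration only.
    """

    def __init__(self, digest_size=8):
        # digest_size is in bytes; smaller saves memory but raises
        # the chance of two different URLs sharing a digest.
        self.digest_size = digest_size
        self._digests = set()

    def _digest(self, url):
        data = url.encode("utf-8")
        return hashlib.blake2b(data, digest_size=self.digest_size).digest()

    def add(self, url):
        """Return True if the URL was new, False if it was already seen."""
        d = self._digest(url)
        if d in self._digests:
            return False
        self._digests.add(d)
        return True

    def __contains__(self, url):
        return self._digest(url) in self._digests

    def __len__(self):
        return len(self._digests)
```

In the crawler you would call add() on every extracted link and only queue
the link when it returns True.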

Plus, if you go for a simple file-based solution, you should probably
store the URLs by hostname, e.g.:
http://123.45.67.87/abc/def/text.html =>
open("123/45/67/87", "a").write("/abc/def/text.html\n")
(I guess you need to run os.makedirs as needed :-P)

This makes it scalable (by not storing too many files in one
directory, and by leaving out the common element so the files are
smaller and faster to read), while keeping the code relatively simple.
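
A rough sketch of the per-host layout, assuming Python 3. The function
names host_file and remember, and the file name paths.txt, are invented
for this example. It puts a paths.txt inside the host directory rather
than using the last host component as the file itself, so that a host
like example.co cannot clash with example.co.uk (one would need "co" as a
file, the other as a directory).

```python
import os
from urllib.parse import urlsplit


def host_file(root, url):
    """Map a URL to its per-host file,
    e.g. 123.45.67.87 -> root/123/45/67/87/paths.txt (hypothetical layout)."""
    host = urlsplit(url).hostname or "unknown"
    return os.path.join(root, *host.split("."), "paths.txt")


def remember(root, url):
    """Record the URL's path under its host.

    Returns True if the path was new, False if it was already recorded.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    filename = host_file(root, url)
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    # Only this host's file is scanned, so memory use stays small.
    if os.path.exists(filename):
        with open(filename, encoding="utf-8") as f:
            if any(line.rstrip("\n") == path for line in f):
                return False

    with open(filename, "a", encoding="utf-8") as f:
        f.write(path + "\n")
    return True
```

The linear scan per host is the simple part of the trade-off: it is fine
while each host has a modest number of paths, and it keeps nothing in RAM
between calls.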

Another solution would be shelve, but you have to keep in mind that if
you are unlucky you might lose the database. (Some of the DBs that
anydbm might pick do not survive power loss or other problems too well.)
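
A small sketch of the shelve variant, assuming Python 3.4+ (where
shelve.open works as a context manager and anydbm has become the dbm
package). The function name filter_new is made up for this example.

```python
import shelve


def filter_new(db_path, urls):
    """Return the URLs not yet recorded in the shelf and record them.

    Hypothetical helper; db_path is the shelf's base file name
    (the dbm backend may add its own extension).
    """
    new = []
    with shelve.open(db_path) as seen:
        for url in urls:
            if url not in seen:
                seen[url] = True  # the value is unused, only the key matters
                new.append(url)
        # Flush to disk; this narrows the window for data loss,
        # but does not make the underlying dbm crash-safe.
        seen.sync()
    return new
```

Calling filter_new again with the same URLs returns an empty list, also
after a restart, because the keys live on disk instead of in a Python
list.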

Andreas
