which datastructure for fast sorted insert?

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Sun May 25 22:25:33 EDT 2008


On Sun, 25 May 2008 22:42:06 -0300, <notnorwegian at yahoo.se> wrote:

> def joinSets(set1, set2):
>     for i in set2:
>         set1.add(i)
>     return set1

There's no need for that function: use the | operator to get the union of two sets, or |= to update a set in place.
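For example (a minimal sketch; the values are made up):

```python
set1 = set([1, 2, 3])
set2 = set([3, 4, 5])

# | returns a new set and leaves both operands unchanged.
combined = set1 | set2   # set([1, 2, 3, 4, 5])

# |= updates set1 in place with the union, like your joinSets did.
set1 |= set2             # set1 is now set([1, 2, 3, 4, 5])
```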

> Traceback (most recent call last):
>   File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
>     x = scrapeSites("http://www.yahoo.com")
>   File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
> scrapeSites
>     site = iterator.next()
> RuntimeError: Set changed size during iteration

You will need two sets: the one you're iterating over, and another collecting the new urls. Once you finish iterating over the first, continue with the new ones; stop when no new urls remain.

> def scrapeSites(startAddress):
>     site = startAddress
>     sites = set()
>     iterator = iter(sites)
>     pos = 0
>     while pos < 10:#len(sites):
>         newsites = scrapeSite(site)
>         joinSets(sites, newsites)
>         pos += 1
>         site = iterator.next()
>     return sites

Try this (untested):

def scrapeSites(startAddress):
    allsites = set()               # all links found so far
    pending = set([startAddress])  # pending sites to examine
    while pending:
        newsites = set()           # links found in this pass
        for site in pending:
            newsites |= scrapeSite(site)
        pending = newsites - allsites
        allsites |= newsites
    return allsites
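As a quick sanity check, the same loop can be exercised with a stubbed-out scrapeSite; the link graph below is invented for illustration, including a cycle back to the start page to show that already-seen urls are not revisited:

```python
# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "http://www.yahoo.com": set(["a", "b"]),
    "a": set(["b", "c"]),
    "b": set(),
    "c": set(["http://www.yahoo.com"]),  # cycle back to the start
}

def scrapeSite(site):
    # Stub: return the outgoing links recorded for this site.
    return LINKS.get(site, set())

def scrapeSites(startAddress):
    allsites = set()               # all links found so far
    pending = set([startAddress])  # pending sites to examine
    while pending:
        newsites = set()           # links found in this pass
        for site in pending:
            newsites |= scrapeSite(site)
        pending = newsites - allsites  # only urls not seen before
        allsites |= newsites
    return allsites

found = scrapeSites("http://www.yahoo.com")
# The cycle does not cause an infinite loop:
# found == set(["a", "b", "c", "http://www.yahoo.com"])
```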

> wtf? im not multithreading or anything so how can the size change here?

You modified the set you were iterating over. Another example of the same problem:

d = {'a': 1, 'b': 2, 'c': 3}
for key in d:
    d[key + key] = 0  # RuntimeError: dictionary changed size during iteration
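If the mutation is intentional, one common fix is to iterate over a snapshot of the keys, so the loop is unaffected by the insertions (a small sketch; in Python 2, d.keys() already returns such a list):

```python
d = {'a': 1, 'b': 2, 'c': 3}
# list(d) takes a snapshot of the current keys, so adding new
# entries inside the loop no longer breaks the iteration.
for key in list(d):
    d[key + key] = 0
# d now also contains 'aa', 'bb' and 'cc', each mapped to 0.
```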

-- 
Gabriel Genellina



