Writing huge Sets() to disk

Paul McGuire ptmcg at austin.rr._bogus_.com
Mon Jan 10 12:00:18 EST 2005


"Martin MOKREJ©" <mmokrejs at ribosome.natur.cuni.cz> wrote in message
news:mailman.448.1105373471.22381.python-list at python.org...
> Hi,
>   I have sets.Set() objects having up to 20E20 items,
> each is composed of up to 20 characters. Keeping
> them in memory on !GB machine put's me quickly into swap.
> I don't want to use dictionary approach, as I don't see a sense
> to store None as a value. The items in a set are unique.
>
>   How can I write them efficiently to disk? To be more exact,
> I have 20 sets. _set1 has 1E20 keys of size 1 character.
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
> for aa1 in alphabet:
>     # l = [aa1]
>     #_set1.add(aa1)
>     for aa2 in alphabet:
>         # l.append(aa2)
>         #_set2.add(''.join(l))
> [cut]
>
>   The reason I went for sets instead of lists is the speed,
> availability of unique, common and other methods.
> What would you propose as an elegant solution?
> Actually, even those nested for loops take ages. :(
> M.

_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 ~ 10 ^ (1.301*20) or 1E26

If you have no duplicates in your alphabet, then you wont need to use sets,
every combination will be unique.  In this case, just use a generator.

Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(),depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield "".join( thisRoot )
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever



-- Paul





More information about the Python-list mailing list