Writing huge Sets() to disk

Mon Jan 10 19:30:21 EST 2005

Martin MOKREJŠ <mmokrejs at ribosome.natur.cuni.cz> writes:
> >>  I have sets.Set() objects having up to 20E20 items,
>   Just imagine you want to compare how many words are in English, German,
> Czech, and Polish dictionaries. You collect the words from every language
> and record them in a dict or Set, as you wish.
> 
>   Once you have those Sets or dicts for those 4 languages, you ask
> for the common words and for those unique to Polish. I have no estimates
> of real-world numbers, but we might be in the range of 1E6 or 1E8?
> I believe in any case, huge.

They'll be less than 1e6 and so not huge by the standards of today's
computers.  You could use ordinary dicts or sets.
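
As a minimal sketch of what that looks like with the built-in set type
(Python 2.4 and later; the sets.Set class supports the same operators),
assuming four made-up toy word lists rather than real dictionaries:

```python
# Hypothetical toy word lists; real ones would be read from dictionary files.
english = {"hotel", "taxi", "radio", "house"}
german = {"hotel", "taxi", "radio", "haus"}
czech = {"hotel", "taxi", "radio", "dům"}
polish = {"hotel", "taxi", "radio", "dom"}

# Words common to all four languages.
common = english & german & czech & polish

# Words found only in Polish.
polish_only = polish - (english | german | czech)

# Words seen in at least one other language but not in Polish.
not_in_polish = (english | german | czech) - polish
```

Note that not_in_polish only needs the union of the words you actually
collected, not a list of every theoretically possible word.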

1e20 is another matter.  I doubt that there are any computers in the
world with that much storage.  How big is your dataset REALLY?
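
A back-of-the-envelope check, assuming an unrealistically small one byte
per item (real words plus any set overhead would need far more):

```python
items = 10**20                     # the 1e20 figure discussed above
bytes_per_item = 1                 # assumption: a deliberately low bound
total_bytes = items * bytes_per_item
exabytes = total_bytes // 10**18   # 1 exabyte = 10**18 bytes
print(exabytes, "exabytes")        # prints: 100 exabytes
```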

>   I wanted to be able to get a list of words NOT found in, say, Polish,
> and therefore wanted to have a list of all theoretically existing words.
> In principle, I can drop this idea of having an ideal, theoretical lexicon.
> But I have to store those real-world dictionaries on the hard drive anyway.

One way you could do it is by dumping all the words sequentially to
disk, then sorting the resulting disk file using an O(n log n) external
(on-disk) algorithm such as merge sort.
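
Here is a rough sketch of that idea as an external merge sort, assuming
one word per line in a UTF-8 text file. The names external_sort,
_write_run, _read_words and chunk_size are made up for this example,
and the duplicate removal during the merge is an addition of mine
that makes the output behave like a set stored on disk:

```python
import heapq
import itertools
import os
import tempfile


def _write_run(words, path):
    """Write one already-sorted chunk of words to path, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(word + "\n" for word in words)


def _read_words(f):
    """Yield the lines of an open file without their trailing newline."""
    for line in f:
        yield line.rstrip("\n")


def external_sort(in_path, out_path, chunk_size=100000):
    """Sort the lines of in_path into out_path, dropping duplicates.

    Only chunk_size lines are held in memory at any one time.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Pass 1: cut the input into sorted runs that each fit in memory.
        run_paths = []
        with open(in_path, encoding="utf-8") as src:
            while True:
                chunk = list(itertools.islice(_read_words(src), chunk_size))
                if not chunk:
                    break
                chunk.sort()
                path = os.path.join(tmp_dir, "run%d.txt" % len(run_paths))
                _write_run(chunk, path)
                run_paths.append(path)

        # Pass 2: k-way merge of the sorted runs, streaming to the output.
        runs = [open(p, encoding="utf-8") for p in run_paths]
        try:
            with open(out_path, "w", encoding="utf-8") as dst:
                previous = None
                for word in heapq.merge(*[_read_words(f) for f in runs]):
                    if word != previous:  # equal words arrive adjacently
                        dst.write(word + "\n")
                        previous = word
        finally:
            for f in runs:
                f.close()
```

This keeps every run file open during the merge, so a very large input
with a small chunk_size could hit the operating system's limit on open
files; merging in several passes would avoid that. Once each language's
word list is sorted this way, common or missing words can be found by
walking two sorted files in step instead of loading either into memory.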

Basically data sets of this size are outside of what you can easily
handle with builtin Python operations without putting some thought
into algorithms and data structures.  From "ribosome" I'm guessing
you're doing computational biology.  If you're going to be writing
code for these kinds of problems on a regular basis, you probably
ought to read a book or two on the topic.  "CLRS" is a good choice:

  http://theory.lcs.mit.edu/~clr/
  http://mitpress.mit.edu/algorithms/