Writing huge Sets() to disk

Scott David Daniels Scott.Daniels at Acm.Org
Mon Jan 10 16:35:42 EST 2005


Tim Peters wrote:
> [Martin MOKREJŠ]
>> just imagine, you want to compare how many words are in the English,
>> German, Czech, and Polish dictionaries.  You collect words from every
>> language and record them in a dict or set, as you wish.
> 
> Call the set of all English words E; G, C, and P similarly.
> 
>> Once you have those Set's or dict's for those 4 languages, you ask
>>for common words
> 
> This Python expression then gives the set of words common to all 4:
>     E & G & C & P
>>and for those unique to Polish.
>     P - E - G - C

One attack is to define a hash that cuts the problem down to size.
Suppose your word sets hold around 10**8 entries, but you can only
afford to work on sets of around 10**6.  If you define a hash that
maps each word to an integer below 100, you'll find:
     E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
     P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
where X_k = set(v for v in X if somehash(v) == k)
This means you can separate the calculation into much smaller buckets,
and combine the buckets back at the end (the combining step needs no
comparisons, since words in different buckets can never be equal).
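Concretely, here is a toy sketch of that identity.  The `somehash`
function and the four tiny vocabularies are invented for illustration;
any stable function mapping a word into range(100) works:

```python
import zlib

NBUCKETS = 100

def somehash(word):
    # Any stable hash into range(NBUCKETS) will do; crc32 is one choice.
    return zlib.crc32(word.encode()) % NBUCKETS

def bucket(words, k):
    # X_k from the text: the elements of X that hash to bucket k.
    return {w for w in words if somehash(w) == k}

E = {'hand', 'list', 'water'}
G = {'hand', 'list', 'wasser'}
C = {'list', 'ruka', 'voda'}
P = {'list', 'reka', 'woda'}

# Combine the per-bucket intersections; no cross-bucket comparisons
# are ever needed, because equal words always land in the same bucket.
common = set()
for k in range(NBUCKETS):
    common |= bucket(E, k) & bucket(G, k) & bucket(C, k) & bucket(P, k)

assert common == E & G & C & P  # same answer, computed in small pieces
```

Each bucket is about a hundredth the size of the full set, so each
per-bucket intersection fits comfortably in memory.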

To calculate these two expressions, you can dump the words out to one
file per hash per language, then sort and duplicate-eliminate the
contents of each file (or dupe-elim on the way out).  Then, hash by
hash, you can compute the partial results by combining iterators like
inboth and leftonly below, fed by iterators producing words from the files.
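The dump-and-sort step might look like the sketch below.  The
`language_k.txt` naming matches the example at the end of this post;
`bucket_of` and the in-memory `sorted()` call are illustrative (truly
huge inputs would use an external sort instead):

```python
import zlib

NBUCKETS = 100

def bucket_of(word):
    # Illustrative stable hash into range(NBUCKETS).
    return zlib.crc32(word.encode()) % NBUCKETS

def split_into_buckets(words, language):
    '''Write each word of one language to its bucket file, in order.'''
    outputs = [open('%s_%s.txt' % (language, k), 'w')
               for k in range(NBUCKETS)]
    # Sorting once up front leaves every bucket file sorted as well,
    # since filtering by bucket preserves the overall order.
    for word in sorted(words):
        outputs[bucket_of(word)].write(word + '\n')
    for f in outputs:
        f.close()
```

After this runs for each language, every `language_k.txt` is sorted,
and any duplicates sit on adjacent lines where dupelim can drop them.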


     def dupelim(iterable):
         '''Take an ordered iterable and drop adjacent duplicates.'''
         source = iter(iterable)
         try:
             former = next(source)
         except StopIteration:  # Empty input: nothing to yield.
             return
         yield former
         for element in source:
             if element != former:
                 yield element
                 former = element

     def inboth(left, right):
         '''Take ordered iterables and yield their common elements.'''
         lsource = iter(left)
         try:
             lhead = next(lsource)
         except StopIteration:  # Empty left side: no matches.
             return
         for rhead in right:
             try:
                 while lhead < rhead:
                     lhead = next(lsource)
             except StopIteration:  # Ran out of left side.
                 return
             if lhead == rhead:
                 yield lhead
                 try:
                     lhead = next(lsource)
                 except StopIteration:
                     return

     def leftonly(left, right):
         '''Take ordered iterables and yield elements only in left.'''
         lsource = iter(left)
         rsource = iter(right)
         try:
             rhead = next(rsource)
         except StopIteration:  # Empty right side: everything passes.
             yield from lsource
             return
         for lhead in lsource:
             try:
                 while rhead < lhead:
                     rhead = next(rsource)
             except StopIteration:  # Ran out of right side.
                 yield lhead
                 yield from lsource
                 return
             if lhead < rhead:
                 yield lhead

> Real-world dictionaries shouldn't be a problem.  I recommend you store
> each as a plain text file, one word per line.  Then, e.g., to convert
> that into a set of words, do
> 
>     f = open('EnglishWords.txt')
>     set_of_English_words = set(f)
>     f.close()
> 
> You'll have a trailing newline character in each word, but that
> doesn't really matter.
> 
> Note that if you sort the word-per-line text files first, the Unix
> `comm` utility can be used to perform intersection and difference on a
> pair at a time with virtually no memory burden (and regardless of file
> size).
In fact, once you've sorted these files, you can use the iterators
above to combine those sorted files.

For example:

     Ponly = open('polishonly.txt', 'w')
     every = open('every.txt', 'w')
     for hashcode in range(100):
         English = open('english_%s.txt' % hashcode)
         German = open('german_%s.txt' % hashcode)
         Czech = open('czech_%s.txt' % hashcode)
         Polish = open('polish_%s.txt' % hashcode)
         for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                 dupelim(English)), dupelim(German)), dupelim(Czech)):
             Ponly.write(unmatched)
         English.seek(0)
         German.seek(0)
         Czech.seek(0)
         Polish.seek(0)
         for matched in inboth(inboth(dupelim(Polish), dupelim(English)),
                               inboth(dupelim(German), dupelim(Czech))):
             every.write(matched)
         English.close()
         German.close()
         Czech.close()
         Polish.close()
     Ponly.close()
     every.close()
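As an aside, current Pythons can express the same sorted-stream
combination with standard library tools: heapq.merge lazily merges
sorted iterables with no memory burden, and itertools.groupby collects
equal words so you can see which languages supplied them.  A sketch
(the toy word lists are invented; each input must be sorted and
dupe-free per language):

```python
import heapq
from itertools import groupby

def tag(words, name):
    # Pair each word with the language it came from.
    for word in words:
        yield word, name

def classify(**streams):
    '''Yield (word, set of languages containing it) in sorted order.

    Each keyword argument is a sorted, duplicate-free word iterable.
    '''
    tagged = [tag(words, name) for name, words in streams.items()]
    merged = heapq.merge(*tagged)  # lazy n-way merge, ordered by word
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word, {name for _, name in group}

# Invented toy vocabularies, one sorted list per language.
english = ['hand', 'list', 'water']
german = ['hand', 'list', 'wasser']
czech = ['list', 'ruka', 'voda']
polish = ['list', 'reka', 'woda']

every = []    # words common to all four languages
ponly = []    # words appearing only in Polish
for word, sources in classify(english=english, german=german,
                              czech=czech, polish=polish):
    if len(sources) == 4:
        every.append(word)
    elif sources == {'polish'}:
        ponly.append(word)
```

Since heapq.merge holds only one pending word per stream, this works
just as well when tag() is fed open file objects instead of lists.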


--Scott David Daniels
Scott.Daniels at Acm.Org


