Writing huge Sets() to disk

Scott David Daniels Scott.Daniels at Acm.Org
Mon Jan 10 19:11:29 EST 2005


Martin MOKREJŠ wrote:
>.... But I don't think I can use one-way hashes, as I need to reconstruct
> the string later. I have to study hard to get an idea what the proposed 
 > code really does.
> 
> Scott David Daniels wrote:
>> Tim Peters wrote:
>>> .... Call the set of all English words E; G, C, and P similarly.
>>>
>>> This Python expression then gives the set of words common to all 4:
>>>     E & G & C & P
>>> and for those unique to Polish.
>>>     P -  E - G  - C
>>
>> One attack is to define a hash to get sizes smaller.  Suppose you find
>> your word sets are 10**8 large, and you find you need to deal with sets
>> of size 10**6.  Then if you define a hash that produces an integer below
>> 100, you'll find:
>>     E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
>>     P -  E - G  - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
>> where X_k = [v for v in X if somehash(v) == k]
> 
> I don't understand here what would be the result from the hashing 
> function. :(  ... Can you clarify this please in more detail?

The trick is to build the X_k values into files.
For example:

for base in range(0, 100, 10):
    # work in blocks of 10 hash buckets (or whatever fits comfortably)
    for language in ('English', 'German', 'Czech', 'Polish'):
        source = open(language)
        dests = [open('%s_%s.txt' % (language.lower(), base + n), 'w')
                 for n in range(10)]
        for line in source:  # Actually this is probably more involved
            word = line.strip()       # file lines carry their own newline
            code = somehash(word) - base
            if 0 <= code < 10:        # only buckets in the current block
                dests[code].write(word + '\n')
        for dest in dests:
            dest.close()
        source.close()
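
Here somehash can be any function that maps a word to a repeatable
integer in range(100); only the name appears above, so the following is
just one convenient choice (zlib.crc32 and the buckets parameter are my
own picks), chosen because a checksum gives the same bucket for the same
word in every run:

import zlib

def somehash(word, buckets=100):
    # Map a word to a repeatable bucket number in range(buckets).
    # A checksum such as zlib.crc32 (an assumed choice, not named in
    # the post) gives identical buckets across runs, which matters
    # because the four languages are split in separate passes.
    return zlib.crc32(word.encode('utf-8')) % buckets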

After running the splitting loop above, you have 400 files with names
like 'english_33.txt', each holding only the words of one language that
hash to one value (here: the English words hashing to 33).  You then
need to sort each of those files before the next phase.  If your sort
can eliminate duplicates in the same step, do that and drop the calls
to dupelim below; otherwise keep dupelim when reading the sorted files.
If no external sort is handy, you can do the sort and dupe-elimination
in Python:

for language in ('english', 'german', 'czech', 'polish'):
    for hashcode in range(100):
        filename = '%s_%s.txt' % (language, hashcode)
        source = open(filename)
        lines = sorted(source)      # each bucket file fits in memory
        source.close()
        dest = open(filename, 'w')  # rewrite the same file, now sorted
        for line in dupelim(lines):
            dest.write(line)
        dest.close()
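
dupelim just needs to drop consecutive duplicates from an already-sorted
iterable, so it works on a sorted list of lines or on an open, sorted
file.  One way to write it (a sketch, not necessarily the version from
the earlier message):

def dupelim(lines):
    # Yield each line of an already-sorted iterable exactly once,
    # skipping consecutive duplicates.
    previous = None
    for line in lines:
        if line != previous:
            yield line
            previous = line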

 >> def inboth(left, right): ....
 >> def leftonly(left, right): ....
> Aaaah, so the functions just walk one line in left, one line in right;
> if the values don't match, the value in left is unique, so it walks one
> more line in left and checks whether it now matches the last value read
> from the right file, and so on until it finds the same value in the right file?

Exactly -- you must make sure you handle the cases where you pass values,
match values, and run out of either source; that bookkeeping is what
keeps these from being four-liners.
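
One way to write such merge generators, assuming both arguments are
sorted, duplicate-free iterables of lines (a sketch along the lines
described above, not necessarily the elided definitions):

def leftonly(left, right):
    # Yield the lines of `left` that never occur in `right`.
    # Both inputs must be sorted and free of duplicates.
    right = iter(right)
    r = next(right, None)
    for item in left:
        while r is not None and r < item:
            r = next(right, None)   # skip right values we have passed
        if r is None or r > item:
            yield item              # nothing in right matches item

def inboth(left, right):
    # Yield the lines that occur in both sorted, duplicate-free inputs.
    right = iter(right)
    r = next(right, None)
    for item in left:
        while r is not None and r < item:
            r = next(right, None)   # advance right up to item
        if r == item:
            yield item

Chaining them works because each generator again produces sorted,
duplicate-free output, so a result can feed straight into another
leftonly or inboth call.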

 >> For example:
 >>
 >>     Ponly = open('polishonly.txt', 'w')
 >>     every = open('every.txt', 'w')
 >>     for hashcode in range(100):
 >>        English = open('english_%s.txt' % hashcode)
 > ^^^^^^^^^^^^^^^^^^^^^^^^^^ this is some kind of eval?
No -- it is just string formatting.  If hashcode == 33 then
'english_%s.txt' % hashcode == 'english_33.txt', so we are just opening
the source file for that language and that hash code.

 >>...     for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
 >>                 dupelim(English)), dupelim(German)), dupelim(Czech)):
 >>            Ponly.write(unmatched)
 >>         for source in (Polish, English, German, Czech):#(paraphrased)
 >>             source.seek(0)
 >>         for match in inboth(inboth(dupelim(Polish), dupelim(English)),
 >>                           inboth(dupelim(German), dupelim(Czech))):
 >>             every.write(match)
 >>         for source in (Polish, English, German, Czech):#(paraphrased)
 >>             source.close()
 >>     Ponly.close()
 >>     every.close()
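
Spelling out the paraphrased and elided parts (the German, Czech, and
Polish opens just follow the english_%s.txt pattern), the whole driving
loop ends up roughly like this, using the somehash, dupelim, leftonly,
and inboth sketches above:

Ponly = open('polishonly.txt', 'w')
every = open('every.txt', 'w')
for hashcode in range(100):
    Polish  = open('polish_%s.txt'  % hashcode)
    English = open('english_%s.txt' % hashcode)
    German  = open('german_%s.txt'  % hashcode)
    Czech   = open('czech_%s.txt'   % hashcode)
    # Words in Polish but in none of the other three languages.
    for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                dupelim(English)), dupelim(German)), dupelim(Czech)):
        Ponly.write(unmatched)
    # Rewind so the same bucket files can be read a second time.
    for source in (Polish, English, German, Czech):
        source.seek(0)
    # Words common to all four languages.
    for match in inboth(inboth(dupelim(Polish), dupelim(English)),
                        inboth(dupelim(German), dupelim(Czech))):
        every.write(match)
    for source in (Polish, English, German, Czech):
        source.close()
Ponly.close()
every.close()

Since each of the 100 bucket files holds only the words with one hash
value, every merge above works on a small sorted slice of the data, so
memory use stays bounded no matter how large the original word lists are.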

--Scott David Daniels
Scott.Daniels at Acm.Org


