Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 18:05:02 EST 2005


Dear Scott,
  thank you for your excellent email. I also thought about
using some zip-style compression to shrink the strings first before
using them as keys in a dict.

But I don't think I can use one-way hashes, as I need to be able to
reconstruct the original strings later. I will have to study the proposed
code carefully to understand what it really does.
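
For the record, what I had in mind was something like this (only a sketch;
I am not sure the compression overhead pays off for short words, but unlike
a one-way hash the compressed key can be turned back into the original
string):

    import zlib

    words = ['abcdefghijklmnopqrst', 'aaaaabbbbbcccccddddd']  # made-up data

    # Use the compressed form as the dict key; the original string can
    # always be recovered with zlib.decompress(), so nothing is lost.
    d = {}
    for w in words:
        d[zlib.compress(w)] = True

    for key in d:
        print zlib.decompress(key)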

Scott David Daniels wrote:
> Tim Peters wrote:
> 
>> [Martin MOKREJŠ]
>>
>>> just imagine, you want to compare how many words are in the English, German,
>>> Czech, and Polish dictionaries. You collect words from every language and
>>> record them in a dict or Set, as you wish.
>>
>>
>> Call the set of all English words E; G, C, and P similarly.
>>
>>> Once you have those Sets or dicts for those 4 languages, you ask
>>> for common words
>>
>>
>> This Python expression then gives the set of words common to all 4:
>>     E & G & C & P
>>
>>> and for those unique to Polish.
>>
>>     P - E - G - C
> 
> 
> One attack is to define a hash to get the working sets smaller.  Suppose
> your word sets are about 10**8 words each, but you can only comfortably
> work with sets of about 10**6 words.  Then if you define a hash that maps
> every word to an integer below 100 (so each bucket holds about 10**6
> words), you'll find:
>     E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
>     P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
> where X_k = [v for v in X if somehash(v) == k]

I don't understand here what the result of the hashing function would be. :(
Could you please explain this in more detail? Is P the largest dictionary
in this example, or was it shifted by one to the right?
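
If I guess right, somehash() would simply map every word to a small bucket
number (0..99 here), and the same word always falls into the same bucket in
every language, so each bucket can be processed on its own. A minimal sketch
of what I think is meant (the concrete hash is my own assumption):

    def somehash(word, nbuckets=100):
        # Any deterministic function works; the only requirement is that
        # the same word gets the same bucket number in every language.
        return hash(word) % nbuckets

    def split_into_buckets(words, nbuckets=100):
        buckets = [[] for k in range(nbuckets)]
        for w in words:
            buckets[somehash(w, nbuckets)].append(w)
        return buckets

    # A word common to all four languages lands in the same bucket k
    # everywhere, so E & G & C & P is just the union of the 100 much
    # smaller per-bucket intersections E_k & G_k & C_k & P_k.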

> This means, you can separate the calculation into much smaller buckets,
> and combine the buckets back at the end (without any comparison on the
> combining operations).

Yes.

> 
> For calculating these two expressions, you can dump values out to a
> file per hash per language, sort and dupe-elim the contents of the
> various files (maybe dupe-elim on the way out).  Then hash-by-hash,
> you can calculate parts of the results by combining iterators like
> inboth and leftonly below on iterators producing words from files.
> 
> 
>     def dupelim(iterable):
>         source = iter(iterable)
>         former = source.next()  # Raises StopIteration if empty
>         yield former
>         for element in source:
>             if element != former:
>                 yield element
>                 former = element
> 
>     def inboth(left, right):
>         '''Take ordered iterables and yield matching cases.'''
>         lsource = iter(left)
>         lhead = lsource.next()  # May StopIteration (and we're done)
>         rsource = iter(right)
>         for rhead in rsource:
>             while lhead < rhead:
>                 lhead = lsource.next()
>             if lhead == rhead:
>                 yield lhead
>                 lhead = lsource.next()
> 
>     def leftonly(left, right):
>         '''Take ordered iterables and yield the items found only in left.'''
>         lsource = iter(left)
>         rsource = iter(right)
>         try:
>             rhead = rsource.next()
>         except StopIteration: # empty right side.
>             for lhead in lsource:
>                 yield lhead
>         else:
>             for lhead in lsource:
>                 try:
>                     while rhead < lhead:
>                         rhead = rsource.next()
>                     if lhead < rhead:
>                         yield lhead
>                 except StopIteration: # Ran out of right side.
>                     yield lhead
>                     for lhead in lsource:
>                         yield lhead
>                     break

Aaaah, so the functions just walk one line at a time through left and right:
if the values don't match, the value from left is unique, so they advance one
line in left and check it against the value last read from the right file,
and so on until they find the same value in the right file?

So the file on the right isn't scanned n times, but only once?
That is a nice property. Then I really don't need an index, and now I really
believe flat files will do just fine.
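
For example, with two small sorted lists (just to convince myself that the
generators above do what I think they do):

    left  = ['apple', 'banana', 'cherry', 'cherry', 'damson']
    right = ['banana', 'damson', 'elderberry']

    print list(inboth(dupelim(left), dupelim(right)))
    # -> ['banana', 'damson']
    print list(leftonly(dupelim(left), dupelim(right)))
    # -> ['apple', 'cherry']

With the real word files every item would just carry its trailing newline,
which should not change the comparisons.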

> 
>> Real-word dictionaries shouldn't be a problem.  I recommend you store
>> each as a plain text file, one word per line.  Then, e.g., to convert
>> that into a set of words, do
>>
>>     f = open('EnglishWords.txt')
>>     set_of_English_words = set(f)
>>     f.close()
>>
>> You'll have a trailing newline character in each word, but that
>> doesn't really matter.
>>
>> Note that if you sort the word-per-line text files first, the Unix
>> `comm` utility can be used to perform intersection and difference on a
>> pair at a time with virtually no memory burden (and regardless of file
>> size).
> 
> In fact, once you've sorted these files, you can use the iterators
> above to combine those sorted files.
> 
> For example:
> 
>     Ponly = open('polishonly.txt', 'w')
>     every = open('every.txt', 'w')
>     for hashcode in range(100):
>         English = open('english_%s.txt' % hashcode)

--------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^ this is some kind of eval?
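
(Meanwhile I think I see it: this looks like ordinary %-string formatting
rather than any kind of eval, so for hashcode == 7 the expression just builds
the filename 'english_7.txt'. Please correct me if I am wrong.)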

>         German = open('german_%s.txt' % hashcode)
>         Czech = open('czech_%s.txt' % hashcode)
>         Polish = open('polish_%s.txt' % hashcode)
>         for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
>                 dupelim(English)), dupelim(German)), dupelim(Czech)):
>             Ponly.write(unmatched)
>         English.seek(0)
>         German.seek(0)
>         Czech.seek(0)
>         Polish.seek(0)
>         for matched in inboth(inboth(dupelim(Polish), dupelim(English)),
>                               inboth(dupelim(German), dupelim(Czech))):
>             every.write(matched)
>         English.close()
>         German.close()
>         Czech.close()
>         Polish.close()
>     Ponly.close()
>     every.close()
> 
> 
> --Scott David Daniels
> Scott.Daniels at Acm.Org
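
So if I understand the whole scheme, before running this loop I first have to
split every language's word list into those 100 per-hash files and sort them,
since the generators need sorted input. Roughly like this (my own sketch, with
made-up input filenames; the bucketing is the same hash-modulo-100 idea as
above):

    def write_buckets(language, wordfile, nbuckets=100):
        # Split one word-per-line file into per-hash files such as
        # 'english_0.txt' ... 'english_99.txt'; each bucket file still
        # has to be sorted afterwards (e.g. with the Unix sort utility).
        outputs = [open('%s_%s.txt' % (language, k), 'w')
                   for k in range(nbuckets)]
        for line in open(wordfile):
            word = line.strip()
            outputs[hash(word) % nbuckets].write(word + '\n')
        for f in outputs:
            f.close()

    write_buckets('english', 'EnglishWords.txt')
    write_buckets('german', 'GermanWords.txt')
    write_buckets('czech', 'CzechWords.txt')
    write_buckets('polish', 'PolishWords.txt')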



