Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Fri Jan 14 11:12:32 EST 2005


Tim Peters wrote:
> [Martin MOKREJŠ]
> 
>>...
>>
>>I gave up on the theoretical approach. Practically, I might need
>>to store up to maybe 1E15 keys.
> 
> 
> We should work on our multiplication skills here <wink>.  You don't
> have enough disk space to store 1E15 keys.  If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys.  How much disk space do you actually have?  I'm
> betting you have no more than one 250GB disk.
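
(Tim's arithmetic checks out. A quick throwaway check in Python, using
his deliberately optimistic one-byte-per-key figure:)

keys = 1e15          # the hypothetical key count above
bytes_per_key = 1    # optimistic assumption: no per-key overhead at all
disk = 250e9         # one 250 GB disk, in bytes
print(keys * bytes_per_key / disk)    # -> 4000.0 disks
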
> 
> ...
> 
> [Istvan Albert]
> 
>>>On my system storing 1 million words of length 15
>>>as keys of a python dictionary is around 75MB.
> 
> 
>>Fine, that's what I wanted to hear. How do you improve the algorithm?
>>Do you delay indexing until the very last moment, or do you let your
>>computer re-index 999 999 times just for fun?
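
(Istvan's 75MB figure is easy to check approximately. A minimal sketch,
assuming one million distinct 15-character keys; note that ru_maxrss is
reported in kilobytes on Linux but in bytes on some other systems:)

import resource

# build a dict with one million distinct 15-character string keys
d = dict.fromkeys('%015d' % i for i in range(10**6))

# peak resident set size of this process; kilobytes on Linux
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
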
> 
> 
> It remains wholly unclear to me what "the algorithm" you want might
> be.  As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utility.
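
(For the record, comm(1) works by a single merge pass over both files.
A minimal Python sketch of the same idea, assuming both inputs are
sorted in the same lexicographic order:)

def common_lines(path_a, path_b):
    # one linear merge pass, as comm(1) does; correct only when both
    # files are sorted in the same lexicographic order
    with open(path_a) as fa, open(path_b) as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a < b:
                a = fa.readline()      # line unique to the first file
            elif a > b:
                b = fb.readline()      # line unique to the second file
            else:
                yield a.rstrip('\n')   # common line
                a, b = fa.readline(), fb.readline()

On the shell side, comm -12 prints the common lines, comm -23 the lines
unique to the first file, and comm -13 those unique to the second.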

This comm(1) approach doesn't work for me. It fails to detect the
common entries once the matching lines sit too far apart in the two
files (see the note after the two listings below).

file 1:

A
F
G
I
K
M
N
R
V
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AIK
FGR
FRA
GFG
GIN
GRI
IGI
IGR
IKI
ING
IVF
KIG
MAI
NGF
RAA
RIG


file 2:

W
W
W
W
W
W
W
W
W
W
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
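

The failure above is actually expected: comm(1) requires both inputs to
be sorted in the same lexicographic order, but these files are sorted
by key length first, so comm's single merge pass walks off the end of
file 1 before it ever reaches the matching lines in file 2. Running
both files through sort(1) first (LC_ALL=C sort for plain byte order)
makes comm -12 report the 17 two-letter keys the files share. In
Python, a set intersection sidesteps the ordering requirement entirely;
a minimal sketch, with hypothetical file names:

def keys_of(path):
    # hypothetical file names below; a set needs no particular order
    with open(path) as f:
        return set(line.rstrip('\n') for line in f)

common = sorted(keys_of('file1.txt') & keys_of('file2.txt'))
print('\n'.join(common))   # the 17 two-letter keys shared above

For files too big to hold in memory, sorting them on disk first and
reusing the merge pass sketched earlier keeps memory use flat.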



