Writing huge Sets() to disk
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Fri Jan 14 11:12:32 EST 2005
Tim Peters wrote:
> [Martin MOKREJŠ]
>
>>...
>>
>>I gave up the theoretical approach. Practically, I might need up
>>to store maybe those 1E15 keys.
>
>
> We should work on our multiplication skills here <wink>. You don't
> have enough disk space to store 1E15 keys. If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys. How much disk space do you actually have? I'm
> betting you have no more than one 250GB disk.
>
> ...
>
> [Istvan Albert]
>
>>>On my system storing 1 million words of length 15
>>>as keys of a python dictionary is around 75MB.
>
>
>>Fine, that's what I wanted to hear. How do you improve the algorithm?
>>Do you delay indexing to the very latest moment or do you let your
>>computer index 999 999 times just for fun?
>
>
> It remains wholly unclear to me what "the algorithm" you want might
> be. As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utiltity.
This comm(1) approach doesn't work for me. It somehow fails to detect
common entries when the offset is too big.
file 1:
A
F
G
I
K
M
N
R
V
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AIK
FGR
FRA
GFG
GIN
GRI
IGI
IGR
IKI
ING
IVF
KIG
MAI
NGF
RAA
RIG
file 2:
W
W
W
W
W
W
W
W
W
W
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
More information about the Python-list
mailing list