searching a value of a dict (each value is a list)

MonkeeSage MonkeeSage at gmail.com
Fri Dec 14 13:32:55 EST 2007


On Dec 10, 1:28 pm, bearophileH... at lycos.com wrote:
> Seongsu Lee:
>
> >I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers.<
>
> Let's say each integer can be represented with 32 bits (if there are
> less numbers then a 3-byte representation may suffice, but this makes
> things more complex), that is 2^2 bytes. Let's say there are 2^20 keys
> each one associated to 2^10 values. So to represent the values you
> need 2^32 bytes. It means 4 GB, so I don't think Python suffices to
> store them in RAM, because a Python int object requires quite more
> than 4 bytes (only represented inside an array.array it may need just
> 4 bytes).
>
> So if you can use 128 MB RAM to store such data structure you need to
> store data on HD too. You probably can use a lower-level language. On
> disk you can keep the reverse index, represented as an array of
> records/structs, each of such structures keep two 32-bit numbers (so
> such array is 8 GB). Such index is sorted according to the first
> element of the struct. The first number is the value of the original
> dictionary and the second nuber is its key. Inside the RAM you can
> keep another sorted array that "summarizes" your whole data. When you
> need a number you can do a binary search on the array in RAM, such
> array gives you the position where you can read (with a seek) a little
> part of the file (512 bytes may suffice), to perform a little binary
> search (if the block is very little a linear scan suffices) on it too
> to find the number you need. Note that the summarizing data structure
> in RAM may be represented with just a Python dict too, so in the end
> you can use Python to solve this problem. You may need a lower-level
> language to create the 8 GB file on disk (or create it with Python,
> but it may take lot of time. You may sort it with the sort unix
> command).
>
> This isn't a complete solution, but I think it may work.
>
> Bye,
> bearophile

Nice. :)

Regards,
Jordan



More information about the Python-list mailing list