very large dictionaries

robin escalation746 at yahoo.com
Thu Jun 17 13:44:16 EDT 2004


Let me specify further. Our actual data is enormous, and tests with
Postgres and MySQL indicate that the time taken just to build a single
index is on the order of hours, which is too long. After analysis we
believe that the important lookup info can be compressed to about 40
million records of 48 bytes each. Furthermore, we have the flexibility
to partition this into four roughly equal parts. Each part will then be
about 460MB in size. Even with structural overhead I see no reason
this couldn't fit into memory. This is our match file.
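
Back-of-the-envelope, using the figures above (MB here meaning 2**20
bytes):

    records = 40000000                        # ~40 million lookup records
    record_bytes = 48
    parts = 4
    per_part = records * record_bytes / parts           # 480,000,000 bytes per part
    print("%.0f MB per part" % (per_part / 2.0 ** 20))  # prints roughly 458 MB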

Our process involves taking an input disk file and traversing it one
record at a time. We can sort the input by the same partition criteria
used to split the match file (so that each part of the match file needs
to be loaded only once). For each input record we build a series of
keys and look them up in the appropriate part of the match file; thus
there are multiple lookups per input record. An algorithm then
determines which match record we want, and we write its ID back to the
input record.
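
Roughly, each pass over one partition of the match file looks like the
sketch below. Every name in it (build_keys, choose_match, the record
layout) is a made-up placeholder, and the lookup structure itself is
the thing I am asking about:

    def match_partition(input_records, lookup, build_keys, choose_match):
        # lookup(key) returns a list of candidate match records;
        # what to use for it is the open question.
        for rec in input_records:
            candidates = []
            for key in build_keys(rec):        # several lookups per input record
                candidates.extend(lookup(key))
            rec['match_id'] = choose_match(rec, candidates)

    # Toy run, just to show the shape of the data flow:
    toy = {'AB123': ['id-1', 'id-2'], 'CD456': ['id-3']}
    recs = [{'f1': 'AB', 'f2': '123'}]
    match_partition(recs,
                    lookup=lambda k: toy.get(k, []),
                    build_keys=lambda r: [r['f1'] + r['f2']],
                    choose_match=lambda r, cands: cands[0] if cands else None)
    print(recs[0]['match_id'])                 # prints: id-1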

There is more to it than this, but these are the major elements. The
input table may be millions of records long, but likely will be much
smaller.

The total processing time will be a sum of:
	time to sort/index input file
	time to traverse input file
	time to load in all parts of the match table
	time to build keys on input records
	time to search match table for each key
	time to write match key ID 
	overhead time of routine
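
For what it's worth, timing each of those pieces separately is easy
enough; something like this sketch (the phase functions named in the
comments are made up, not real code):

    import time

    def timed(label, func, *args):
        # Run func(*args), report elapsed wall-clock time, return the result.
        start = time.time()
        result = func(*args)
        print("%-28s %8.1f s" % (label, time.time() - start))
        return result

    # e.g.
    # sorted_input = timed("sort/index input file", sort_input, input_path)
    # part0        = timed("load match partition 0", load_partition, 0)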

A new wrinkle is that the match table may have duplicate keys, so
storing it as a plain dictionary (one record per key) is not an option.
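
(To illustrate what I mean by the duplicate key problem: a plain dict
silently keeps only the last record stored under each key.)

    pairs = [('AB123', 'id-1'), ('AB123', 'id-2'), ('CD456', 'id-3')]
    d = dict(pairs)
    print("%d pairs -> %d dict entries" % (len(pairs), len(d)))  # 3 pairs -> 2 entries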

The data is alphanumeric.

Assume an arbitrarily powerful computer, since adding a GB of RAM is
not an issue.

My question, for those with experience with large data sets: what data
structure would best handle this problem?

-- robin


