Orders of magnitude

Robert Brewer fumanchu at amor.org
Mon Mar 29 13:57:32 EST 2004


I wrote:

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
MemoryErrors (Win2k, 512 MB RAM, 4 GB virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?

...and got replies:

[PF]
> When you look for dupes, you must hold only the keys
> in memory, not the data (it'll be a lot faster this way).

Yup. The data is handled one line at a time from files; I only retain an
index for certain fields. The 'non-key' data comes and goes quickly.
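In outline, the streaming loop looks something like this (a simplified
sketch; the real delimiter and field positions differ):

    # Sketch only: keep just the keys in memory, never whole rows.
    # KEY_FIELD and the tab delimiter are stand-ins for the real layout.
    KEY_FIELD = 0

    index = {}
    for line in open('records.txt'):
        fields = line.rstrip('\n').split('\t')
        index[fields[KEY_FIELD]] = None
        # (the real loop also tests membership first; see below)
        # ...the rest of the row is written out and then forgotten...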

[Peter Otten]
> You might try a minor change:
> 
> for i in x:
>     c[i] = None

Thanks, but I need the test. If i is already in the dict, I need to
delete that field from the current row.
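The test-and-blank step is roughly this (a sketch; 'row' is the list of
fields for the current record and FIELD is the position being deduped,
both made up for illustration):

    # Sketch: the membership test drives a side effect on the current row.
    if row[FIELD] in index:
        row[FIELD] = ''           # duplicate: blank (or drop) the field
    else:
        index[row[FIELD]] = None  # first occurrence: remember the key

so collapsing it to a bare assignment would lose the branch I need.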

[Josiah Carlson]
> I would suggest a multi-phase method.  Break the sequence up into 100k
> element blocks each and remove duplicates using dictionaries.  Sort each
> block individually and write each to a file.
> Merge the 100 files by using a heapq.

I'll look into it. I was trying to do the whole thing in one pass of the
original dataset, but I might be able to handle two or more passes (the
source files have additional info for each record which is not included
in the indices). I don't *think* the source files will mutate
in-between. ;)
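
For the record, here is roughly how I read the suggestion (a sketch
only, assuming one key per line in each block file; sorted() is Python
2.4, so on 2.3 it would be keys = block.keys(); keys.sort()):

    import heapq

    def write_sorted_blocks(keys, block_size=100000, prefix='block'):
        # Phase 1: de-dup fixed-size blocks with a dict, sort each block,
        # and write it to its own file.
        filenames = []
        block = {}
        for key in keys:
            block[key] = None
            if len(block) >= block_size:
                filenames.append(dump_block(block, prefix, len(filenames)))
                block = {}
        if block:
            filenames.append(dump_block(block, prefix, len(filenames)))
        return filenames

    def dump_block(block, prefix, n):
        # Write the sorted keys of one block, one key per line.
        name = '%s%04d.txt' % (prefix, n)
        out = open(name, 'w')
        for key in sorted(block):
            out.write(key + '\n')
        out.close()
        return name

    def merge_blocks(filenames, outname):
        # Phase 2: merge the sorted block files on a heap of
        # (line, file-number) pairs, dropping duplicates that span blocks.
        files = [open(name) for name in filenames]
        heap = []
        for i, f in enumerate(files):
            line = f.readline()
            if line:
                heapq.heappush(heap, (line, i))
        out = open(outname, 'w')
        last = None
        while heap:
            line, i = heapq.heappop(heap)
            if line != last:
                out.write(line)
                last = line
            nxt = files[i].readline()
            if nxt:
                heapq.heappush(heap, (nxt, i))
        out.close()
        for f in files:
            f.close()

Memory then never holds more than one 100k-key dict at a time, which
should dodge the MemoryErrors.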

Currently, the dict method takes ~2 hours; the bsddb run (in-memory
btrees) took ~18 hours.

Thanks, all!


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org



