Orders of magnitude
Robert Brewer
fumanchu at amor.org
Mon Mar 29 13:57:32 EST 2004
I wrote:
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?
...and got replies:
[PF]
> When you look for dupes, you must hold only the keys
> in memory, not the data (it'll be a lot faster this way).
Yup. The data is handled one line at a time from files; I only retain an
index for certain fields. The 'non-key' data comes and goes quickly.
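The key-only approach described above can be sketched roughly as follows. The tab-separated layout and key-in-first-column assumption are illustrative, not from the original dataset:

```python
# Minimal sketch of a key-only in-memory index: stream records one line
# at a time, keep only the dedup keys in a dict, and let the non-key
# data come and go. Field layout (tab-separated, key in column 0) is an
# assumption for illustration.
def dedup(lines):
    seen = {}
    unique = []
    for line in lines:
        key = line.split("\t", 1)[0]
        if key in seen:
            continue          # duplicate key: skip this record
        seen[key] = None      # retain only the key, not the data
        unique.append(line)
    return unique
```

Holding just the keys keeps the dict's footprint proportional to the number of distinct keys rather than the full record size.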
[Peter Otten]
> You might try a minor change:
>
> for i in x:
>     c[i] = None
Thanks, but I need the test. If i is already in c, I need to delete that
field from the current row.
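The distinction matters because the membership test is what signals a repeat. A hedged sketch of the pattern being described (the `row` and `field` names are illustrative, not from the original code):

```python
# The plain assignment c[i] = None can't replace the test here: only
# the `in` check tells us a key is a repeat, which is the cue to drop
# the duplicated field from the current row.
def index_field(c, value, row, field):
    if value in c:
        del row[field]        # seen before: blank the field in this row
    else:
        c[value] = None       # first sighting: record the key
```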
[Josiah Carlson]
> I would suggest a multi-phase method. Break the sequence up into 100k
> element blocks each and remove duplicates using dictionaries. Sort
> each block individually and write each to a file.
> Merge the 100 files by using a heapq.
I'll look into it. I was trying to do the whole thing in one pass of the
original dataset, but I might be able to handle two or more passes (the
source files have additional info for each record which is not included
in the indices). I don't *think* the source files will mutate
in-between. ;)
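The multi-phase method Josiah describes is a classic external sort-merge. A rough sketch under simplifying assumptions (keys processed as strings, temporary files for the spilled runs, `heapq.merge` standing in for a hand-rolled heap; note it yields keys in sorted order, not first-seen order):

```python
# Sketch of the multi-phase dedup: dedup-and-sort fixed-size blocks in
# memory, spill each sorted run to disk, then stream a k-way merge over
# the runs with heapq.merge, collapsing adjacent duplicates.
import heapq
import itertools
import tempfile

def external_dedup(keys, block_size=100_000):
    runs = []
    it = iter(keys)
    while True:
        # Phase 1: dedup a block with a set, sort it, spill to disk.
        block = sorted(set(itertools.islice(it, block_size)))
        if not block:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(k + "\n" for k in block)
        f.seek(0)
        runs.append(f)
    # Phase 2: k-way merge of the sorted runs; groupby drops the
    # duplicates that straddle block boundaries.
    streams = [(line.rstrip("\n") for line in f) for f in runs]
    out = [k for k, _ in itertools.groupby(heapq.merge(*streams))]
    for f in runs:
        f.close()
    return out
```

Only one block plus the merge buffers need to fit in memory at once, which trades the single-pass dict for bounded RAM at the cost of an extra pass over the data.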
Currently, the dict method takes ~ 2 hours, and the bsddb method
(in-memory btrees) took ~ 18 hours.
Thanks, all!
Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org