Deduping (was Fake post)

Robert Brewer fumanchu at amor.org
Fri Apr 2 13:00:19 EST 2004


Christian Tismer wrote:
> That's nice to hear! I'm interested to know how you got
> along with removing relationships between records.

With a bit of experimentation, I think the memory issue came down to
having a single dict that was too large; at approximately 5.5 million
items, I believe the dict resized again and ended up thrashing in
virtual memory. The quick solution was simply to create 256
dictionaries, distributing the keys over their first byte. Since this
worked with a dataset of 20 million items, I didn't bother taking the
next step of SHA-digesting the keys.
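
In outline, the sharding looked something like this (a simplified
sketch; the names are illustrative, not my actual code):

        # One sub-dict per possible first byte of a value.
        index = {}
        for byte in xrange(256):
            index[chr(byte)] = {}

        def add(value):
            # Route the value to the sub-dict for its first byte
            # and bump its hit count; each sub-dict stays far
            # smaller than one monolithic dict would.
            subindex = index[value[0]]
            subindex[value] = subindex.get(value, 0) + 1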

Since I had three fields on which to dedupe independently, I ended up
building each index in a separate pass, as you did. Then, once the
full index had been built, I merged the 256 intermediate dicts into a
single new dict, skipping any item which appeared only once (this
wouldn't work for every situation, but it fit my dataset pretty well,
with about 10%-50% duplicates):

        def merge(index):
            # Fold the 256 sub-dicts into one, keeping only values
            # which occurred more than once. popitem() empties each
            # sub-dict as we go.
            product = {}
            for subindex in index.itervalues():
                while subindex:
                    key, hits = subindex.popitem()
                    if hits > 1:
                        product[key] = hits
            return product
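
(A side benefit of popitem() there: each sub-dict empties as the
merge proceeds, so keys which appeared only once can be freed rather
than held in memory alongside the new dict.)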

Incidentally, I also did some validation during that pass, before
inserting items into the index.

Then I made a second pass over the dataset, checking each value (in
the indexed fields) against the appropriate dict of duplicates. If
the value was present, I wrote a blank field, per my client's
request. As I went, I decremented the hit count for each value until
it reached 1: that should be the last occurrence of the value, which
I printed normally.

        # Check for dupes in this field.
        if atom in dupes[fieldname]:
            if dupes[fieldname][atom] <= 1:
                # Last occurrence: drop the entry and
                # print the value normally.
                del dupes[fieldname][atom]
            else:
                # Earlier occurrence: count down and
                # write a blank field instead.
                dupes[fieldname][atom] -= 1
                atom = ''
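
To make the mechanics concrete, here is a tiny self-contained run of
that check (the field name and values are made up for illustration):

        dupes = {'email': {'a@x.org': 3}}
        fieldname = 'email'
        for atom in ['a@x.org', 'b@x.org', 'a@x.org', 'a@x.org']:
            if atom in dupes[fieldname]:
                if dupes[fieldname][atom] <= 1:
                    del dupes[fieldname][atom]
                else:
                    dupes[fieldname][atom] -= 1
                    atom = ''
            print repr(atom)
        # Prints '', 'b@x.org', '', 'a@x.org': every copy of
        # a@x.org but the last comes out blanked.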


The 20-million-record dataset took about four hours in total, most of
which I spent typing at the keyboard and manually examining each
source file to see what order its columns were in. The actual
index-building took only a few minutes.


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org



