Orders of magnitude
Robert Brewer
fumanchu at amor.org
Tue Mar 30 02:29:34 EST 2004
Christian Tismer wrote:
> >> Robert Brewer wrote:
> >>
> >>> I'm dedup'ing a 10-million-record dataset, trying different
> >>> approaches for building indexes. The in-memory dicts are clearly
> >>> faster, but I get Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any
> > But I don't see how your solution avoids the problem, however
> > (some code snipped):
> >
> > def process_duplicates(self):
> >     for i in range(len(self.files)):
> >         seen = {}
> >         has = seen.has_key
> >         for k in range(self.flushes):
> >             bin = pickle.load(self.files[i])
> >             for key, value in bin:
> >                 if has(key):
> >                     self.duplicates += 1
> >                 else:
> >                     seen[key] = value
>
> Just try it, it works!
>
> [discussion of hashing snipped]
Thanks again. Really, thanks!
However, I think you're letting the arrow fly before you've seen the
target. I understand hashing and making a bucket for each possible first
character (I should have read your code more thoroughly). That is a fine
technique, and it was already on my short list of workarounds. However,
your _complete_ solution isn't feasible for my dataset, because I'm
comparing some fields within each record, not entire records. For
example, with rows like:
"Michael","Kemperson","26002","me at aol.com","64.12.104.18"
"Ryan","Thomas","46723","myfriend at yahoo.com","209.244.21.23"
"Vicki","Thomas","46723","myfriend at yahoo.com","209.244.21.23"
"Shanuana","Hedlund","10415","andi at yahoo.com","122.150.184.175"
...I need to keep Ryan and dump Vicki, based on shared email and IP, but
*not* on the other fields. If Vicki had the same email but a different
IP, don't drop that record, but blank out the 2nd phone number. In
addition, most but not all files have phone numbers, which get the same
treatment. That's why I said I have three dicts/indexes.
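For the archives, here's a minimal sketch of the rule described above -- dedupe on (email, IP) only, keeping the first record seen. The column positions and the `dedupe` helper are my own assumptions for illustration, not my actual indexing code:

```python
# Hypothetical sketch: dedupe on (email, IP) only, keep the first
# record seen. Column positions are guessed from the sample rows.

rows = [
    ["Michael", "Kemperson", "26002", "me at aol.com", "64.12.104.18"],
    ["Ryan", "Thomas", "46723", "myfriend at yahoo.com", "209.244.21.23"],
    ["Vicki", "Thomas", "46723", "myfriend at yahoo.com", "209.244.21.23"],
    ["Shanuana", "Hedlund", "10415", "andi at yahoo.com", "122.150.184.175"],
]

def dedupe(rows, email_col=3, ip_col=4):
    seen = {}
    kept = []
    for row in rows:
        key = (row[email_col], row[ip_col])
        if key in seen:
            continue          # Vicki shares Ryan's email and IP: dropped
        seen[key] = row
        kept.append(row)
    return kept

kept = dedupe(rows)
# Ryan survives, Vicki is dropped; the other rows are untouched.
```

The point is that the dict key is built from a *subset* of the fields, so whole-record hashing doesn't apply.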
No, don't rewrite it! :D I have even *more* requirements, which are
mostly in my code but not really documented, and some are changing. I'm
working on a more scalable solution, since I then need to apply the same
code to a 20-million-record DB. I'm doing fine--just wondered in
particular about ways to optimize that single step. Thanks for your
help.
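For anyone following along, the bucket-and-flush idea under discussion might be sketched like this -- partition keys into one file per first character so that each bucket's dict fits in memory, then dedupe bucket by bucket. The file layout, names, and flush size here are my assumptions, not the code from the thread:

```python
import os
import pickle

# Sketch of external dedup via per-first-character buckets.
# write_buckets spills (key, value) pairs to disk in pickled bins;
# dedupe_buckets then loads one bucket's dict at a time.

def write_buckets(items, workdir, flush_size=100000):
    """Append (key, value) pairs to per-first-character pickle files."""
    files = {}
    bins = {}
    for key, value in items:
        first = key[:1] or "_"
        bins.setdefault(first, []).append((key, value))
        if len(bins[first]) >= flush_size:
            _flush(bins, first, files, workdir)
    for first in list(bins):
        _flush(bins, first, files, workdir)
    for f in files.values():
        f.close()
    return sorted(files)

def _flush(bins, first, files, workdir):
    if first not in files:
        path = os.path.join(workdir, "bucket_%d" % ord(first))
        files[first] = open(path, "ab")
    pickle.dump(bins.pop(first), files[first])

def dedupe_buckets(firsts, workdir):
    """Count duplicates, holding only one bucket's dict in memory."""
    duplicates = 0
    for first in firsts:
        seen = {}
        path = os.path.join(workdir, "bucket_%d" % ord(first))
        with open(path, "rb") as f:
            while True:
                try:
                    bin = pickle.load(f)
                except EOFError:
                    break
                for key, value in bin:
                    if key in seen:
                        duplicates += 1
                    else:
                        seen[key] = value
    return duplicates
```

Peak memory is bounded by the largest single bucket instead of the whole key set, which is what makes this attractive for the 20-million-record case.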
Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org