[Tutor] removal of duplicates from .csv files

Rob Andrews randrews@planhouse.com
Thu, 25 Jan 2001 13:52:43 -0600


I have been given several comma-delimited (.csv) files, each containing as
many as several thousand lines of entries.  Among the tasks I've been
charged with is to remove duplicate entries.  The files each contain fields
for Contact Name, Company Name, Phone Number, and Address, among other
fields, which vary from file to file.

I'm trying to determine a good way to find duplicates according to Phone
Number and according to Address.  Matching on Phone Number seems simpler,
since data entry clerks may have typed the same address in slightly
different ways (W, W., and West, for instance), but not all entries have
phone numbers.
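
As a rough sketch of the phone-first idea, each entry could be reduced to a
comparison key: the digits of the phone number when there is one, and a
normalized address otherwise.  The function names and the abbreviation
table below are assumptions made up for illustration, and the address
clean-up is only a stand-in for whatever normalization is already in place.

```python
import re

# Hypothetical abbreviation table; extend it to match the real data.
ABBREVIATIONS = {"w": "west", "e": "east", "n": "north", "s": "south",
                 "ave": "avenue"}

def normalize_phone(phone):
    """Keep only the digits, so '(555) 123-4567' and '555.123.4567' match."""
    return re.sub(r"\D", "", phone)

def normalize_address(address):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def dedup_key(phone, address):
    """Prefer the phone number; fall back to the address when it is missing."""
    digits = normalize_phone(phone)
    if digits:
        return ("phone", digits)
    return ("address", normalize_address(address))
```

One limitation of this sketch: an entry with a phone number and an entry
without one will never share a key, even if their addresses agree, so a
second pass keyed on address alone may still be needed.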

I have already come up with some code to work around most of the obvious
problems with the different ways addresses may be input, but I'm not sure
what the best way to find the duplicates might be.  One suggestion I have
received is to sort the file so that lines with identical fields end up
back to back, flag them with an equivalence check, and then read through
the file manually.  The equivalence check itself seems simple, but I'm not
sure how to compare only the target field (split(), maybe?), and I
certainly want to avoid having to manually remove the duplicates afterward.
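
One possible shape for this, sketched under assumptions: read each file
with Python's standard csv module rather than split(','), because a plain
split breaks on quoted fields that contain commas, and keep a set of keys
already seen so duplicates are dropped automatically.  The function name
remove_duplicates, the sample rows, and the simple strip-and-lowercase key
are all invented for illustration.

```python
import csv
import io

def remove_duplicates(rows, field):
    """Return rows in their original order, keeping the first row per key.

    Rows whose target field is blank are always kept, since there is
    nothing to compare them on.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row[field].strip().lower()
        if key and key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        unique.append(row)
    return unique

# Made-up sample data standing in for one of the files.
sample = io.StringIO(
    'Contact Name,Company Name,Phone Number,Address\n'
    'Ann Lee,"Acme, Inc.",555-1234,12 West Main\n'
    'A. Lee,"Acme, Inc.",555-1234,12 W. Main\n'
    'Bob Ray,Bolt Co.,,9 Oak Ave\n'
)
rows = list(csv.DictReader(sample))
unique_rows = remove_duplicates(rows, "Phone Number")
```

Because the set lookup replaces the sort-and-compare step, no sorting is
required and the surviving rows stay in file order; they could then be
written back out with csv.DictWriter.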

Has anyone already worked out a good approach?

TIA,
Rob Andrews
Useless Python Repository
http://www.lowerstandard.com/python/pythonsource.html