removing duplicates from .csv files

mspiggie at my-deja.com
Thu Jan 25 14:12:56 EST 2001


I have been given several comma-delimited (.csv) files, each containing
up to several thousand entries.  Among the tasks I've been charged with
is removing duplicate entries.  Each file contains fields for Contact
Name, Company Name, Phone Number, and Address, among other fields that
vary from file to file.

I'm trying to determine a good way to detect duplicates by Phone Number
and by Address.  Matching on Phone Number seems simpler, since data
entry clerks input addresses with minor variations (W, W., and West,
for instance), but not all entries have phone numbers.
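
For the phone comparison, my thought is to reduce each number to bare
digits before comparing, so formatting differences drop out.  A minimal
sketch (the function name is just mine):

import re

def normalize_phone(raw):
    """Reduce a phone number to bare digits so '(219) 555-0142',
    '219-555-0142', and '219.555.0142' all compare equal."""
    return re.sub(r"\D", "", raw or "")

print(normalize_phone("(219) 555-0142"))   # prints 2195550142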

I have already come up with some code to work around most of the
obvious problems with the different ways addresses may be input, but
I'm not sure what the best way to detect the duplicates might be.  One
suggestion I have received is to sort the file so that lines with
identical fields land back to back, run an equivalence check on
adjacent lines, and then read through the file by hand.  The
equivalence check itself seems simple, but I'm not sure how to scan
only the target field (split(), maybe?), and I certainly want to avoid
having to remove the duplicates manually afterward.
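
To make the question concrete, here is the overall shape I have in
mind -- just a sketch, assuming the standard csv module, a header row,
and columns literally named "Phone Number" and "Address" (those column
names and the little abbreviation table are my guesses, not gospel):

import csv
import re

def normalize_phone(raw):
    """Strip everything but digits so formatting differences vanish."""
    return re.sub(r"\D", "", raw or "")

def normalize_address(raw):
    """Lowercase, drop periods, and expand a few common abbreviations."""
    expansions = {"w": "west", "e": "east", "n": "north", "s": "south",
                  "st": "street", "ave": "avenue", "rd": "road"}
    words = (raw or "").lower().replace(".", "").split()
    return " ".join(expansions.get(w, w) for w in words)

def dedupe(infile, outfile):
    """Copy infile to outfile, keeping only the first row seen for each
    key.  Rows are keyed by digits-only phone number, falling back to
    the normalized address when the phone field is empty."""
    seen = set()
    with open(infile, newline="") as src, open(outfile, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = normalize_phone(row.get("Phone Number", ""))
            if not key:
                key = normalize_address(row.get("Address", ""))
            if not key:
                writer.writerow(row)   # nothing to compare on; keep it
                continue
            if key in seen:
                continue               # duplicate of a row already written
            seen.add(key)
            writer.writerow(row)

That way the duplicates would be dropped in a single pass instead of
merely being flagged for manual review.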

Has anyone already worked out a good approach?

TIA,
Rob Andrews


Sent via Deja.com
http://www.deja.com/


