removing duplicates from .csv files

mspiggie at my-deja.com
Fri Jan 26 12:10:08 EST 2001


In article <94ptre$uf0$1 at nnrp1.deja.com>,
  mspiggie at my-deja.com wrote:
> I have been given several comma-delimited (.csv) files, each containing
> as many as several thousand lines of entries.  Among the tasks I've
> been charged with is to remove duplicate entries.  The files each
> contain fields for Contact Name, Company Name, Phone Number, and
> Address, among other fields, which vary from file to file.
>
> I'm trying to determine a good way to sort for duplicates according to
> Phone Number and according to Address.  It seems that sorting by Phone
> Number would be simpler due to minor differences in the way data entry
> clerks might have input the addresses (W, W., and West, for instance),
> but not all entries have phone numbers.
>
> I have already come up with some code to work around most of the
> obvious problems with the different ways addresses may be input, but
> I'm not sure what the best way to sort for duplicates might be.  One
> suggestion I have received is to have lines with identical fields
> placed back to back with an equivalence check and manually read through
> the file.  The equivalence check itself seems simple, but I'm not sure
> how to scan only the target field (split(), maybe?), and I certainly
> want to avoid having to manually remove the duplicates afterward.
>
> Has anyone already worked out a good approach?
>
> TIA,
> Rob Andrews

So far I've received several bits of input on this, so thanks to all
parties (and of course keep it coming ;-> ).  The suggestions thus far
seem to involve combinations of manual labor, Perl, MySQL, Excel, and
Unix commands.  This makes me even more determined to develop as much
of the solution as I can in Python and open the source for others to
use later on in similar situations.  I don't know how much of it I'll
pull off in time for my project deadline, but maybe I can pull it off
in time for some random stranger's deadline later on.
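
For anyone curious about the direction, here is a rough sketch rather than
finished code. It assumes each file has a header row with columns named
"Phone Number" and "Address" (the names from my original post; real files
may differ), and the abbreviation table is only a hypothetical starting
point. A row counts as a duplicate if its normalized phone number or its
normalized address has already been seen, so entries without a phone
number fall back to the address check. The function names are my own
invention for illustration.

```python
import csv
import re

# Hypothetical abbreviation table; extend it to match the real data.
ABBREVIATIONS = {"w": "west", "e": "east", "n": "north", "s": "south"}


def normalize_phone(phone):
    """Keep digits only, so '(555) 123-4567' and '555-123-4567' compare equal."""
    return re.sub(r"\D", "", phone)


def normalize_address(address):
    """Lowercase, drop punctuation, and expand the abbreviations listed above."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def dedupe(rows, phone_field="Phone Number", address_field="Address"):
    """Return the first row seen for each distinct phone number or address."""
    seen = set()
    unique = []
    for row in rows:
        keys = []
        phone = normalize_phone(row.get(phone_field) or "")
        address = normalize_address(row.get(address_field) or "")
        if phone:
            keys.append(("phone", phone))
        if address:
            keys.append(("address", address))
        # A match on either key marks the row as a duplicate.
        if any(key in seen for key in keys):
            continue
        seen.update(keys)
        unique.append(row)
    return unique


def dedupe_file(in_path, out_path):
    """Read a .csv file with a header row and write the unique rows back out."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = dedupe(reader)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Using csv.DictReader instead of a bare split() means commas inside quoted
fields are handled correctly. Rows with neither a phone number nor an
address are always kept, since there is nothing to compare them on.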

who-needs-sleep-ly y'rs,
Rob




