removing duplicates from .csv files

Mark Wilson m.wilson at bibliocraft.com
Thu Jan 25 18:22:42 EST 2001


Rob Andrews:
>I have been given several comma-delimited (.csv) files,
> each containing as many as several thousand lines of
>entries.  Among the tasks I've been charged with is to
>remove duplicate entries. The files each contain fields for
>Contact Name, Company Name, Phone Number, and Address,
>among other fields, which vary from file to file.
[ ... ]
>Has anyone already worked out a good approach?

Can't say that we've "worked out" something, but I do have
experience de-duping a similar kind of data set.

Maybe this seems like overkill, but I'd suggest stuffing all
your data into a database (one with a Python interface, of
course - eg MySQL) and then using database queries to
locate records with duplicate entries. One benefit of this
approach is that you can use LIKE (with wildcards) to
locate a series of records. This will allow you to identify
that "Mr X. Y. Smokealot" is the same guy as "Xavier
smokealot" (using a select where name like
'%x%smokealot%'). MySQL also supports regular expressions
in its WHERE clauses, which may be useful.
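A minimal sketch of the idea, using Python's built-in
sqlite3 as a stand-in for MySQL (the table layout and the
sample rows are invented for illustration):

```python
import sqlite3

# In-memory database standing in for MySQL; the contacts
# table and its rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, company TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("Mr X. Y. Smokealot", "Acme", "555-0101"),
     ("Xavier smokealot", "Acme", "555-0101"),
     ("Jane Doe", "Widgets Inc", "555-0102")],
)

# A wildcard LIKE picks up both spellings of the same person
# (LIKE is case-insensitive for ASCII here, as in MySQL).
rows = conn.execute(
    "SELECT name FROM contacts WHERE name LIKE '%x%smokealot%'"
).fetchall()
for (name,) in rows:
    print(name)
```

With MySQL you'd swap the connect call for your MySQL
module's equivalent; the query itself is the same.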

Obviously, you'll need to write code to scan through what's
there (eg get the list of all names), abstract it a bit
(get just the surname and an initial), generate suitable
queries from that, and examine the results (eg check that
X. Y. Smokealot works at the same place as Xavier
smokealot). Given the nature of the inconsistencies you're
likely to find in a data set prepared by a human being, you
could spend a long time writing such code to catch every
possibility, so I'd also suggest that a framework in which
you can perform manual searches and edit records (including
copy/paste of text) is going to be essential to get the data
perfect.
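One possible sketch of that "abstract it a bit" step -
pulling a surname and first initial out of each name to
drive the LIKE queries. The parsing rules here are an
assumption and will certainly need tuning for real data:

```python
def surname_and_initial(name):
    """Crude abstraction of a contact name: last word is the
    surname, first letter of the first remaining word is the
    initial.  Real data needs more rules (suffixes, multi-word
    surnames, etc.)."""
    words = [w.strip(".") for w in name.split()]
    # Drop common titles; this list is illustrative, not complete.
    words = [w for w in words if w.lower() not in ("mr", "mrs", "ms", "dr")]
    surname = words[-1].lower()
    initial = words[0][0].lower()
    return surname, initial

def like_pattern(name):
    """Build a wildcard pattern for a WHERE ... LIKE query."""
    surname, initial = surname_and_initial(name)
    return "%%%s%%%s%%" % (initial, surname)

print(surname_and_initial("Mr X. Y. Smokealot"))  # ('smokealot', 'x')
print(like_pattern("Xavier smokealot"))           # %x%smokealot%
```

Both spellings of Mr Smokealot reduce to the same pattern,
which is what lets the database query find the pair.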

It's pretty easy to get MySQL up and running and hooked into
Python, and getting your data into MySQL is trivial (OK,
your files contain different fields, but you could account
for that using Python to import the data, or use Excel to
create a set of .csv files with common fields and import
those directly into the database). You could then spend your
time writing Python code for the more esoteric de-duping
issues, rather than re-inventing functionality that is
already present in a database engine. Some of the UI tools
that come with MySQL could be handy for those final tweaks
too.
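For the import step, the csv module's DictReader makes it
easy to map files with differing fields onto one common
schema before they go into the database. A sketch - the
column headings and the mapping are invented for
illustration:

```python
import csv
import io

# Map each file's column headings onto a common schema.
# These headings are examples, not your actual files'.
FIELD_MAP = {
    "Contact Name": "name", "Name": "name",
    "Company Name": "company", "Company": "company",
    "Phone Number": "phone", "Phone": "phone",
}

def common_rows(fileobj):
    """Yield (name, company, phone) tuples from a csv file,
    silently ignoring fields outside the common schema."""
    for row in csv.DictReader(fileobj):
        out = {"name": "", "company": "", "phone": ""}
        for heading, value in row.items():
            if heading in FIELD_MAP:
                out[FIELD_MAP[heading]] = value
        yield (out["name"], out["company"], out["phone"])

sample = io.StringIO(
    "Name,Company,Phone,Fax\n"
    "Jane Doe,Widgets Inc,555-0102,555-0103\n")
print(list(common_rows(sample)))  # [('Jane Doe', 'Widgets Inc', '555-0102')]
```

Feed the tuples straight into executemany() inserts and all
your files end up in one uniformly-shaped table, ready for
the duplicate-hunting queries.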

Others may have other ideas, but I hope that helps.

Mark
