extracting duplicates from CSV file by specific fields
VP
vadim.pestovnikov at gmail.com
Wed Apr 29 01:52:37 EDT 2009
Thanks guys!
Tested, and it seems to be working.
CSV file:
---------
"a.a","sn-01"
"b.b","sn-02"
"c.c","sn-03"
"d.d","sn-04"
"e.e","sn-05"
"f.f","sn-06"
"g.g","sn-07"
"h.h","sn-08"
"i.i","sn-09"
"a.a","sn-10"
"k.k","sn-02"
"i.i","sn-09"
Source:
---------
#!/usr/bin/env python
import csv

unqs = []
dups = []
seen_in_field0 = set()
seen_in_field1 = set()

reader = csv.reader(open("myfile.csv", "rb"))

print "\nOriginals:\n"
for row in reader:
    print row
    # A row is a duplicate if either field was already seen in a unique row.
    if row[0] in seen_in_field0 or row[1] in seen_in_field1:
        dups.append(row)
    else:
        seen_in_field0.add(row[0])
        seen_in_field1.add(row[1])
        unqs.append(row)

print "\nUniques:\n"
for row in unqs:
    print row

print "\nDuplicates:\n"
for row in dups:
    print row
print "\n"
Result:
---------
Originals:

['a.a', 'sn-01']
['b.b', 'sn-02']
['c.c', 'sn-03']
['d.d', 'sn-04']
['e.e', 'sn-05']
['f.f', 'sn-06']
['g.g', 'sn-07']
['h.h', 'sn-08']
['i.i', 'sn-09']
['a.a', 'sn-10']
['k.k', 'sn-02']
['i.i', 'sn-09']

Uniques:

['a.a', 'sn-01']
['b.b', 'sn-02']
['c.c', 'sn-03']
['d.d', 'sn-04']
['e.e', 'sn-05']
['f.f', 'sn-06']
['g.g', 'sn-07']
['h.h', 'sn-08']
['i.i', 'sn-09']

Duplicates:

['a.a', 'sn-10']
['k.k', 'sn-02']
['i.i', 'sn-09']
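
For anyone trying this on Python 3, here is a rough sketch of the same approach wrapped in a function. It is not a definitive version: the names split_duplicates, fields and SAMPLE are my own, and io.StringIO stands in for the real file so it runs without myfile.csv on disk. On Python 3 the file would be opened in text mode with newline="" rather than "rb".

```python
import csv
import io

# Sample data copied from the CSV file above. io.StringIO is a stand-in
# (assumption) for open("myfile.csv", newline="") so no file is needed.
SAMPLE = '''"a.a","sn-01"
"b.b","sn-02"
"c.c","sn-03"
"d.d","sn-04"
"e.e","sn-05"
"f.f","sn-06"
"g.g","sn-07"
"h.h","sn-08"
"i.i","sn-09"
"a.a","sn-10"
"k.k","sn-02"
"i.i","sn-09"
'''


def split_duplicates(rows, fields=(0, 1)):
    """Return (uniques, duplicates) for an iterable of rows.

    A row counts as a duplicate if its value in ANY of the given columns
    already appeared in that column of an earlier unique row.
    """
    seen = {i: set() for i in fields}  # one "seen" set per checked column
    uniques, duplicates = [], []
    for row in rows:
        if any(row[i] in seen[i] for i in fields):
            duplicates.append(row)
        else:
            for i in fields:
                seen[i].add(row[i])
            uniques.append(row)
    return uniques, duplicates


uniques, duplicates = split_duplicates(csv.reader(io.StringIO(SAMPLE)))

print("Uniques:")
for row in uniques:
    print(row)

print("Duplicates:")
for row in duplicates:
    print(row)
```

As in the script above, a row that ends up in the duplicates list does not register its own values. So "sn-10" is never remembered, and a later row with a new name and "sn-10" would still count as unique. If that is not what you want, add the values to the seen sets in both branches.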