[Tutor] Comparing two CSV filess using Python
Peter Otten
__peter__ at web.de
Fri Feb 13 10:36:31 CET 2015
andy van wrote:
> Hi, I'm trying to compare two CSV files (and many more like these below).
> I tried many ways, using lists, dictreader and more but nothing gave me
> the output I require. I want to compare all those rows that have same
> !Sample_title and !Sample_geo_accession values (whose positions vary).
> I've been struggling with this for three days now and couldn't come to a
> solution. I highly appreciate any help.
Basically you need to put the csv into dicts that map title/geo to the whole
row. You can then easily get the common rows' keys with set operations.
For each pair of row dicts you can then compare the data
added = newrow.keys() - oldrow.keys() # just dump the values
removed = oldrow.keys() - newrow.keys() # just dump the values
common = newrow.keys() & oldrow.keys() # need to compare the values
# to suppress unchanged ones
>
> CSV1:
>
>
!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
> body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
> foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
> HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
> AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
>
> CSV2:
>
> !Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
> AsDW,ribonucleic acid,FB_0,GSM501444
> foreign,ribonucleic acid,FB,GSM501449
> HJCENV,RNA,12wk,GSM501446
>
> Desired output (with respect to CSV2):
>
> Added:
> {!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22
> 2010}} #Added columns, not rows.
>
> Deleted:
> {} #Since nothing's deleted with respect to CSV2
>
> Changed:
>
> {!Sample_title:AsDW,!Sample_source_name_ch1:
(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
> !Sample_title:HJCENV,!Sample_type:(ribonucleic
> acid,RNA),!Sample_source_name_ch1:
(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
> #foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the
> !Sample_geo_accession column value didn't match.
If you want to cheat, here's an implementation (requires Python 3):
import csv
def get_key(row):
return row["!Sample_title"], row["!Sample_geo_accession"]
def load_csv(filename):
"""Put csv data into a dict that maps title/geo to the complete row.
"""
d = {}
with open(filename) as f:
for row in csv.DictReader(f, delimiter=","):
key = get_key(row)
assert key not in d
d[key] = row
return d
def diffs(old, new):
yield from added_or_removed("ADDED", new.keys() - old.keys(), new)
yield from added_or_removed("REMOVED", old.keys() - new.keys(), old)
yield from changed(old, new)
def compare_row(key, old, new):
i = -1
for i, line in enumerate(diffs(old, new)):
if not i:
print("/".join(key))
print(" " + line)
if i >= 0:
print()
def added_or_removed(state, keys, d):
items = sorted((key, d[key]) for key in keys)
for key, value in items:
yield "{:10}: {:30} | {:30}".format(state, key, value)
def changed(old, new):
common_columns = old.keys() & new.keys()
for column in sorted(common_columns):
oldvalue = old[column]
newvalue = new[column]
if oldvalue != newvalue:
yield "{:10}: {:30} | {:30} | {:30}".format(
"CHANGED",
column,
oldvalue.ljust(30),
newvalue.ljust(30))
if __name__ == "__main__":
oldcsv = load_csv("2.csv")
newcsv = load_csv("1.csv")
# title/geo pairs that occur in both files:
common = oldcsv.keys() & newcsv.keys()
for key in sorted(common):
compare_row(key, oldcsv[key], newcsv[key])
More information about the Tutor
mailing list