[Tutor] Comparing two CSV filess using Python

Fri Feb 13 10:36:31 CET 2015

andy van wrote:

> Hi, I'm trying to compare two CSV files (and many more like these below).
> I tried many ways, using lists, dictreader and more but nothing gave me
> the output I require. I want to compare all those rows that have same
> !Sample_title and !Sample_geo_accession values (whose positions vary).
> I've been struggling with this for three days now and couldn't come to a
> solution. I highly appreciate any help.

Basically you need to put the csv into dicts that map title/geo to the whole 
row. You can then easily get the common rows' keys with set operations.

For each pair of row dicts you can then compare the data

added = newrow.keys() - oldrow.keys()   # just dump the values
removed = oldrow.keys() - newrow.keys() # just dump the values
common = newrow.keys() & oldrow.keys()  # need to compare the values 
                                        # to suppress unchanged ones
> 
> CSV1:
> 
> 
!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
> body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
> foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
> HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
> AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
> 
> CSV2:
> 
> !Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
> AsDW,ribonucleic acid,FB_0,GSM501444
> foreign,ribonucleic acid,FB,GSM501449
> HJCENV,RNA,12wk,GSM501446
> 
> Desired output (with respect to CSV2):
> 
> Added:
> {!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22
> 2010}} #Added columns, not rows.
> 
> Deleted:
> {} #Since nothing's deleted with respect to CSV2
> 
> Changed:
> 
> {!Sample_title:AsDW,!Sample_source_name_ch1:
(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
> !Sample_title:HJCENV,!Sample_type:(ribonucleic
> acid,RNA),!Sample_source_name_ch1:
(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
> #foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the
> !Sample_geo_accession column value didn't match.

If you want to cheat, here's an implementation (requires Python 3):

import csv

def get_key(row):
    return row["!Sample_title"], row["!Sample_geo_accession"]

def load_csv(filename):
    """Put csv data into a dict that maps title/geo to the complete row.
    """
    d = {}
    with open(filename) as f:
        for row in csv.DictReader(f, delimiter=","):
            key = get_key(row)
            assert key not in d
            d[key] = row
    return d

def diffs(old, new):
    yield from added_or_removed("ADDED", new.keys() - old.keys(), new)
    yield from added_or_removed("REMOVED", old.keys() - new.keys(), old)
    yield from changed(old, new)

def compare_row(key, old, new):
    i = -1
    for i, line in enumerate(diffs(old, new)):
        if not i:
            print("/".join(key))
        print("    " + line)
    if i >= 0:
        print()

def added_or_removed(state, keys, d):
    items = sorted((key, d[key]) for key in keys)
    for key, value in items:
        yield "{:10}: {:30} | {:30}".format(state, key, value)

def changed(old, new):
    common_columns = old.keys() & new.keys()
    for column in sorted(common_columns):
        oldvalue = old[column]
        newvalue = new[column]
        if oldvalue != newvalue:
            yield "{:10}: {:30} | {:30} | {:30}".format(
            "CHANGED",
            column, 
            oldvalue.ljust(30),
            newvalue.ljust(30))

if __name__ == "__main__":
    oldcsv = load_csv("2.csv")
    newcsv = load_csv("1.csv")
    # title/geo pairs that occur in both files:
    common = oldcsv.keys() & newcsv.keys() 
    for key in sorted(common):
        compare_row(key, oldcsv[key], newcsv[key])