Pythonic search of list of dictionaries

Tim Hochberg tim.hochberg at ieee.org
Tue Jan 4 12:10:30 EST 2005


Bulba! wrote:
> Hello everyone,
> 
> I'm reading the rows from a CSV file. csv.DictReader puts
> those rows into dictionaries.
> 
> The actual files contain old and new translations of software
> strings. The dictionary containing the row data looks like this:
> 
>     o={'TermID':'4', 'English':'System Administration',
> 'Polish':'Zarzadzanie systemem'}
> 
> I put those dictionaries into the list:
> 
>    oldl=[x for x in orig]  # where orig=csv.DictReader(ofile ...
> 
> ..and then search for matching source terms in two loops:
> 
>    for o in oldl:
>        for n in newl:
>            if n['English'] == o['English']:
>            ...
> 
> Now, this works. However, not only this is very un-Pythonic, but also
> very inefficient: the complexity is O(n**2), so it scales up very
> badly.
> 
> What I want to know is if there is some elegant and efficient
> way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n,
> contained in a list (or a dictionary) dy, where dx_i  contains
> a specific value. Or possibly just the first dx_1 dictionary.

Sure, just do a little preprocessing. Something like (untested):

####

def make_map(rows):
     # This assumes that each English key is unique within rows.
     # If it's not, store a list of dicts per key instead of a single dict.
     mapping = {}  # named 'mapping' to avoid shadowing the built-in map()
     for d in rows:
         if 'English' in d:
             key = d['English']
             mapping[key] = d
     return mapping

old_map = make_map(oldl)
new_map = make_map(newl)

for engphrase in old_map:
     if engphrase in new_map:
         o = old_map[engphrase]
         n = new_map[engphrase]
         if n['Polish'] == o['Polish']:
             status=''
         else:
             status='CHANGED'
         # process....

####
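
Since you mention big gaps between the two files, here is a rough sketch of how the same kind of map could also report the phrases that appear in only one file. The helper name index_by_english and the sample rows are made up for illustration; only the 'English' and 'Polish' column names come from your post.

```python
def index_by_english(rows):
    # Hypothetical helper: maps each 'English' value to its row.
    # If a phrase occurs more than once, the last row wins.
    return dict((row['English'], row) for row in rows if 'English' in row)

# Made-up sample rows standing in for the contents of the two CSV files.
old_rows = [
    {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Zarzadzanie systemem'},
    {'TermID': '5', 'English': 'Log out', 'Polish': 'Wyloguj'},
]
new_rows = [
    {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Administracja systemem'},
    {'TermID': '9', 'English': 'Settings', 'Polish': 'Ustawienia'},
]

old_map = index_by_english(old_rows)
new_map = index_by_english(new_rows)

# Set arithmetic on the keys separates the gaps from the matches.
old_keys = set(old_map)
new_keys = set(new_map)

only_in_old = sorted(old_keys - new_keys)   # phrases dropped from the new file
only_in_new = sorted(new_keys - old_keys)   # phrases added in the new file
in_both = sorted(old_keys & new_keys)       # phrases worth comparing
```

Building the two sets takes one pass over each map, so this stays roughly linear in the number of rows.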

I've assumed that the English key is unique in both the old and new 
lists. If it's not, this will need some adjustment. However, your 
original algorithm is going to behave weirdly in that case anyway 
(spitting out multiple lines with the same id, but potentially different 
new terms and update status).

Hope that's useful.

-tim

> 
> I HAVE to search for values corresponding to key 'English', since
> there are big gaps in both files (i.e. there's a lot of rows 
> in the old file that do not correspond to the rows in the new
> file and vice versa). I don't want to do ugly things like converting
> dictionary to a string so I could use string.find() method. 
> 
> Obviously it does not have to be implemented this way. If
> data structures here could be designed in a proper 
> (Pythonesque ;-) way, great. 
> 
> I do realize that this resembles doing some operation on 
> matrixes.  But I have never tried doing smth like this in 
> Python.
> 
> 
> #---------- Code follows ---------
> 
> import sys
> import csv
> 
> class excelpoldialect(csv.Dialect):
>     delimiter=';'
>     doublequote=True
>     lineterminator='\r\n'
>     quotechar='"'
>     quoting=0
>     skipinitialspace=False
> 
> epdialect=excelpoldialect()
> csv.register_dialect('excelpol',epdialect)
> 
> 
> try:
>     ofile=open(sys.argv[1],'rb')
> except IOError:
>     print "Old file %s could not be opened" % (sys.argv[1])
>     sys.exit(1)
> 
> try:
>     tfile=open(sys.argv[2],'rb')
> except IOError:
>     print "New file %s could not be opened" % (sys.argv[2])
>     sys.exit(1)
> 
>     
> titles=csv.reader(ofile, dialect='excelpol').next()
> orig=csv.DictReader(ofile, titles, dialect='excelpol')
> transl=csv.DictReader(tfile, titles, dialect='excelpol')
> 
> cfile=open('cmpfile.csv','wb')
> titles.append('New')
> titles.append('RowChanged')
> cm=csv.DictWriter(cfile,titles, dialect='excelpol')
> cm.writerow(dict(zip(titles,titles)))
> 
> 
> print titles
> print "-------------"
> 
> oldl=[x for x in orig]
> newl=[x for x in transl]
> 
> all=[]
> 
> for o in oldl:
>     for n in newl:
>         if n['English'] == o['English']:
>             if n['Polish'] == o['Polish']:
>                 status=''
>             else:
>                 status='CHANGED'
>             combined={'TermID': o['TermID'], 'English': o['English'],
> 'Polish': o['Polish'], 'New': n['Polish'], 'RowChanged': status}
>             cm.writerow(combined)
>             all.append(combined)
> 
>             
> # duplicates
> 
> dfile=open('dupes.csv','wb')
> dupes=csv.DictWriter(dfile,titles,dialect='excelpol')
> dupes.writerow(dict(zip(titles,titles)))
> 
> """for i in xrange(0,len(all)-2):
>     for j in xrange(i+1, len(all)-1):
>         if (all[i]['English']==all[j]['English']) and
> all[i]['RowChanged']=='CHANGED':
>             dupes.writerow(all[i])
>             dupes.writerow(all[j])"""
>  
> cfile.close()
> ofile.close()
> tfile.close()
> dfile.close()
> 
> 
> --
> 
> Real world is perfectly indifferent to lies that 
> are the foundation of leftist "thinking".



