Pythonic search of list of dictionaries

Tue Jan 4 17:33:34 EST 2005

Bulba! wrote:
[big snip]

Forget the csv-induced dicts for the moment, they're just an artifact
of your first solution attempt. Whether english = csv_row[1], or
english = csv_row_dict["english"], doesn't matter yet. Let's take a few
steps back, and look at what you are trying to do through a telescope
instead of a microscope.

Core basic root problem: you have a table with 3 columns, id, english,
polish. Nothing is said about id so let's assume nothing. Evidently
contents of "english" is the "key" for this exercise, but it's not
necessarily unique. You have two instances of the table, and you want
to "diff" them using "english" as the key.

You want to collect together all rows that have the same value of
"english", and then process them somehow. You need to define an object
containing a list of all "left" rows and a list of all "right" rows.

Processing the contents of the object: do whatever you have to with
obj.left_list and obj.right_list depending on the lengths; you have 3
cases of length to consider (0, 1, many). 3 x 3 = 9 but if you get both
zero you have a bug (which you should of course assert does not exist),
so that leaves 8 cases to think about.

Now, how do we get the rows together:

(a) The sort method

Step 1: sort each dataset on (english, id, polish) or (english, polish,
id) -- your choice; sorting on the whole record instead just english
makes the ordering predictable and repeatable.

Step 2: read the two sorted datasets ("left" and "right") in parallel:

when left key < right key: do_stuff(); read another left record
when left key > right key: converse
when left ley == right key: do_stuff(); read another record for both
where do_stuff() includes appending to left_list and right_list as
appropriate and at the right moment, process a completed object.
This is a little tricky, and handling end of file needs a little care.
However this algorithm can be implemented with minimal memory ("core"),
no disk drive at all :-O and a minimum of 3 (preferably 4+)
serially-readable rewindable re-writable storage devices e.g magnetic
tape drives. Once upon a time, it was *the* way of maintaining a
database (left = old database, right = transactions, output file -> new
database).

(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)

As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.

Hoping this sketch of the view through the telescope (and the
rear-vision mirror!) is helpful,
John