Sorting in huge files

Tue Dec 7 16:34:43 EST 2004

Paul wrote:
> I expect a few repeats for most of the keys, and that's actually part
> of what I want to figure out in the end. (Said loosely, I want to group
> all the data entries having "similar" keys. For this I need to sort the
> keys first (data entries having _same_ key), and then figure out which
> keys are "similar").

If this is really your final goal, you may not want to sort.  Consider 
code like the following:

 >>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
 >>> groups = {}
 >>> for entry in entries:
...     key = entry[0]
...     groups.setdefault(key, []).append(entry)
...
 >>> for key in groups:
...     print key, groups[key]
...
...
a [('a', '4'), ('a', '2')]
x [('x', '7'), ('x', '4')]
b [('b', '7')]

I've grouped all entries with the same key together using a dict object, 
in a single pass and without any sorting.  If you had a good definition 
of 'similar', you could map every 'similar' key to one canonical key 
before the lookup, so that those entries all land in the same list in 
the dict.
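
As a rough sketch of that idea, suppose 'similar' just meant 'equal once 
case and surrounding whitespace are ignored'.  That definition, the 
canonical() helper, and the sample entries below are all made up for 
illustration; your real notion of similarity would go in its place:

```python
def canonical(key):
    # Hypothetical definition of "similar": keys that differ only in
    # case or surrounding whitespace count as the same key.
    return key.strip().lower()

# Made-up sample data, not taken from the real files.
entries = [('Apple', '4'),
           ('x', '7'),
           ('apple ', '2'),
           ('b', '7'),
           ('X', '4')]

groups = {}
for entry in entries:
    # Look up the canonical form, but store the original entry untouched.
    groups.setdefault(canonical(entry[0]), []).append(entry)

for key in sorted(groups):
    print("%s %s" % (key, groups[key]))
```

Both 'Apple' and 'apple ' end up in the list stored under 'apple', and 
the whole thing is still one pass over the data with no sort.  This only 
works when similarity can be reduced to a single canonical value per key; 
a fuzzier definition (say, edit distance) would need a different approach.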

If you really do need to sort, Python 2.4 provides a very nice way to 
sort by a particular key:

 >>> import operator
 >>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
 >>> entries.sort(key=operator.itemgetter(1))
 >>> entries
[('a', '2'), ('a', '4'), ('x', '4'), ('x', '7'), ('b', '7')]

Here, I've sorted the entries by the second item in each tuple.  If you 
go this route, you should also look at itertools.groupby:

 >>> import itertools
 >>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
 >>> entries.sort(key=operator.itemgetter(1))
 >>> for key, values in itertools.groupby(entries, operator.itemgetter(1)):
...     print key, list(values)
...
2 [('a', '2')]
4 [('a', '4'), ('x', '4')]
7 [('x', '7'), ('b', '7')]

groupby collects consecutive entries that share a key, which is why the 
list has to be sorted by that same key first.  I think this is the kind 
of grouping of a sorted list that you had in mind...
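
To show why the sort matters, here is a small sketch, written as a plain 
script rather than an interactive session.  It groups by the first item 
of each tuple, which is only my guess at what you mean by the key; the 
names get_key, unsorted_keys and grouped are just for illustration:

```python
import itertools
import operator

entries = [('a', '4'),
           ('x', '7'),
           ('a', '2'),
           ('b', '7'),
           ('x', '4')]

# Assumption: the "key" of an entry is the first item of the tuple.
get_key = operator.itemgetter(0)

# Without sorting, groupby starts a new group every time the key changes,
# so 'a' and 'x' each show up twice.
unsorted_keys = [key for key, values in itertools.groupby(entries, get_key)]

# After sorting by the same key function, equal keys are adjacent and
# each key yields exactly one group.
grouped = [(key, list(values))
           for key, values in itertools.groupby(sorted(entries, key=get_key),
                                                get_key)]

print(unsorted_keys)
for key, values in grouped:
    print("%s %s" % (key, values))
```

Because the sort is stable, entries with the same key keep their 
original relative order inside each group.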

Steve