Sorting in huge files
Steven Bethard
steven.bethard at gmail.com
Tue Dec 7 16:34:43 EST 2004
Paul wrote:
> I expect a few repeats for most of the keys, and that s actually part
> of what I want to figure out in the end. (Said loosely, I want to group
> all the data entries having "similar" keys. For this I need to sort the
> keys first (data entries having _same_ key), and then figure out which
> keys are "similar").
If this is really your final goal, you may not want to sort. Consider
code like the following:
>>> entries = [('a', '4'),
... ('x', '7'),
... ('a', '2'),
... ('b', '7'),
... ('x', '4')]
>>> counts = {}
>>> for entry in entries:
... key = entry[0]
... counts.setdefault(key, []).append(entry)
...
>>> for key in counts:
... print key, counts[key]
...
a [('a', '4'), ('a', '2')]
x [('x', '7'), ('x', '4')]
b [('b', '7')]
I've grouped all entries with the same key together using a dict object
and without the need for any sorting. If you had a good definition of
'similar', you could perhaps map all 'similar' keys to the same value in
the dict.
If you really do need to sort, Python 2.4 provides a very nice way to
sort by a particular key:
>>> import operator
>>> entries = [('a', '4'),
... ('x', '7'),
... ('a', '2'),
... ('b', '7'),
... ('x', '4')]
>>> entries.sort(key=operator.itemgetter(1))
>>> entries
[('a', '2'), ('a', '4'), ('x', '4'), ('x', '7'), ('b', '7')]
Here, I've sorted the entries by the second item in each tuple. If you
go this route, you should also look at itertools.groupby:
>>> import itertools
>>> entries = [('a', '4'),
... ('x', '7'),
... ('a', '2'),
... ('b', '7'),
... ('x', '4')]
>>> entries.sort(key=operator.itemgetter(1))
>>> for key, values in itertools.groupby(entries, operator.itemgetter(1)):
... print key, list(values)
...
2 [('a', '2')]
4 [('a', '4'), ('x', '4')]
7 [('x', '7'), ('b', '7')]
The groupby basically does the sort of grouping of a sorted list that I
think you had in mind...
Steve
More information about the Python-list
mailing list