no more comparisons

Terry Reedy tjreedy at udel.edu
Thu Mar 13 02:55:59 EDT 2008


"Paul Rubin" <"http://phr.cx"@NOSPAM.invalid> wrote in message 
news:7x3aqvf9o2.fsf at ruckus.brouhaha.com...
| "Terry Reedy" <tjreedy at udel.edu> writes:
| > | I don't see what's so inefficient about it necessarily.
| >
| > The key function is called once per list item, for n calls total.  The
| > comparision function is called once per comparision.  There are at 
least
| > n-1 such calls and typically something on the order of n * lg2(n) 
calls.
|
| Right.  Depending on the application, calling the comparison function
| lg(n) times may be faster than calling the key function once.
|
| Example: you want to sort a list of strings case-independently, each
| string containing a few pages worth of text.  There are 10000 strings,
| each maybe 10000 chars long.

If the strings are pulled in from a file, or stored off to a file, as I 
would expect in and real app of this sort, the in-memory data to be sorted 
could/should be tuples of the lowercased strings and files positions.  Then 
no key param is needed.

|  Then (untested):
|
|   x.sort(xs, key=lower)
|
| makes a copy of each of the 10000 strings, makes 10000 tuples, sorts,
| then releases the temporary strings and tuples.  At least 100
| megabytes of memory allocated and 100 million chars copied, even
| though any give pair of strings will usually differ somewhere in the
| first few characters and there's no need to look past those.
|
|   from string import lower
|   from operator import eq
|   from itertools import *
|
|   def xcmp(a,b):
|      for x,y in izip(a,b)):
|         c = cmp(x.lower() y.lower())
|         if c: return c
|      return 0
|
|   x.sort(xcmp)
|
| runs the comparison function about 14000 times,

14 * 10000 = 140000, not 14000.  For each of these, you have a call to izip 
and multiple calls to .next, cmp, and .lower.

That said, this use case of where the potentially needed key is long 
whereas the actually needed key is much shorter is a better use case than 
anything I have seen so far.

Still, as I figure it, the initial part of each text is lowercased around 
28 times, on average.  So there can only be a speed saving if the text is 
more than 28 times the effective actual key.

To actually do such a thing, at least more than once, I would be tempted to 
have the key function take just the first k chars, for k perhaps 50.  If 
the texts average, say, 1500 chars, then the key array would be 1/30 * 
perhaps 2 (for object overhead) = 1/15 of the original array space.  Then 
run over the 'sorted' array to see if there are any cases where the first 
50 chars match and if so, check more to see if a swap needs to be made. 
Easiest would be to do a  simple bubble or insert sort with the full 
compare.

tjr






More information about the Python-list mailing list