Surprising (for me) benchmark results...

Courageous jkraska1 at san.rr.com
Tue May 1 21:31:57 EDT 2001


On Wed, 02 May 2001 01:11:50 GMT, "Rainer Deyke" <root at rainerdeyke.com> wrote:

>"Daniel Berlin" <dan at www.cgsoftware.com> wrote in message
>news:mailman.988762570.8591.python-list at python.org...
>> Actually, the sort is probably not the bottleneck.
>> As the file gets larger, the bottleneck becomes disk time, not sort time.
>
>Disk access is O(N); sorting is typically O(N log N).  Therefore as the
>number of entries in the file size increases, the time taken by the sort
>becomes more significant and the time taken by disk access becomes less
>significant.  This is assuming that the file fits into memory (as was the
>case here).

Disregarding the algorithm and this specific situation in question, you are
both correct; this is because while disk access may scale O(N), this
scaling is multiplied by a huge constant factor that can in many cases
dwarf the sorting algorithm, depending on the number of entries and
the like. Things like this are exactly what I was alluding to when I referenced
"primary and secondary complexity factors" in my other message.

At the moment, for example, I'm working on a modification to Python's
internal hash tables that uses a polymorphic dictionary, where for very
small dictionaries, a simple array is used. Using array logic for lookups
on very small arrays allows me to manually unroll the loops; hidden
complexity costs are elminated in this fashion, and maximum parallelism
is gained. 

Preliminary statistical analysis of the python interpreter show that the vast
majority of all Python dictionaries during the life of the interpreter are of
n < 8, with the majority of those being n < 4. At the moment I've sped up
these common cases by 50%; I'm toying around with MMX/SSE assembly
to see if I can even make them faster.

[N.B.: this is a preliminary finding, and hasn't been well-integrated into
Python yet, so there are still mitigating factors which might undo my ability
to continue with this optimization route].

More later,


Joe Kraska
San Diego
CA




More information about the Python-list mailing list