[Patches] [ python-Patches-425242 ] Patch which "inlines" small dictionaries
noreply@sourceforge.net
Tue, 22 May 2001 12:15:27 -0700
Patches item #425242, was updated on 2001-05-18 10:15
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=425242&group_id=5470
Category: core (C code)
Group: None
Status: Open
>Resolution: Accepted
Priority: 5
Submitted By: M.-A. Lemburg (lemburg)
>Assigned to: Tim Peters (tim_one)
>Summary: Patch which "inlines" small dictionaries
Initial Comment:
As discussed on python-dev, this patch inlines small
dictionary tables directly in the dictionary object.
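The core idea can be sketched as a struct whose smallest hash table lives inside the object itself, so small dicts need no separate malloc. The names below are illustrative, not the actual dictobject.c declarations:

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch of an "inlined small table" dict -- NOT the
 * real CPython declarations.  The first MINSIZE slots live inside
 * the object itself, so small dicts never touch the heap. */
#define MINSIZE 8

typedef struct {
    long  hash;
    void *key;
    void *value;
} dictentry;

typedef struct {
    int        fill;   /* slots that are or once were occupied */
    int        used;   /* live entries */
    int        size;   /* table size, always a power of 2 */
    dictentry *table;  /* == smalltable until the dict outgrows it */
    dictentry  smalltable[MINSIZE];
} smalldict;

static void smalldict_init(smalldict *d)
{
    d->fill = d->used = 0;
    d->size = MINSIZE;
    d->table = d->smalltable;  /* no heap allocation for small dicts */
    memset(d->smalltable, 0, sizeof(d->smalltable));
}
```

Only when the dict grows past the small table does `table` get repointed at a malloc'ed array; until then every lookup goes through memory that was allocated along with the object.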
----------------------------------------------------------------------
>Comment By: Jeremy Hylton (jhylton)
Date: 2001-05-22 12:15
Message:
Logged In: YES
user_id=31392
timdict2.patch looks good on the few little tests and real
applications that I reported below. So measurement suggests
it does no harm, and reason suggests it helps.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-22 12:03
Message:
Logged In: YES
user_id=31435
You're suffering an off-by-10000x error <wink>:
test_mutants as-is actually creates 100 pairs of dicts (so
that's 200 dicts) of average size 50, and fills the keys
and values of each with instances of class Horrid (so
that's about 20,000 additional instance dicts).
If Jeremy is still keen to time things (I kinda doubt it:
we're not seeing any surprises), I'd much rather time real
programs now -- in real life dicts aren't just created and
destroyed, they're also accessed, and the small-but-real
getitem/setitem critical-path reductions are aimed at the
latter.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-22 10:15
Message:
Logged In: YES
user_id=38388
Jeremy, I think you ought to test the patches against
benchmarks which actually make extensive use of dictionaries
(test_mutant only uses 2-3 dictionaries).
pybench's Instances.py looks like a good start....
----------------------------------------------------------------------
Comment By: Jeremy Hylton (jhylton)
Date: 2001-05-22 08:45
Message:
Logged In: YES
user_id=31392
Comparing three different dict impls on the same tests.
CVS -- Python 2.2 CVS repository as of May 21, 2001
TIM -- Tim's dict patches to inline small dicts
MAL -- MAL's dict patches to inline small dicts
Test machine: Jeremy's 800MHz PIII, 256MB RAM, Linux 2.2.17
Compiled with gcc 2.95.3 -O3
Tests:
pystone -- the standard pystone benchmark
mutants -- the test_mutants test from the Python regression test
compile -- the compiler package compiling itself
Results
Results for the macro benchmarks are the median of 9 runs. For the
mutants and compile benchmarks, the measurement is the combined user
and system time reported by the Unix time utility. For the pystone
benchmark, the result is the time reported by pystone.
Result table
test          CVS     TIM     MAL
test_mutant   1.49    1.48    1.53
pystone       0.94    0.91    0.94
compile       11.7    11.65   11.83
I tried running a few of the pybench tests with gprof enabled, but
the time the runs took varied wildly -- a range of a few seconds.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-22 02:54
Message:
Logged In: YES
user_id=38388
I think I better leave the timing to Jeremy then. It'll be
another week or two before I get the new machine up and
running.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-21 22:29
Message:
Logged In: YES
user_id=31435
I don't know, but I'd guess you're also using different
releases of compilers, libraries and an OS since you first
did this under 1.5. Any of those could account for it too
(e.g., in my Win9x days I traced many timing surprises to
bizarre system malloc() behavior); the Python codebase is
simply larger now too, thanks to mounds of useless new
stuff like Unicode <wink>.
BTW, bump the priority on getting that new box: I finally
did that myself earlier this year, and it's still a joy
that *everything* runs 5x faster. My old machine wasn't
quite quick enough to run Tomb Raider in high resolution,
but now I can do that and test Python at the same time.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-21 14:01
Message:
Logged In: YES
user_id=38388
I was a bit surprised at my timings, since I would have
expected a speedup too (I did get a speedup for Python 1.5
back in the old days). I think this has to do with CPU
caches and branch prediction.
In any case, a new machine is in the making :-) This old box
just ain't no fun any more...
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-21 13:38
Message:
Logged In: YES
user_id=31435
On my WinTel Win2K box at work, current CVS is ~12350
pystone w/o my patch and ~12500 with it. Jeremy's timings
on his Linux+gcc box show small but reproducible
improvements on some other "realer" programs. I'm not
surprised if it actually slows things on *some* platforms,
because that's the nature of low-level optimization work.
But the changes simplify the code on frequent paths,
reducing the start-to-finish operation counts, and over
time that's the only thing that reliably improves x-
platform performance.
BTW, the pystone results suggest you could get a factor of
6 improvement in an hour by buying a midrange new machine
<wink>.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-21 06:53
Message:
Logged In: YES
user_id=38388
I've tested both patches against the unpatched CVS version
with pystone. The results are surprising: CVS benchmarks at
2300 pystones while your patch runs at 2200 pystones and my
patch at 2100 pystones.
I think we need to put some more thought into this...
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-21 02:07
Message:
Logged In: YES
user_id=31435
There's more explanation about how to bump MINSIZE in the
patch than in current dictobject.c. Read the new
comments! You have to choose a power of 2 and fiddle the
polys[] vector to match. That's all.
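That coupling can be sketched as follows. The poly values below are illustrative placeholders, not the real dictobject.c table; the point is only that MINSIZE must stay a power of 2 and polys[] must start at the entry for that size:

```c
#include <assert.h>

/* Sketch of the MINSIZE/polys[] coupling: MINSIZE must be a power
 * of 2, and polys[] supplies one probe polynomial per supported
 * table size starting at MINSIZE.  Bumping MINSIZE therefore means
 * dropping the entries for the smaller sizes.  These poly values
 * are placeholders, not the real dictobject.c constants. */
#define MINSIZE 8

static long polys[] = { 11, 19, 37, 67 };  /* sizes 8,16,32,64 (placeholders) */

static int minsize_is_power_of_two(void)
{
    return MINSIZE > 0 && (MINSIZE & (MINSIZE - 1)) == 0;
}

static long poly_for_size(int size)
{
    int i = 0;
    int s = MINSIZE;
    while (s < size) {  /* walk up the power-of-2 sizes */
        s <<= 1;
        i++;
    }
    return polys[i];
}
```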
The code defines a ma_smalltable member just as yours did.
The rest is details. Primarily it's more aggressive about
exploiting the new member, because I found it was more
efficient to avoid undue work in the empty case by looping
on the ma_fill count rather than by continuing to
constantly fiddle around with NULL-pointer checks all over
dictobject.c (they've been replaced by asserts). It does
pay for a memset() of the ma_smalltable member during
creation, though.
It would be interesting to try both patches, but I've already
spent too much time on this to debug your patch too. It would
be good if you could upload a new version of your patch.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-21 01:25
Message:
Logged In: YES
user_id=38388
Hmm, timdict2.patch seems to be a completely different
solution. What I am missing in that patch is the ability to
tune the implementation... what happens if I want to bump
MINSIZE to some higher value? I think the patch should at
least include an explanation of how this can be done.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-20 15:31
Message:
Logged In: YES
user_id=31435
Just deleted my first patch.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-20 15:30
Message:
Logged In: YES
user_id=31435
timdict2.patch is more aggressive about iterating until
we've seen fill non-virgin entries go by, rather than looping
over the full table size. This is never slower but is
sometimes a surprising win! For example, dicts keyed by a
contiguous range of small integers tend to fill a solid
initial segment of the dict. When resizing or copying, after
the patch it's common for the loop to get out after, e.g.,
~170 iterations instead of the current (and invariable) 256.
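A minimal sketch of that early-exit scan, with hypothetical names (here fill counts slots that are or ever were occupied, matching the invariant described above):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    void *key;
    void *value;
} dictentry;

/* Early-exit scan: stop as soon as `fill` non-virgin slots have
 * been seen instead of always walking all `size` slots.  For dicts
 * whose entries cluster in an initial segment, this exits well
 * before the end of the table. */
static int count_scanned(dictentry *table, int size, int fill)
{
    int i, seen = 0;
    for (i = 0; seen < fill; i++) {
        assert(i < size);            /* fill slots must exist */
        if (table[i].key != NULL)    /* non-virgin slot */
            seen++;
        /* a real resize/copy would process table[i] here */
    }
    return i;  /* number of slots actually examined */
}
```

When the occupied slots form an initial segment, the loop examines exactly fill slots; only scattered entries force it deeper into the table.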
Also added some desperately needed comments about what the
GF(2^n)-{0} business means in pragmatic terms; + a few
assorted cleanups.
I think this is ready for prime time, but do have concerns
about the increased memory use (embedding an 8-slot table
in every dict object means every dict bites another 8*3*4
== 96 bytes on a 32-bit box; OTOH, dicts with up to and
including 5 entries never need more space than that, and
dicts with 3, 4 or 5 entries were actually larger before
due to the malloc overhead tagging along with their
dynamically allocated 8-slot table).
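The arithmetic above, spelled out (a 4-byte word, i.e. a 32-bit box, is assumed):

```c
/* Space cost of the embedded table on a 32-bit box: each slot holds
 * three words (hash, key, value), so the 8-slot table adds
 * 8 * 3 * 4 bytes to every dict object -- offset by the fact that
 * dicts of up to 5 entries no longer need a malloc'ed table at all. */
enum { SLOTS = 8, WORDS_PER_SLOT = 3, BYTES_PER_WORD = 4 };

static int embedded_table_bytes(void)
{
    return SLOTS * WORDS_PER_SLOT * BYTES_PER_WORD;
}
```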
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-05-19 20:40
Message:
Logged In: YES
user_id=31435
Assigned to Jeremy since he's been timing stuff lately
too: care to give it a try?
timdict.patch is in some ways more aggressive than MAL's
patch, taking advantage of the fact that we always have *some*
memory to point to now, which allows getting rid of a bunch
of table==NULL tests and the associated gotos and labels.
The newer patch passes all the tests I've thrown at it, in
debug and release builds. It was a major bitch to get
working correctly, due to an excruciating interaction among
PyDict_Clear() and garbage collection and module
finalization and the special treatment of __builtins__ (who
would have guessed that allocating a tuple could cause a
module to finalize!?).
I never figured out why MAL's patch died, and after
spending 10 hours figuring out why mine did, don't intend
to waste the rest of the weekend trying <wink>.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-05-18 10:21
Message:
Logged In: YES
user_id=38388
This is just a quick update of the original patch which I
prepared for Python 1.5 some years ago. Since the dictionary
implementation has evolved somewhat since then, I'm not sure
whether it still works as robustly as it did for Python 1.5
(I've been using this for years without any problem).
Running the test suite, there appears to be a hang due to
some endless loop which occurs in test_popen2. I'm not
sure where Python hangs, but it could well be that the
resize routines have to be adapted a little for the current
dict implementation.
----------------------------------------------------------------------