Is a 32-bit build faster than a 64-bit build

Sat Nov 13 07:28:58 EST 2010

On Fri, 12 Nov 2010 13:24:09 -0800 (PST)
Raymond Hettinger <python at rcn.com> wrote:
> Has anyone here benchmarked a 32-bit Python versus a 64-bit Python for
> Django or some other webserver?
> 
> My hypothesis is that for apps not needing the 64-bit address space,
> the 32-bit version has better memory utilization and hence better
> cache performance.  If so, then switching python versions may enable a
> single server to handle a greater traffic load.  Has anyone here tried
> that?

On micro-benchmarks, x86-64 is always faster by about 10-30% compared
to x86, simply because of the extended register set and other
additional niceties.
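
When comparing the two kinds of build, it helps to confirm which one a
given interpreter actually is. The following is a minimal sketch using
only the standard library; the helper name pointer_bits is made up for
illustration, and the printed object size is only a rough indication of
how pointer width affects memory use.

```python
import struct
import sys

def pointer_bits():
    # "P" is the struct format code for a C pointer (void *), so its size
    # in bytes times 8 gives the pointer width of this interpreter build.
    return struct.calcsize("P") * 8

print("%d-bit build" % pointer_bits())
# sys.maxsize is the largest Py_ssize_t, which also follows the build width.
print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
# Container objects hold pointers, so they are smaller on a 32-bit build.
print("empty list: %d bytes" % sys.getsizeof([]))
```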

On a benchmark stressing the memory system a little more such as
dcbench-py3k.py in http://bugs.python.org/issue9520, the 64-bit build
is still faster until the tested data structure (a large dict)
overwhelms the 2MB last-level cache in my CPU, after which the 32-bit
build becomes about 10% faster (on lookups) for the same number of
elements.

To be clear, here are the figures in 64-bit mode:

   10000 words (    9092 keys), 2893621 inserts/s, 13426069 lookups/s, 86 bytes/key (0.8MB)
   20000 words (   17699 keys), 3206654 inserts/s, 12338002 lookups/s, 44 bytes/key (0.8MB)
   40000 words (   34490 keys), 2613517 inserts/s, 7643726 lookups/s, 91 bytes/key (3.0MB)
   80000 words (   67148 keys), 2579562 inserts/s, 4872069 lookups/s, 46 bytes/key (3.0MB)
  160000 words (  130897 keys), 2377487 inserts/s, 5765316 lookups/s, 48 bytes/key (6.0MB)
  320000 words (  254233 keys), 2119978 inserts/s, 5003979 lookups/s, 49 bytes/key (12.0MB)
  640000 words (  493191 keys), 1965413 inserts/s, 4640743 lookups/s, 51 bytes/key (24.0MB)
 1280000 words (  956820 keys), 1854546 inserts/s, 4338543 lookups/s, 52 bytes/key (48.0MB)

And here are the figures in 32-bit mode:

   10000 words (    9092 keys), 2250163 inserts/s, 9487229 lookups/s, 43 bytes/key (0.4MB)
   20000 words (   17699 keys), 2543235 inserts/s, 7653839 lookups/s, 22 bytes/key (0.4MB)
   40000 words (   34490 keys), 2360162 inserts/s, 8851543 lookups/s, 45 bytes/key (1.5MB)
   80000 words (   67148 keys), 2415169 inserts/s, 8581037 lookups/s, 23 bytes/key (1.5MB)
  160000 words (  130897 keys), 2203071 inserts/s, 6914732 lookups/s, 24 bytes/key (3.0MB)
  320000 words (  254233 keys), 2005980 inserts/s, 5670133 lookups/s, 24 bytes/key (6.0MB)
  640000 words (  493191 keys), 1856385 inserts/s, 4929790 lookups/s, 25 bytes/key (12.0MB)
 1280000 words (  956820 keys), 1746364 inserts/s, 4530747 lookups/s, 26 bytes/key (24.0MB)
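
For anyone wanting to produce figures of the same shape on their own
machine, here is a minimal sketch. It is not the dcbench-py3k.py script
from the issue above: the function name bench_dict, the synthetic keys
and the way memory is estimated are all assumptions made for
illustration, so absolute numbers will not match the tables.

```python
import sys
import time

def bench_dict(n_words):
    # Hypothetical workload: synthetic string keys with some repeats,
    # standing in for the word list used by the real benchmark.
    n_distinct = n_words * 9 // 10 + 1
    words = ["w%d" % (i % n_distinct) for i in range(n_words)]

    d = {}
    t0 = time.perf_counter()
    for w in words:
        d[w] = None
    t1 = time.perf_counter()
    for w in words:
        d[w]
    t2 = time.perf_counter()

    return {
        "keys": len(d),
        "inserts_per_s": n_words / max(t1 - t0, 1e-9),
        "lookups_per_s": n_words / max(t2 - t1, 1e-9),
        # sys.getsizeof counts only the dict's own table, not the key
        # and value objects it points to.
        "table_bytes_per_key": sys.getsizeof(d) / len(d),
    }

if __name__ == "__main__":
    for n in (10000, 20000, 40000, 80000):
        r = bench_dict(n)
        print("%8d words (%8d keys), %d inserts/s, %d lookups/s, "
              "%d bytes/key" % (n, r["keys"], r["inserts_per_s"],
                                r["lookups_per_s"],
                                r["table_bytes_per_key"]))
```

Running the same script under a 32-bit and a 64-bit interpreter on the
same machine gives a rough comparison; timings this short are noisy, so
repeating each size several times and keeping the best run is advisable.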

However, it's not obvious to me that a program like "Django or some
other webserver" would have really bad cache locality. Even if the
total working set is larger than the CPU cache, there can still be
quite a good cache efficiency if a large fraction of CPU time is spent
on small datasets.

By the way, I've been experimenting with denser dicts and with
linear probing (in the hope that it will improve cache efficiency and
spatial locality in real applications), and there don't seem to be any
adverse consequences on micro-benchmarks. Do you think I should upload
a patch?
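
To illustrate why linear probing might help spatial locality, here is a
toy sketch of two probe sequences over a power-of-two table. It is not
the patch mentioned above. The function names are made up, the second
generator only imitates the style of recurrence used by CPython's dict
(the exact details vary between versions), and hash values are assumed
to be non-negative.

```python
def linear_probe(hash_value, mask):
    # Visit slots h, h+1, h+2, ... wrapping around the table. Successive
    # probes are adjacent in memory, so they tend to share cache lines.
    i = hash_value & mask
    for _ in range(mask + 1):
        yield i
        i = (i + 1) & mask

def perturbed_probe(hash_value, mask, shift=5):
    # Recurrence in the style of CPython's dict: the next slot depends on
    # the current slot and on the remaining high bits of the hash, so
    # successive probes can land far apart in the table.
    # Assumes hash_value >= 0 so that the right shift reaches zero.
    perturb = hash_value
    i = hash_value & mask
    for _ in range(mask + 1):
        yield i
        i = (5 * i + perturb + 1) & mask
        perturb >>= shift

if __name__ == "__main__":
    mask = 7  # table of 8 slots
    print("linear:   ", list(linear_probe(6, mask))[:4])
    print("perturbed:", list(perturbed_probe(6, mask))[:4])
```

The trade-off is that linear probing is more prone to clustering when
many keys hash to nearby slots, which is why its effect needs to be
measured on real workloads and not only on micro-benchmarks.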

Regards

Antoine.