a huge shared read-only data in parallel accesses -- How? multithreading? multiprocessing?

Klauss klaussfreire at gmail.com
Mon Dec 14 15:12:11 EST 2009


On Dec 11, 11:00 am, Antoine Pitrou <solip... at pitrou.net> wrote:
> I was going to suggest memcached but it probably serializes non-atomic
> types.
It serializes atomic types as well.
memcached communicates through sockets[3] (albeit possibly unix sockets,
which are faster than TCP ones), so every access still goes through
serialization plus a socket round-trip.
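
To make that concrete, here's a rough sketch of what going through
memcached looks like - it assumes the python-memcached client and a
memcached instance on 127.0.0.1:11211, neither of which is part of my
setup here. The point is that every get() hands the caller its own
deserialized copy, so nothing is really shared:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

# Storing a structured value pickles it; there's no in-place sharing.
mc.set('row-42', {'state': 3, '42': '420'})

# Every worker that does this gets a private, freshly-unpickled copy,
# paying the socket round-trip on each access.
row = mc.get('row-42')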

multiprocessing has shared memory schemes, but it does a lot of internal
copying (it uses ctypes)... and it's particularly unhelpful when your
shared data is highly structured, since you can't share objects, only
primitive types.
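
For the record, this is roughly all those shared memory schemes give
you - a flat buffer of C primitives, not Python objects (an
illustrative sketch only, not part of the benchmark below):

from multiprocessing import Process, Value, Array

def worker(total, shared):
    # Children see the same underlying C buffer (no copy of the data
    # itself), but anything fancier than plain ints/floats won't fit in it.
    total.value = sum(shared[i] for i in range(len(shared)))

if __name__ == '__main__':
    shared = Array('i', range(10))   # flat array of C ints in shared memory
    total = Value('l', 0)            # a single shared C long
    p = Process(target=worker, args=(total, shared))
    p.start()
    p.join()
    print total.value                # prints 45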


I finished a patch that pushes reference counters into packed pools. It
has lots of drawbacks, but it manages to solve this particular problem,
as long as the data is predominantly non-numeric (ie: lists and dicts,
as mentioned before). Of the drawbacks, perhaps the biggest is a larger
memory footprint - yep... I don't believe there's anything that can be
done to change that. It can be optimized to make the overhead a little
smaller, though.
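
To see why reference counts are the culprit in the first place, here's
a Linux-only sketch (it assumes /proc is available, and the sizes are
only illustrative) showing that a forked child which merely *reads* a
big structure still unshares every page it touches, because each access
writes ob_refcnt in place and so defeats copy-on-write:

import os

def private_dirty_kb():
    # KB of pages that are private (ie: no longer COW-shared with the
    # parent), summed from /proc/self/smaps (Linux layout assumed).
    total = 0
    with open("/proc/self/smaps") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                total += int(line.split()[1])
    return total

if __name__ == '__main__':
    data = [str(i) for i in xrange(2000000)]   # lots of small objects
    pid = os.fork()
    if pid == 0:
        before = private_dirty_kb()
        total = sum(len(x) for x in data)      # read-only traversal...
        after = private_dirty_kb()             # ...yet the pages got unshared
        print "child dirtied %.1fM just by reading" % ((after - before) / 1024.0)
        os._exit(0)
    os.waitpid(pid, 0)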

This test code[1] consumes roughly 2G of RAM on an x86_64 with python
2.6.1. With the patch, it *should* use 2.3G of RAM (as reported by its
output), so you can see the footprint overhead... but better page
sharing makes it consume about 6 times less - roughly 400M - which is
the size of the dataset. Ie: near-optimal data sharing.

This patch[2] has other optimizations intermingled - if there's
interest in the patch without those (which are both unproven and
nonportable), I could try to separate them. I will have to anyway, in
order to submit it for inclusion into CPython (if I manage to fix the
shortcomings, and if it gets approved).

The most important shortcomings of the refcount patch are:
 1) Tripled memory overhead of reference counting. Before, it was a
single Py_ssize_t per object; now it's two pointers plus the
Py_ssize_t. This could perhaps be optimized (by getting rid of the
arena pointer, for instance).
 2) Increased code size for Py_INCREF/DECREF. Each expansion is small,
but it adds up to a lot. Timings on test_decimal.py (a small numeric
benchmark I use, which might not be representative at all) show a 10%
performance loss in CPU time. Again, this might be optimized with a lot
of work and creativity.
 3) Breaks binary compatibility, and in weird cases source
compatibility, with extension modules. The PyObject layout is
different, so statically-initialized variables need to stick to
CPython's macros (I've seen cases where they don't), and code should
use Py_REFCNT() to access the refcount, but many extensions just do
ob->ob_refcnt, which will break with the patch.
 4) I'm also not really sure (I haven't tested) what happens when
CPython runs out of memory - I tried real hard not to segfault, and
even to recover nicely, but you know how hard that is...

[3] http://code.google.com/p/memcached/wiki/FAQ#How_does_it_compare_to_a_server_local_cache?_(PHP%27s_APC,_mm
[2] http://www.deeplayer.com/claudio/misc/Python-2.6.1-refcount.patch
[1] test code below

import time
from multiprocessing import Pool

def usoMemoria():
    # Sums the resident set size (RSS, in KB) of this process and its
    # pool workers, as reported by ps.
    import os
    import subprocess
    pid = os.getpid()
    cmd = "ps -o vsz=,rss=,share= -p %s --ppid %s" % (pid, pid)
    p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    info = p.stdout.readlines()
    s = sum( int(r) for v,r,s in map(str.split, map(str.strip, info)) )
    return s

def f(_):
    # my sophisticated formula goes here
    return sum(int(x) for d in huge_global_data for x in d if x != "state")

if __name__ == '__main__':
    huge_global_data = []
    for i in xrange(500000):
        d = {}
        d[str(i)] = str(i*10)
        d[str(i+1)] = str(i)
        d["state"] = 3
        huge_global_data.append(d)

    p = Pool(7)
    res = list(p.map(f, xrange(20)))

    print "%.2fM" % (usoMemoria() / 1024.0)



