an intern-like memory saver

sjmachin at lexicon.net
Thu Apr 13 18:46:13 EDT 2000


Problem:
I have an application that works with words and can have millions of them
in memory at one time. Apart from the main data structure, a dictionary is
used to maintain frequencies. As the words are loaded from files, multiple
instances of the same word don't share memory. The memory savings from
sharing could be huge: consider how frequent "the" is in English text, or
"Smith" in an Anglo telephone directory.
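
To illustrate the problem, here is a minimal sketch in present-day Python.
The helper name count_objects_and_values is made up for this example. It
compares how many distinct string objects a word list holds with how many
distinct word values it holds; the gap between the two is the duplication
that sharing would remove.

```python
def count_objects_and_values(words):
    """Return (distinct string objects, distinct word values) for a list."""
    distinct_objects = len({id(w) for w in words})  # identity: one per object
    distinct_values = len(set(words))               # equality: one per word
    return distinct_objects, distinct_values


# Words split out of input text are normally created as separate objects,
# even when they spell the same word (a CPython implementation detail).
words = "the cat saw the dog chase the other cat".split()
objects, values = count_objects_and_values(words)
print(objects, "objects for", values, "distinct words")
```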

Trial solution:
(1) Clone dictobject.c. Make it into an extension module for a type called
"mydict". Add a method called "key_ref" with one argument:
adict.key_ref(obj). If adict.has_key(obj) is true, this returns a reference to
the key object already stored inside the dictionary; otherwise it returns
"obj" itself.
(2) Make simple changes to the application:
(a) Change
   freq_dict = {}
to
   freq_dict = mydict.mydict()
(b) Assuming for purposes of exposition that words are stored simply in a
list, after
   freq_dict[w] = freq_dict.get(w, 0) + 1
change
   word_list.append(w)
to
   word_list.append(freq_dict.key_ref(w))
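
The steps above can be approximated in pure Python. This is only a sketch
of the idea, not the C extension itself: the class name KeyRefDict, the
_canonical attribute and the load_words helper are all invented for the
example, and it uses present-day Python syntax. Unlike the C version, which
can hand back the key already held in the dictionary's own table, this
stand-in has to keep a second dictionary mapping each key to itself, so it
saves less memory.

```python
class KeyRefDict(dict):
    """Pure-Python stand-in for the "mydict" type (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        # Maps each key to the first object that was stored under it.
        self._canonical = {}

    def __setitem__(self, key, value):
        # Remember the first object seen for this key; later equal keys
        # do not replace it. (Deletion and update() are not handled here.)
        self._canonical.setdefault(key, key)
        super().__setitem__(key, value)

    def key_ref(self, obj):
        """Return the stored key equal to obj, or obj itself if absent."""
        return self._canonical.get(obj, obj)


def load_words(words):
    """Count frequencies and build a word list that shares key objects."""
    freq_dict = KeyRefDict()
    word_list = []
    for w in words:
        freq_dict[w] = freq_dict.get(w, 0) + 1
        word_list.append(freq_dict.key_ref(w))
    return freq_dict, word_list
```

After load_words runs, every occurrence of a word in word_list refers to
the same string object as the dictionary key, so the duplicates read from
the file can be garbage-collected.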

Results:
Gratifying. An exercise that was running out of real memory (384 MB) and
taking a day now takes an hour or so.

Questions:
(1) Would this be sufficiently generally useful to make it a method of the
standard dictionary object in Python?
(2) As methods seem to be found by sequential search, wouldn't it be a
good idea to move "get" a bit higher up the method_def list in
dictobject.c? Leaving it at the end, after "update" and "copy", doesn't
seem like a good idea.
(3) Has anyone had any success in compiling Python on WinNT 4.0 with
gcc 2.95.2? It's just fine for making extension modules; I haven't tried
compiling the whole Python yet.
(4) Has anyone any better ideas for gauging Python memory usage than
sitting watching the graphical display in WinNT's Task Manager? Might an
instrumented or instrumentable malloc/free package (like Doug Lea's) that
permitted implementation of a Python builtin memused() be the way to go,
or is there a policy of using the standard malloc from the C library on each
platform?
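
On question (4), one option that exists in current Python versions (and
postdates this message) is the standard library's tracemalloc module,
which tracks memory blocks allocated by Python. The sketch below is only
an illustration: measure_peak and build_words are hypothetical names, and
the figures it reports cover allocations traced by Python, not the whole
process as shown in Task Manager.

```python
import tracemalloc


def measure_peak(func, *args):
    """Run func(*args) and return (result, current_bytes, peak_bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        # Both figures count only memory allocated since start().
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def build_words(n):
    """Hypothetical workload: create n distinct word strings."""
    return ["word%d" % i for i in range(n)]


words, current, peak = measure_peak(build_words, 100000)
print("current bytes:", current, "peak bytes:", peak)
```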

