Efficient posting-list

Terry Reedy tjreedy at udel.edu
Mon Jun 3 09:37:03 EDT 2002


"Shagshag13" <shagshag13 at yahoo.fr> wrote in message
news:adffdk$10mep9$1 at ID-146704.news.dfncis.de...
> A posting list is something like that (hope i'm not too wrong) :
>
> [key_0] -> ...
> ...
> [key_i] -> [id_j, informations] -> [id_k, informations] -> ...
> ...
> [key_n] -> ...
>
> where in information retrieval :
> - key are often words,
> - id are document id (so we have key_i in document id_j and in
document id_k and ...)
> - informations are often in document frequency, but might be more...
>
> I'm looking for an efficient way of implementing this in full
python, by now i use a dict
> for keys and a python list containing node objects for [id_j,
informations].
> But i have two troubles this is very slow to populate, and too
memory consuming do you
> have an idea for optimizing this ?
>
> (keep in mind that : there are more than 500,000 text keys ; id are
numbered from 1 to
> 150,000 ; length of a posting list might be from 1 to 500,000...)

For English text, one would almost certainly reduce the number of text
keys by deleting some suffixes, so that 'statistic', 'statistics',
'statistical', 'statistically', and maybe 'statistician' would be one
key.  But I have no idea of word structure of non-Indo-European
languages and what one should do to reduce key number.

By 'length of a posting list' do you mean number of text keys?  Ie,
size of key dict?  The length of list attached to each key is at most
number of documents.

You do not specify 'node object'.  This is crucial since these take up
most of memory.  (This a good reason not to index the 2000 or so most
common keys.)  For pure Python, tuple is most space efficient.  If
each node really is (integer id, integer count), one could devise a
system using a pair of array (module) objects or 2-dimensional
numerical python arrays; which store actual integers rather than
pointers to int PyObjects.  When filled, they would have to be copied
to larger arrays in much the same manner as done automatically by
Python lists.

Terry J. Reedy






More information about the Python-list mailing list