Efficient posting-list

Mon Jun 3 10:23:42 EDT 2002

> For English text, one would almost certainly reduce the number of text
> keys by deleting some suffixes, so that 'statistic', 'statistics',
> 'statistical', 'statistically', and maybe 'statistician' would be one
> key.  But I have no idea of word structure of non-Indo-European
> languages and what one should do to reduce key number.

I already do stemming via a POS tagger...

> By 'length of a posting list' do you mean number of text keys?  Ie,
> size of key dict?  The length of list attached to each key is at most
> number of documents.

You're right... the key dict will have about 500,000 - 600,000 entries. I
currently have 140,000 documents, but I'll reach 400,000 soon...

> You do not specify 'node object'.

For now I use a node with 2 ints and 1 float. Node objects are kept
in a Python list.
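
A minimal sketch of that layout, assuming the two ints are a document id
and a count and the float is some per-posting weight. The names Node,
doc_id, count, weight and postings are hypothetical, chosen only for
illustration:

```python
class Node:
    """One posting: two ints and one float (attribute names are hypothetical)."""

    def __init__(self, doc_id, count, weight):
        self.doc_id = doc_id   # int: document identifier
        self.count = count     # int: occurrences of the key in that document
        self.weight = weight   # float: per-posting value


# One plain Python list of Node objects per key.
postings = {}
postings.setdefault("statist", []).append(Node(17, 3, 0.25))

# Nodes are mutable, so values can be updated in place.
postings["statist"][0].count += 1
```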

> This is crucial since these take up
> most of memory.  (This is a good reason not to index the 2000 or so most
> common keys.)

I already index only the useful parts and can't *shrink* it any further...

> For pure Python, tuple is most space efficient.  If
> each node really is (integer id, integer count), one could devise a
> system using a pair of array (module) objects or 2-dimensional
> numerical python arrays; which store actual integers rather than
> pointers to int PyObjects.  When filled, they would have to be copied
> to larger arrays in much the same manner as done automatically by
> Python lists.

So you think I should use only one list (I can't use a tuple, since I
need to update values) containing:

[id_x, count_x, float_x, id_(x+1), count_(x+1), float_(x+1), ...]

and so on?
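
For comparison, here is a minimal sketch of how the "array (module)
objects" suggestion could look, assuming three parallel arrays per key
instead of one flat list. The class name PostingList and its methods are
made up for illustration. Note that array objects support append(), so
growth is handled much as it is for lists:

```python
from array import array


class PostingList:
    """Postings for one key, kept in three parallel typed arrays.

    The arrays store raw C values rather than pointers to Python objects.
    """

    def __init__(self):
        self.ids = array("l")      # document ids (C long)
        self.counts = array("l")   # occurrence counts (C long)
        self.weights = array("f")  # per-posting float (C float, single precision)

    def append(self, doc_id, count, weight):
        # Keep the three arrays the same length so index i is one posting.
        self.ids.append(doc_id)
        self.counts.append(count)
        self.weights.append(weight)

    def __len__(self):
        return len(self.ids)

    def get(self, i):
        return self.ids[i], self.counts[i], self.weights[i]

    def set_count(self, i, count):
        # Arrays are mutable, so values can still be updated in place.
        self.counts[i] = count


plist = PostingList()
plist.append(17, 3, 0.5)
plist.append(42, 1, 0.25)
plist.set_count(0, 4)
```

One caveat with this sketch: the "f" typecode stores single-precision
floats, so "d" would be needed if full double precision matters.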

thanks,

s13.