Efficient posting-list
Shagshag13
shagshag13 at yahoo.fr
Mon Jun 3 10:23:42 EDT 2002
> For English text, one would almost certainly reduce the number of text
> keys by deleting some suffixes, so that 'statistic', 'statistics',
> 'statistical', 'statistically', and maybe 'statistician' would be one
> key. But I have no idea of the word structure of non-Indo-European
> languages or what one should do to reduce the number of keys.
I already do stemming via a POS tagger...
> By 'length of a posting list' do you mean the number of text keys? I.e.,
> the size of the key dict? The length of the list attached to each key is
> at most the number of documents.
You're right... the key dict will hold about 500,000 - 600,000 keys. I
currently have 140,000 documents, but I'll reach 400,000 soon...
> You do not specify 'node object'.
For now I use a node with two ints and one float. Node objects are kept
in a Python list.
> This is crucial since these take up
> most of memory. (This is a good reason not to index the 2000 or so most
> common keys.)
I already index only the useful parts and can't *shrink* it any further...
> For pure Python, tuple is most space efficient. If
> each node really is (integer id, integer count), one could devise a
> system using a pair of array (module) objects or 2-dimensional
> Numerical Python arrays, which store actual integers rather than
> pointers to int PyObjects. When filled, they would have to be copied
> to larger arrays in much the same manner as done automatically by
> Python lists.
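
A minimal sketch of the array-based layout suggested in the quoted text,
extended with a third array for the float that my nodes carry. The class
name PostingList and its method names are hypothetical, chosen only for
illustration. It assumes the standard array module, whose objects accept
append(), so no manual copying to a larger array is shown here.

```python
from array import array


class PostingList:
    """Hypothetical posting list kept as three parallel typed arrays.

    Each array stores raw machine values instead of pointers to
    separate int/float objects.
    """

    def __init__(self):
        self.doc_ids = array('l')   # signed long: document ids
        self.counts = array('l')    # signed long: per-document counts
        self.weights = array('d')   # double: per-document float value

    def __len__(self):
        return len(self.doc_ids)

    def append(self, doc_id, count, weight):
        # The three arrays always grow together, one slot per node.
        self.doc_ids.append(doc_id)
        self.counts.append(count)
        self.weights.append(weight)

    def update(self, i, count, weight):
        # Arrays are mutable, so a node can be updated in place.
        self.counts[i] = count
        self.weights[i] = weight

    def node(self, i):
        # Rebuild a (id, count, weight) view of node i on demand.
        return (self.doc_ids[i], self.counts[i], self.weights[i])
```

Node i is simply position i in each of the three arrays, so no per-node
object has to exist.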
So you think I should use only one list (I can't use a tuple, since I
need to update values) containing:
[id_x, count_x, float_x, id_x+1, count_x+1, float_x+1, ...]
and so on?
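
To make the question concrete, here is a rough sketch of that single flat
list, with every node occupying three consecutive slots. The helper names
(add_node, get_node, set_count, node_count) and the STRIDE constant are
made up for this example. Note that a plain list still holds references
to int and float objects, so this layout only avoids the per-node object,
not the per-value objects.

```python
STRIDE = 3  # slots per node: id, count, float


def add_node(flat, doc_id, count, weight):
    # Append one node as three consecutive entries.
    flat.extend((doc_id, count, weight))


def node_count(flat):
    # Number of complete nodes stored in the flat list.
    return len(flat) // STRIDE


def get_node(flat, i):
    # Slice out the three slots belonging to node i.
    base = i * STRIDE
    return tuple(flat[base:base + STRIDE])


def set_count(flat, i, count):
    # The count sits at offset 1 inside each node; update it in place.
    flat[i * STRIDE + 1] = count
```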
thanks,
s13.