little question

Fri May 24 13:43:30 EDT 2002

shagshag13 at yahoo.fr (Shagshag) writes:
> As i'm still a newbie in python, comments are welcome.
> ...
> 	documents = {}
> 
> 	documents[0] = 'pease porridge hot pease porridge cold'
> 	documents[1] = 'pease porridge in the pot'
> 	documents[2] = 'nine days old'
> 	documents[3] = 'some like it hot some like it cold'
> 	documents[4] = 'some like it in the pot'
> 	documents[5] = 'nine days old'

I'd write this whole stretch here as 
    documents = ['pease porridge hot pease porridge cold',
                 'pease porridge in the pot',
                 'nine days old',
                 'some like it hot some like it cold',
                 'some like it in the pot',
                 'nine days old']

This turns documents from a dict into a list.

> 	invertedIndex = InvertedIndex()
> 
> 	for i in range(len(documents)):

And I wouldn't do this with a dict; I'd say documents.keys() instead
of range(len(documents)).  But turning it into a list is better.

> 		terms = documents[i].split()
> 		added = []

I think 'added' should be a dict, at which point you can use not
added.has_key(t) instead of t not in added (although t not in added
will still work in recent Pythons) and added[t] = 1 instead of
added.append(t).  For your sample documents, it doesn't matter, but
for documents with more than a few hundred words, you'll spend all
your time scanning the 'added' list otherwise.

> 		for t in terms:
> 			node = PostingListNode(i, terms.count(t))

Similarly, terms.count() has to scan terms from beginning to end; I
would instead use a dict to count the number of times each term is
found in 'terms' and then actually add things to the invertedIndex
after the end of this loop, in another loop over the contents of that
dict.

> class PostingListNode:
> 	def __init__(self, documentID, information = None):

You only instantiate PostingListNodes in one place, and that place
always supplies 'information', so you don't need the default
parameter.

> 	def __str__(self):
> 		s = '(' + str(self._documentID) + ', ' + str(self._information) + ')'
> 		return s

How about 'return str((self.getDocumentID(), self.getInformation()))'?

Frankly, though, I'd be inclined to just use (documentID, information)
tuples instad of defining a class.

> 	def add(self, element, node):
> 		try:
> 			i = self._hash[element]
> 			self._container[i].append(node)
> 			print "Adding word in node list=[%s]" % (element)
> 		except KeyError:
> 			print "Adding new word=[%s] i=[%s]" % (element, self._n)
> 			self._hash[element] = self._n
> 			self._list.append(element)
> 			self._container.append([node])
> 			self._n = self._n + 1

Why don't you just store the lists of nodes that are in _container
directly in the hash instead?  It would shorten add() considerably,
make get_nodes one line long, and __str__ could just return something
like str(self._hash).

> 	def get_nodes(self, element):
> 		try:
> 			i = self._hash[element]
> 			return self._container[i]
> 		except KeyError:
> 			return None

Your caller here is going to be expecting a list in the normal case.
Why is it better to return None instead of raising an exception in the
abnormal case?  I can see that it might be better to return [], but
None?