[Tutor] Inverted Index Algorithm

Sat Apr 1 09:42:57 CEST 2006

> The next step would be to introduce an index.  I think again, the
> simplest thing that could possibly work would be a literal index of
> every word and every document in which it appears.  This would save
> processing time, but wouldn't be very intelligent.

Yes, that's right, that's the idea of an inverted index.

> This is where I think the inverted indexing comes in.  As I understand
> it we can now produce an index of key words, with document name and
> location in document for each key word. This makes the search more
> involved, and more intelligent.  Finally we could have some logic that
> did some set analysis to return only results that make sense.

Location information would help allow you to do things like phrase or
proximity matching.  Another thing that might help is term frequency (tf).

You might want to check out documentation about Lucene:

    http://lucene.apache.org/java/docs/index.html

as they're the premier open source search library.  They have a
presentation that gives a good overview of the techniques used in a fast
search engine:

    http://lucene.sourceforge.net/talks/inktomi/

If you want to reuse their engine, the OSAF folks have even written Python
bindings to the library:

    http://pylucene.osafoundation.org/