Mass Text Indexing Tools

jerry_spicklemire at my-deja.com
Tue Oct 17 12:10:50 EDT 2000


In article <qrUG5.8500$Qf5.153919 at newsread1.prod.itd.earthlink.net>,
  "Ender" <kthangavelu at earthlink.net> wrote:
> Does anyone know of some good mass text indexing/searching tools
> (preferably open source) that are accessible from Python? I've tried
> using popen2 calls to grep, but it starts to flag around 50 MB. The text
> material consists of around a hundred thousand small files (emails).
>

Check out:

	http://ransacker.sourceforge.net/

"Ransacker is a scriptable, incrementally-double-indexed search engine
written in python.

It's scriptable in that you can index any text with any key. This makes
it easy to index content ("pages") stored in databases, file systems,
the web, etc.

It can index incrementally. This means you can add content or update
the entry for a particular page without touching the rest of the index.

It's double-indexed, meaning that not only does every word have a list
of pages, every page has a list of words. This is used for the
incremental indexer, but also allows you to determine which pages have
the most in common. This will allow ransacker to produce "what's
related" pages."
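
To make the description above concrete, here is a minimal sketch of a
double index in plain Python. It is not Ransacker's code or API: the
class name DoubleIndex and its methods add, search, and related are
hypothetical, and the tokenizer (lowercased runs of ASCII letters and
digits) is an assumption. It only illustrates how keeping both a
word-to-pages map and a page-to-words map allows one page to be
re-indexed without touching the rest, and allows pages to be ranked by
how many words they share.

```python
import re
from collections import defaultdict


class DoubleIndex:
    """Toy double index (hypothetical names, not Ransacker's API)."""

    def __init__(self):
        self.word_to_keys = defaultdict(set)  # word -> keys of pages containing it
        self.key_to_words = {}                # key  -> set of words on that page

    def add(self, key, text):
        """Index text under key; calling again with the same key updates it."""
        # Assumed tokenizer: lowercase runs of ASCII letters and digits.
        new_words = set(re.findall(r"[a-z0-9]+", text.lower()))
        old_words = self.key_to_words.get(key, set())
        # The page's own word list says exactly which entries to undo,
        # so no other page's entries are scanned or rewritten.
        for word in old_words - new_words:
            keys = self.word_to_keys[word]
            keys.discard(key)
            if not keys:
                del self.word_to_keys[word]
        for word in new_words - old_words:
            self.word_to_keys[word].add(key)
        self.key_to_words[key] = new_words

    def search(self, word):
        """Return the set of keys whose text contains word."""
        return set(self.word_to_keys.get(word.lower(), set()))

    def related(self, key):
        """Return other keys sharing at least one word, most shared first."""
        words = self.key_to_words.get(key, set())
        scored = []
        for other, other_words in self.key_to_words.items():
            shared = len(words & other_words)
            if other != key and shared:
                scored.append((shared, other))
        scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
        return [other for _, other in scored]


if __name__ == "__main__":
    index = DoubleIndex()
    index.add("msg1", "Python search engine")
    index.add("msg2", "Python mailing list search")
    index.add("msg1", "Python indexing engine")  # incremental update of msg1
    print(index.search("search"))
    print(index.related("msg1"))
```

The keys can be anything hashable (file names, database row ids, URLs),
which is the sense in which such an index is "scriptable". A real
engine would also persist the two maps to disk instead of holding them
in memory.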




