indexing and searching pdf files

Fernando Pérez fperez528 at yahoo.com
Thu Sep 26 20:05:00 EDT 2002


Rajarshi Guha wrote:

> My question - is there already something like this with python?
> Another question which is slightly off topic is, does anybody know of any
> articles/pages that talk about indexing text files efficiently - index
> generation algorithms etc?

It doesn't directly discuss algorithms, but the following page is an 
_excellent_ overview of the problem:

http://freshmeat.net/articles/view/286/

Please post here any findings you may make on the python/indexing front. This 
is a problem I expect to have to deal with in a few months, so having it 
solved for me ahead of time would be most pleasant :) 

Full-text indexing of PostScript would also be nice, with provisions for 
automatic indexing of gzipped and bz2 compressed files. While we are at it, 
it would be nice to index code in at least C, C++, Fortran, Mathematica and 
Python and build tables of classes/functions defined in each file for the 
index database.
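
To make the compressed-file and code-table parts concrete, here is a minimal sketch of what I have in mind. It is an assumption-laden illustration, not an existing package: the names open_text, build_index and python_definitions are made up, only plain text is handled (no PostScript or PDF extraction), and the class/function table covers Python source only, using the standard ast module.

```python
import ast
import bz2
import gzip
import re
from collections import defaultdict

# Hypothetical helper names; none of this comes from an existing indexing tool.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path):
    """Open a plain, gzipped, or bz2-compressed file in text mode."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def build_index(paths):
    """Map each lowercased word to the set of files that contain it."""
    index = defaultdict(set)
    for path in paths:
        with open_text(path) as handle:
            for line in handle:
                for word in re.findall(r"\w+", line.lower()):
                    index[word].add(path)
    return index


def python_definitions(source):
    """Return sorted (kind, name) pairs for classes and functions in Python source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            found.append(("class", node.name))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append(("function", node.name))
    return sorted(found)
```

The index here is a simple in-memory inverted index (word to set of files). A real tool would need persistent storage, and separate parsers for C, C++, Fortran and Mathematica.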

Mmmhhh, what else? Ah, if a .ps/.pdf has an associated .lyx/.tex file, that 
should be indexed instead, with the abstract, author, keywords, etc. fields 
properly recognized.
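
A rough sketch of that preference rule, assuming the source file sits next to the .ps/.pdf with the same base name. The function names are hypothetical, the regular expressions only cope with simple \title{...} and \author{...} arguments without nested braces, and keywords are left out because plain LaTeX has no standard command for them.

```python
import os
import re

# Deliberately naive patterns: no nested braces, no comments, first match only.
FIELD_PATTERNS = {
    "title": re.compile(r"\\title\{([^{}]*)\}"),
    "author": re.compile(r"\\author\{([^{}]*)\}"),
    "abstract": re.compile(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", re.DOTALL),
}


def preferred_source(path):
    """Return a sibling .tex or .lyx file for a .ps/.pdf if one exists, else path."""
    base, ext = os.path.splitext(path)
    if ext.lower() in (".ps", ".pdf"):
        for candidate in (base + ".tex", base + ".lyx"):
            if os.path.exists(candidate):
                return candidate
    return path


def tex_fields(source):
    """Extract title, author and abstract from LaTeX source where present."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(source)
        if match:
            # Collapse runs of whitespace so multi-line abstracts become one line.
            fields[name] = " ".join(match.group(1).split())
    return fields
```

LyX files use their own format, so they would need a separate field extractor; this only shows the .tex case.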

A search interface with a basic webserver would be enough. 
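
Something along these lines would do, as a sketch only. It uses Python's standard http.server module; INDEX is made-up sample data standing in for a real index, and search, SearchHandler and serve are hypothetical names. A request such as /?q=python+index returns the matching file names as plain text.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Made-up sample data: word -> set of files containing it.
INDEX = {"python": {"notes.txt"}, "index": {"notes.txt", "paper.tex"}}


def search(index, query):
    """Return the sorted files that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(set(index.get(word, ())) for word in words))
    return sorted(hits)


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the "q" parameter from the URL, defaulting to an empty query.
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        body = "\n".join(search(INDEX, query)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve search results on localhost until interrupted."""
    HTTPServer(("localhost", port), SearchHandler).serve_forever()
```

Calling serve() starts the server; nothing runs at import time.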

That would be a start. 

Not asking too much, am I?

Cheers,

f.


