indexing and searching pdf files
Fernando Pérez
fperez528 at yahoo.com
Thu Sep 26 20:05:00 EDT 2002
Rajarshi Guha wrote:
> My question - is there already something like this with python?
> Another question which is slightly off topic is, does anybody know of any
> articles/pages that talk about indexing text files efficiently - index
> generation algorithms etc?
It doesn't directly discuss algorithms, but the following page is an
_excellent_ overview of the problem:
http://freshmeat.net/articles/view/286/
Please post here any findings you may make on the Python/indexing front. This
is a problem I expect to have to deal with in a few months, so having it
solved for me ahead of time would be most pleasant :)
Full-text indexing of PostScript would also be nice, with provisions for
automatic indexing of gzipped and bz2 compressed files. While we are at it,
it would be nice to index code in at least C, C++, Fortran, Mathematica and
Python and build tables of classes/functions defined in each file for the
index database.
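To make the wish list a bit more concrete, here is a rough sketch of the two easiest pieces: an inverted index (word -> set of files) that reads plain, gzipped and bz2 files transparently, and a table of the classes/functions defined in a Python source file. All the names here (read_text, build_index, search, python_symbols) are made up for illustration, the tokenizer is a naive regex, and PostScript/PDF text extraction and the other languages are not handled at all.

```python
import ast
import bz2
import gzip
import re
from collections import defaultdict
from pathlib import Path

# Map a filename suffix to the function that opens it; anything else is
# treated as plain text. (Assumption: compression is signalled by suffix.)
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}
WORD = re.compile(r"[A-Za-z0-9_]+")


def read_text(path):
    """Return the decoded contents of a plain, gzipped or bz2 file."""
    opener = OPENERS.get(Path(path).suffix, open)
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def build_index(paths):
    """Build an inverted index: lowercase word -> set of file names."""
    index = defaultdict(set)
    for path in paths:
        for word in WORD.findall(read_text(path)):
            index[word.lower()].add(str(path))
    return index


def search(index, query):
    """Return the files that contain every word of the query (AND search)."""
    words = [w.lower() for w in WORD.findall(query)]
    if not words:
        return set()
    return set.intersection(*[index.get(w, set()) for w in words])


def python_symbols(source):
    """List (kind, name, line) for each class/function in Python source."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
    return sorted(symbols, key=lambda item: item[2])
```

Something like search(build_index(paths), "postscript index") would then give the set of files containing both words. A real indexer would want stemming, stop words and an on-disk database rather than a dict in memory.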
Mmmhhh, what else? Ah, if a .ps/.pdf has an associated .lyx/.tex file, that
should be indexed instead, with the abstract, author, keywords, etc. fields
properly recognized.
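A minimal sketch of that idea, under some loud assumptions: the source file sits next to the .ps/.pdf with the same base name, .tex is tried before .lyx (an arbitrary choice), and the fields are plain \title{...}, \author{...}, \keywords{...} commands without nested braces plus an abstract environment. Note that \keywords is not standard LaTeX, only some document classes define it, and LyX's own file format is not parsed here. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: fields look like \title{...} with no nested braces inside.
FIELD = re.compile(r"\\(title|author|keywords)\s*\{([^{}]*)\}")
ABSTRACT = re.compile(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", re.DOTALL)


def preferred_source(document):
    """Return a sibling .tex or .lyx file if one exists, else the document."""
    document = Path(document)
    for suffix in (".tex", ".lyx"):
        candidate = document.with_suffix(suffix)
        if candidate.exists():
            return candidate
    return document


def latex_fields(text):
    """Extract title/author/keywords/abstract from LaTeX source text."""
    fields = {name: " ".join(value.split())
              for name, value in FIELD.findall(text)}
    match = ABSTRACT.search(text)
    if match:
        # Collapse runs of whitespace so the abstract is a single line.
        fields["abstract"] = " ".join(match.group(1).split())
    return fields
```

The indexer would call preferred_source() on each .ps/.pdf it finds and, when a .tex file comes back, store the result of latex_fields() as separate searchable fields.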
A search interface with a basic webserver would be enough.
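For what it's worth, the standard library's http.server is probably enough for this. The sketch below assumes the index is already in memory as a plain dict mapping each lowercase word to a set of file names; render_results, make_handler and serve are hypothetical names, and the "q" query parameter and port 8000 are arbitrary choices.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def render_results(index, query):
    """Return an HTML fragment listing files that contain every query word."""
    words = query.lower().split()
    if words:
        hits = set.intersection(*[index.get(w, set()) for w in words])
    else:
        hits = set()
    items = "".join(f"<li>{html.escape(name)}</li>" for name in sorted(hits))
    return f"<h1>Results for {html.escape(query)}</h1><ul>{items}</ul>"


def make_handler(index):
    """Build a request handler class bound to the given index."""
    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Read the search terms from ?q=... in the requested URL.
            query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
            body = render_results(index, query).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return SearchHandler


def serve(index, port=8000):
    """Serve search results on localhost until interrupted."""
    HTTPServer(("localhost", port), make_handler(index)).serve_forever()
```

Calling serve(index) and pointing a browser at http://localhost:8000/?q=pdf+index would list the matching files. There is no ranking, paging or snippet display here.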
That would be a start.
Not asking too much, am I?
Cheers,
f.