[Tutor] Writing an automatic document keyword finder?

Michael Powe michael@trollope.org
Thu Dec 26 01:09:02 2002


On Wed, Dec 25, 2002 at 11:33:05AM -0800, Danny Yoo wrote:

> > sometimes, it almost seems like the indexing procedure consists of
> > somebody reading through the text and arbitrarily deciding which terms
> > on a page or in a section to index.

> Hi Michael,

> Hmmm... maybe it might make a nice programming project to create a
> keyword-finding program.  This would involve trying to automatically
> identify important key phrases that represent the main ideas of a book.

> But what is a key phrase, though?  Perhaps it's one that's used a lot. We
> can cook up a quick histogram function to find the most frequently used
> words:
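
(the histogram code from the quoted message is not reproduced here.  as a
stand-in, here is a minimal sketch of the idea, not danny's original code.
the names word_histogram and most_frequent are made up for illustration,
and "word" is assumed to mean a run of letters or apostrophes.)

```python
import re
from collections import Counter

def word_histogram(text):
    """Count how often each word occurs, ignoring case."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def most_frequent(text, n=10):
    """Return the n most common (word, count) pairs, highest count first."""
    return word_histogram(text).most_common(n)
```

raw counts like these will mostly surface words such as "the" and "and",
so a real keyword finder would need some kind of filtering on top.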

yes, it is an interesting puzzle.  actually, what i was suggesting was
that the method of marking index terms consisted of said reading.  by
this i was referring to the number of times i have found terms indexed
in one part of a book, only to find them used in important ways in
other parts of the book but not indexed there.

the actual parsing of the text to insert indexing markers is already
automated, i believe, for TeX and troff.  i think i read about that
somewhere, i'll have to look it up again.  you can give the indexing
program a list of words and it will properly mark them in the text so
that a program like makeindex will pull them out and create the
indexes when the text is run through the marking-up software.
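
to make the marking step concrete, here is a rough sketch in python.  it is
not the tool i read about, just an illustration: it assumes LaTeX-style
\index{...} markers (the kind makeindex works from), matches whole words
without regard to case, and the function name mark_index_terms is
hypothetical.

```python
import re

def mark_index_terms(text, terms):
    """Append an \\index{term} marker after each whole-word match of a term."""
    if not terms:
        return text
    # Try longer terms first so "search and replace" wins over "search".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    def add_marker(match):
        found = match.group(0)
        # Keep the text as written; index under the lower-cased term.
        return found + "\\index{" + found.lower() + "}"

    return pattern.sub(add_marker, text)
```

for example, mark_index_terms("Search the buffer.", ["search"]) gives
"Search\index{search} the buffer."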

so the trick is, as you say, to determine what terms go into the
list.  after all, you can have multiple levels of indexing, e.g.:

search
 and replace,79
 and replace within a text block,87
 backward for a pattern,44
 combine opening a file with,51
 for general class of words,87
 global (see global replacement)
 ignoring case,85,104,107
 ... (from Learning the Vi Editor)
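
a two-level index like that one could be held as a dictionary of
dictionaries.  the sketch below is only one way to do it, with made-up
function names (build_index, format_index), and it leaves out "see"
cross-references such as the "global" entry.

```python
from collections import defaultdict

def build_index(entries):
    """Group (term, subterm, page) triples into {term: {subterm: set of pages}}."""
    index = defaultdict(lambda: defaultdict(set))
    for term, subterm, page in entries:
        index[term][subterm].add(page)
    return index

def format_index(index):
    """Render the index with each subterm indented one space under its term."""
    lines = []
    for term in sorted(index):
        lines.append(term)
        for subterm in sorted(index[term]):
            pages = ",".join(str(p) for p in sorted(index[term][subterm]))
            lines.append(" " + subterm + "," + pages)
    return "\n".join(lines)

entries = [
    ("search", "ignoring case", 104),
    ("search", "and replace", 79),
    ("search", "ignoring case", 85),
    ("search", "ignoring case", 107),
]
print(format_index(build_index(entries)))
```

the data structure is the easy part; deciding which (term, subterm, page)
triples belong in it is still the hard problem.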

i really have to think that the best indexing is going to come from
the author or someone who knows the subject very well.

mp

-- 
  Michael Powe                                 Portland, Oregon USA
-------------------------------------------------------------------
"The most likely way for the world to be destroyed, most experts
agree, is by accident.  That's where we come in. We're computer
professionals. We cause accidents."
	       -- Nathaniel Borenstein, inventor of MIME
