[Doc-SIG] An 'apropos' utility for documentations

"Fred L. Drake, Jr." <fdrake@acm.org>
Wed, 9 Dec 1998 13:26:24 -0500 (EST)


Manuel Gutierrez Algaba writes:
 > No, and because of a very simple reason. Although Linux apropos
[Really long explanation elided.]
 > structuring are not enough.

Manuel,
  I think I see what you're looking for.  (For context:  I have
studied traditional information retrieval, but not natural language
processing approaches.)
  Let me try to boil down what you've described to a (much) more
concise description, and then follow on with my comments.  If I
misunderstand what you're asking for, please clarify.

My summary of what you explained:
  You are looking for a concept-based search mechanism, preferably
one that can describe what sort of relationship each located item has
to the others ("this is an example of that", etc.).  You see an
advantage in extracting concepts automatically from the content.
From a user interface perspective, it sounds like each "chunk" of
documentation presented should have some sort of entry box or button
that searches for other chunks related to the chunk on that page.

My response:
  I think this would be really nice to have.  As far as I'm aware,
such systems are still largely research projects, with some
applications having reached deployment (you point to good examples).
To do this for the Python documentation (defined as broadly as
needed), what we most need is someone who can donate the time and
know-how.  I don't know enough about the AI aspects or
the natural language processing aspects.  The user interface issues
are also non-trivial (esp. if the interface can be distilled all the
way down to a single button and maybe a text-entry box).  But I'd be
glad to work with someone regarding interpretation of the existing
documentation and any improvements that could be made to make the
processing more effective.
  There are two aspects to this which are related but not tightly
bound:  extraction of "concepts" and use of concepts to locate
interesting information.  Concepts can be extracted from the text
using AI/NLP tools or can be marked explicitly in the documentation
source.  I must admit a bias toward the latter approach, but automated 
techniques may have progressed sufficiently to make them viable.  I do 
not see any reason for the approaches to concept extraction to be
mutually exclusive.  What constitutes a "chunk" needs to be clearly
defined, both for hyper-navigation and for percolating concept
assignments up and down the document structure hierarchy.
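  To make the "percolation" idea concrete, here is a rough Python
sketch (hypothetical names throughout, not an existing tool) in which
chunks form a section tree and a concept assigned to a subsection is
also visible on its ancestors:

    class Chunk:
        def __init__(self, title, parent=None):
            self.title = title
            self.parent = parent
            self.concepts = set()

        def assign(self, concept):
            # Percolate the assignment up the section hierarchy so
            # ancestors are findable by their children's concepts;
            # downward percolation could be handled analogously.
            node = self
            while node is not None:
                node.concepts.add(concept)
                node = node.parent

    lib = Chunk("Library Reference")
    pickle_doc = Chunk("pickle", parent=lib)
    pickle_doc.assign("object persistence")
    print("object persistence" in lib.concepts)   # True
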
  A concept-to-chunk database may need to record which extraction
technique produced each entry (at least the explicit vs. automatic
dichotomy), especially for purposes of ranking or presentation.
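  A minimal sketch of such a database, again with hypothetical names:
the index below maps each concept term to chunk identifiers, remembers
whether the assignment was explicit or automatic, and ranks explicit
assignments first on lookup:

    class ConceptIndex:
        def __init__(self):
            # concept term -> list of (chunk_id, source) pairs,
            # where source is "explicit" or "automatic"
            self._entries = {}

        def add(self, concept, chunk_id, source="explicit"):
            self._entries.setdefault(concept, []).append(
                (chunk_id, source))

        def lookup(self, concept):
            # Explicitly marked assignments sort ahead of automatic
            # ones; fancier ranking could go here.
            hits = self._entries.get(concept, [])
            return sorted(hits, key=lambda hit: hit[1] != "explicit")

    index = ConceptIndex()
    index.add("pickling", "lib/module-pickle", source="explicit")
    index.add("pickling", "tut/node-io", source="automatic")
    print(index.lookup("pickling"))
    # [('lib/module-pickle', 'explicit'), ('tut/node-io', 'automatic')]
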
  I think we can go a long way using techniques based on explicit
markup in the documentation.  The index construction markup is one
example of "meta" information being located in the documents, and
other aspects of the markup are becoming increasingly "logical" rather 
than presentation-based.  There is no reason that two things can't
both happen: 1) additional meta-information can be added to the
documents to encode concept-like information explicitly, and 2)
processing software can infer relationships between chunks from the
existing markup.
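  As an illustration of point 2, a processor could already harvest
concept-like terms from the index construction markup.  A rough
sketch, assuming LaTeX-style \index{...} entries in the current
sources (the single macro handled here is a simplification; the real
markup is richer):

    import re

    INDEX_ENTRY = re.compile(r"\\index\{([^}]*)\}")

    def explicit_concepts(latex_source):
        # Index terms become candidate "explicit" concepts for the
        # chunk the markup appears in.
        return INDEX_ENTRY.findall(latex_source)

    chunk = (r"Objects can be serialized \index{pickling} with the "
             r"pickle module \index{persistence}.")
    print(explicit_concepts(chunk))   # ['pickling', 'persistence']
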
  With the coming conversion of the documentation to SGML, I expect
some information present in the documentation today will become more
explicit, making it somewhat easier to create processing software that 
doesn't have to make as many basic inferences as it has to today.
(Yes, I realize that this doesn't come from SGML itself, but the
conversion is an excellent opportunity to refine the markup in more
useful ways than the existing markup has allowed.)
  I'm quite interested in hearing from people about what information
would be useful if marked explicitly, and how it could be used.


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives
1895 Preston White Dr.	    Reston, VA  20191