[Doc-SIG] Approaches to structuring module documentation

Fred L. Drake, Jr. fdrake@acm.org
Thu, 11 Nov 1999 12:00:45 -0500 (EST)




  Well, now that things have quieted down a little (where?!), I'll
stir things up a little.
  Two broad approaches to structuring the documentation have been
presented:  One is the current document-centric model, where there are
a number of books/manuals/whatever that contain interesting
information, but need to be used as really large chunks.  Extracting
specific information is (apparently) difficult for humans (witness
the recent request for a random() function on the newsgroup by someone 
who said they looked in the index; just the wrong one); it's much
worse for applications.  The other approach, first proposed by Sean
McGrath, is to use a "microdocument" architecture, where each module
is represented in a separate structured document that is designed
specifically to handle that kind of information.
  First, I'll define some terms and comment on both approaches.

Terms
-----

  DOCUMENT-ORIENTED CONTENT:  Documents which are structured similarly
to the traditional presentation form; document-oriented DTDs feature
things like chapters, sections, titles, articles, etc.  This is what
David Megginson called "book" DTDs in "Structuring XML Documents."

  DOCUMENT-CENTRIC APPROACH: The human-read document is the primary
way to encode information, including module reference material.  A
"monumental" DTD would describe the document structure.  Supplemental
data files could be used for highly specialized information; these
could use alternate DTDs.

  MICRODOCUMENT APPROACH:  Multiple DTDs are used to encode
document-level information and module reference material.  Let's only
consider the case of one DTD to handle module reference material, and
a small number (1 or 2) of document-oriented DTDs; possibly one for
"sections" and one that could be used to compose sections and module
references into chapters and manuals.

Document-centric Approach
-------------------------

  This approach has the advantage of matching the current structure of 
the documentation.  The conversion isn't terribly difficult or even
time-consuming given the state of things in Doc/tools/sgmlconv/ in
the CVS repository.  There's clearly some work to do regarding DTD
specification and probably a bit of transformation, but a large part
of the coding and testing is done.
  The existing documents are tolerably organized for direct human use,
and incremental updates to the documents seem to work well.
  Documenting a module using the document-centric approach requires
little effort due to the simplicity of the existing markup, but it's
not always clear what things "go together."  This problem can be at
least partly solved by evolving the markup to support additional forms 
of linkages between information chunks, and keeping the processing
tools up to date with the markup changes.  This can be done before or
after a conversion to XML as it is largely orthogonal to syntax.


Microdocument Approach
----------------------

  Using a separate DTD to document modules offers advantages when it
comes time to extract information programmatically.  Creating skeleton 
module references from the current documentation would be harder and
would certainly require more code to be written, but the payoffs are
potentially very high.  To really make it work, a lot of attention
would have to be applied to the result of the first-stage conversion
to check the accuracy of the results, make the various bits of text
actually land in the right place (since everything is pretty much
thrown together now), and encode a lot of additional information about 
types, parameters, exceptions thrown, etc.  On the other hand, getting 
this information into the documents in the document-centric approach
also requires a lot of this work.
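
  To give a feel for the first-stage conversion, here is a minimal
sketch in Python.  It assumes the LaTeX source marks classes with
\begin{classdesc}{Name}{args}; the regular expression, the skeleton()
helper, and the SAMPLE text are made up for illustration and are not
part of the tools in Doc/tools/sgmlconv/.  It only produces empty
slots; descriptions, types, and exceptions would still have to be
filled in by hand.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pattern for \begin{classdesc}{Name}{args}.  Real LaTeX
# sources would need a proper parser rather than a regular expression.
CLASSDESC = re.compile(r"\\begin\{classdesc\}\{(\w+)\}\{([^}]*)\}")


def skeleton(latex_source, module_name):
    """Return a skeleton module reference as a string of XML."""
    lines = [
        "<module-reference>",
        "  <module-info>",
        "    <module>%s</module>" % escape(module_name),
        "    </module-info>",
    ]
    for name, args in CLASSDESC.findall(latex_source):
        lines.append("  <classdesc>")
        lines.append("    <class>%s</class>" % escape(name))
        lines.append("    <constructor>")
        lines.append("      <signature>")
        for arg in [a.strip() for a in args.split(",") if a.strip()]:
            # Types and descriptions are not recoverable from the LaTeX
            # signature; a human has to add them afterwards.
            lines.append("        <parameter name=%s/>" % quoteattr(arg))
        lines.append("        </signature>")
        lines.append("      </constructor>")
        lines.append("    </classdesc>")
    lines.append("</module-reference>")
    return "\n".join(lines)


# Invented input, loosely modelled on a module chapter.
SAMPLE = r"""
\begin{classdesc}{UnixMailbox}{fp}
Access a classic \UNIX{}-style mailbox.
\end{classdesc}

\begin{classdesc}{MHMailbox}{dirname}
Access an MH mailbox.
\end{classdesc}
"""

SKELETON = skeleton(SAMPLE, "mailbox")
print(SKELETON)
```

  Even in this toy form the descriptive text is simply dropped, which
is where the checking and hand work mentioned above would come in.
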
  An IDE could use the content provided by the module references very
effectively to provide help and smart name completion.  For
performance, the documentation would probably be loaded into some sort
of database so chunks of information could be retrieved very quickly,
and probably in some pre-digested form.  Inheritance diagrams can be
generated, and protocols/interfaces can be documented much more
clearly.
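
  As a sketch of what an application might do with a module
reference, the following Python loads an abbreviated copy of the
attached mailbox sample into a dictionary and answers a
name-completion query.  The element names follow the attached sample;
build_index(), complete(), and the dictionary layout are hypothetical,
and a real IDE would presumably use a proper database.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the attached sample; only two classes are kept and
# the descriptive text is shortened.
MODULE_REFERENCE = """\
<module-reference>
  <module-info>
    <module>mailbox</module>
    <synopsis>Read various mailbox formats.</synopsis>
    </module-info>
  <classdesc>
    <class>UnixMailbox</class>
    <implements><protocol>Mailbox</protocol></implements>
    <constructor>
      <signature>
        <parameter name="fp" protocol="file">The mailbox file.</parameter>
        </signature>
      </constructor>
    </classdesc>
  <classdesc>
    <class>MHMailbox</class>
    <implements><protocol>Mailbox</protocol></implements>
    <constructor>
      <signature>
        <parameter name="dirname" type="string">The directory.</parameter>
        </signature>
      </constructor>
    </classdesc>
</module-reference>
"""


def build_index(xml_text):
    """Map 'module.Class' to its constructor parameters and protocols."""
    root = ET.fromstring(xml_text)
    module = root.findtext("module-info/module")
    index = {}
    for desc in root.findall("classdesc"):
        name = desc.findtext("class")
        params = [p.get("name")
                  for p in desc.findall("constructor/signature/parameter")]
        protocols = [p.text for p in desc.findall("implements/protocol")]
        index["%s.%s" % (module, name)] = {
            "parameters": params,
            "implements": protocols,
        }
    return index


def complete(index, prefix):
    """Return the indexed names that start with prefix, sorted."""
    return sorted(name for name in index if name.startswith(prefix))


INDEX = build_index(MODULE_REFERENCE)
print(complete(INDEX, "mailbox.M"))
print(INDEX["mailbox.UnixMailbox"])
```

  The same index also records which classes implement the Mailbox
protocol, which is the kind of information an inheritance or
interface diagram would start from.
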
  The most significant drawback I can see is that the markup can very
easily become quite heavy, but this isn't unusual when there's a lot
of structured information to present.


Comparison
----------

  A wide variation in module documentation styles is possible using
the document-centric approach.  While most of the modules in the
Library Reference are presented in a fairly formulaic way, some are
not.  Note the chapters on the debugger and profiler, which really
don't use the styles used elsewhere in the Library Reference.  I'm not
sure if allowing this level of flexibility is good or bad; I could
make the case for both.  I can also see where allowing both could be a
good idea, but it may be reasonable to require a "standard" structure
for module documentation, regardless of the approach taken on the
whole, and then allow additional material to be provided using
document-oriented content.
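
  If a standard structure were required, checking for it could be
mostly mechanical.  A minimal sketch in Python, assuming the required
parts are the module name, synopsis, and overview used in the attached
sample; the REQUIRED list and missing_parts() are illustrative only,
and a DTD with a validating parser would normally do this job.

```python
import xml.etree.ElementTree as ET

# Assumed list of required parts, taken from the attached sample.
REQUIRED = ("module-info/module", "module-info/synopsis", "overview")


def missing_parts(xml_text):
    """Return the required element paths that the document lacks."""
    root = ET.fromstring(xml_text)
    return [path for path in REQUIRED if root.find(path) is None]


COMPLETE = """\
<module-reference>
  <module-info>
    <module>mailbox</module>
    <synopsis>Read various mailbox formats.</synopsis>
    </module-info>
  <overview><para>Access to mail messages.</para></overview>
</module-reference>
"""

# Invented example of a module reference that omits two parts.
INCOMPLETE = """\
<module-reference>
  <module-info>
    <module>example</module>
    </module-info>
</module-reference>
"""

print(missing_parts(COMPLETE))
print(missing_parts(INCOMPLETE))
```
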
  At any rate, last night I sat down with one module and the existing
documentation for it, and marked up a module reference for it using
the microdocument approach.  The markup is quite heavy compared to the 
current LaTeX file:

weyr(.../Doc/lib); wc libmailbox.tex mailbox.xml 
      53     251    1938 libmailbox.tex
     159     504    5364 mailbox.xml
     212     755    7302 total

  That's a 200% increase in line count and roughly a 177% increase in
file size.  The latter isn't much of an issue, but the former is
because it seriously impacts readability.
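
  The percentages can be recomputed from the wc output above.  A
small Python sketch; percent_increase() is just a helper name chosen
here:

```python
def percent_increase(old, new):
    """Return the increase from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


# Counts copied from the wc output: lines, words, bytes.
LATEX = {"lines": 53, "words": 251, "bytes": 1938}
XML = {"lines": 159, "words": 504, "bytes": 5364}

for key in ("lines", "words", "bytes"):
    print("%-5s %6.1f%%" % (key, percent_increase(LATEX[key], XML[key])))
```
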
  This explosion of markup is of most concern for authors; a lot of
markup is required to encode enough information to justify changing
the approach.  As more markup is required, it is increasingly
difficult to get contributions because it takes the authors more time
to document their work.  I'd like to maintain Python's standing as the 
best-documented free scripting language, and I'm not sure authors will 
be willing to use the more extensive markup.
  I'd also need a small (large?) army of volunteers to help convert
the generated skeleton module references to take advantage of the
ability to encode far more detail about modules than is currently
available.  Are there enough people sufficiently interested?  Doing
this one myself would require someone directly supporting the work;
the occasional evening would not get it done.


A Hybrid Approach
-----------------

  A hybrid approach could be taken in which the architecture is that
of the microdocument approach, but we support something similar to the 
current (document-centric) approach for the document-oriented content
components.  This would allow a slower migration, and facilities such
as the debugger could be documented using the document structure
rather than the module structure.
  The payoffs for application of the documentation are approximately
the same as for the strict microdocument approach.  The most
significant change is probably that some modules (those documented
only in document-oriented components) may not be described in the help 
system, or at least not fully described.
  The issues of conversion are largely the same as for the
microdocument architecture since most modules would be documented in
that way.  The document-oriented DTD(s) may be a little different, but 
that's the only substantial technical difference I see in getting it
done.


Status
------

  I haven't ventured to write a DTD yet for either approach; there's
still a lot to decide before that gets done.  I also don't want to
write a bunch of DTDs that aren't going to be used!
  I think we do need to consider the two approaches in the immediate
future.  Dealing with the legacy conversion software is tolerable for
now, but it's getting worse over time.  Rich linking is difficult in
the HTML output, which seems to be the most-used format, but I think
that's something that a lot of people would like to see.
  If we elect to go with the document-centric approach, there's a bit
of DTD design to do, and a bit of tweaking in the conversion tools,
but we're a long way there.
  Adopting the microdocument approach offers the advantage of a very
high long-term payoff, which is appealing, but please consider my
comments and pleas above carefully.
  The hybrid approach can be considered as roughly the same as the
microdocument approach, as discussed above.

  Comments?


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


Attachment: mailbox.xml (sample module reference)

<?xml version="1.0" encoding="iso-8859-1"?>
<module-reference>
  <module-info>
    <module>mailbox</module>
    <synopsis>Read various mailbox formats.</synopsis>
    <!-- possibly add "requires" or "imports" information here, as -->
    <!-- well as platform dependence, etc. -->
    </module-info>

  <overview>
    <para>This module defines a number of classes that allow easy and
      uniform access to mail messages in a mailbox.  Most of the
      supported mailbox formats come from the Unix world.</para>

    <para>None of the classes defined in this module lock the
      mailboxes that are accessed; this needs to be handled by
      application code.</para>
    </overview>

  <protocoldesc>
    <protocol>Mailbox</protocol>
    <method name="next">
      <signature>
        <return-value type="rfc822.Message">
          <!-- need a good way to distinguish between protocols and -->
          <!-- types, both visually and in the markup. -->
          The next message in the mailbox.  The message's <member
            of="rfc822.Message">fp</member> will be a
          <protocol>file</protocol> object, but not a real
          <type>file</type> object.  If no messages have been
          read, this will be the first message.  If all messages have
          been read, <constant>None</constant> will be returned.
          </return-value>
        </signature>
      </method>
    </protocoldesc>

  <classdesc>
    <class>UnixMailbox</class>
    <implements>
      <protocol>Mailbox</protocol>
      </implements>
    <description>
      <para>Access a classic Unix-style mailbox, where all messages are
        contained in a single file and separated by <quote>From name
        time</quote> lines.</para>
      </description>
    <constructor>
      <signature>
        <parameter name="fp" protocol="file">
          The file object <param>fp</param> points to the mailbox file.
          </parameter>
        </signature>
      <description>
        <para>Initialize the mailbox object and point to the first
          message in the mailbox.</para>
        </description>
      </constructor>
    </classdesc>

  <classdesc>
    <class>MmdfMailbox</class>
    <implements>
      <protocol>Mailbox</protocol>
      </implements>
    <description>
      <para>Access an <acronym>MMDF</acronym>-style mailbox, where all
        messages are contained in a single file and separated by lines
        consisting of four control-A characters.</para>
      </description>
    <constructor>
      <signature>
        <parameter name="fp" protocol="file">
          The file object <param>fp</param> points to the mailbox file.
          </parameter>
        </signature>
      <description>
        <para>Initialize the mailbox object and point to the first
          message in the mailbox.</para>
        </description>
      </constructor>
    </classdesc>

  <classdesc>
    <class>MHMailbox</class>
    <implements>
      <protocol>Mailbox</protocol>
      </implements>
    <description>
      <para>Access an <acronym>MH</acronym> mailbox, a directory with
        each message in a separate file with a numeric name.  Messages
        that are added to the mailbox after the instance is created
        are not accessible; a new instance is needed to access newly
        added messages.</para>
      </description>
    <constructor>
      <signature>
        <parameter name="dirname" type="string">
          The name of the mailbox directory.
          </parameter>
        </signature>
      <description>
        <para>Initialize the list of messages that can be loaded from
          the mailbox.</para>
        </description>
      </constructor>
    </classdesc>

  <classdesc>
    <class>Maildir</class>
    <implements>
      <protocol>Mailbox</protocol>
      </implements>
    <description>
      <para>Access a Qmail mail directory.  All new and current mail
        for the mailbox is made available.  Messages that are added to
        the mailbox after the instance is created are not accessible;
        a new instance is needed to access newly added messages.
        </para>
      </description>
    <constructor>
      <signature>
        <parameter name="dirname" type="string">
          The name of the mailbox directory.
          </parameter>
        </signature>
      <description>
        <para>The <param>dirname</param> parameter points to the
          mailbox directory.</para>
        </description>
      </constructor>
    </classdesc>

  <classdesc>
    <class>BabylMailbox</class>
    <implements>
      <protocol>Mailbox</protocol>
      </implements>
    <description>
      <para>Access a Babyl mailbox, which is similar to an
        <acronym>MMDF</acronym> mailbox.  Mail messages start with a
        line containing only <literal>'*** EOOH ***'</literal> and end 
        with a line containing only <literal>'\037\014'</literal>.
        </para>
      </description>
    <constructor>
      <signature>
        <parameter name="fp" protocol="file">
          A <protocol>file</protocol> object <param>fp</param> that
          points to the mailbox file.
          </parameter>
        </signature>
      <description>
        <para>Initialize the mailbox object and point to the first
          message in the mailbox.</para>
        </description>
      </constructor>
    </classdesc>
</module-reference>
