[DOC-SIG] Library reference manual debate

Guido van Rossum guido@CNRI.Reston.Va.US
Sat, 15 Nov 1997 15:30:49 -0500


Some SGML extremists have started lobbying for SGML or XML, which has
brought up quite a religious debate (maybe started by my remark that
SGML is not fit for humans to type :-).  I feel that we're not getting
anywhere unless we face some of the facts, so here's a reality check
followed by some opinions.

I hope I've moved the doc string discussion to a separate thread.  I
don't think the library manual should be tied in with doc strings in
any way, so it can be discussed separately.

The first problem is that the library manual is currently done in
LaTeX.  I would guess that 99% of the markup is structural -- the only
place where physical markup is used in a significant way is in the
use of 'strong' and 'emphasis' to mean a number of different things
(e.g. warnings, notes, implementation restrictions).  There are
a few places where physical markup is used to overcome some formatting
weirdnesses, but I've always tried to keep these to a minimum.

Any proposed solution that doesn't take into account how to convert
the existing library manual is a trivial reject.

I see a number of problems with the use of LaTeX -- but the fact that
"it's not SGML" is not one of them.  Perhaps the biggest problem is
that LaTeX and TeX are losing popularity.  TeX may still be the
standard for respectable and somewhat conservative publications like
the Astrophysical Journal, but most publishers nowadays are just as
happy to accept MS Word or other popular wordprocessors.

I would say that the one remaining reason to use TeX or LaTeX for some
groups is that TeX does mathematics better than anything else; however,
that's not relevant for the Python community.  From experience, I
would say that LaTeX does computer documentation rather poorly
(witness the many hacks in the myformat.sty file), and I haven't even
dealt properly with optional or keyword arguments, let alone classes
and methods and inheritance.

The decreasing popularity of LaTeX is a problem because it means that
potential contributors are discouraged -- many simply don't know
LaTeX, and even those that do know it may not have access to an
implementation any more.  Installing LaTeX is a major undertaking, and
one is less and less likely to find installations that already have it
installed, outside central Unix servers at academic institutions.  (I
did a web search on LaTeX for Windows 95; one of the pages,
http://www2.eece.maine.edu/~dprice/Latex/latex.htm, which seems to
have a lot of useful info, leaves me with the impression that one
needs to be *very* motivated to bring this to a good end.  It ends
with the admonition "Good Luck! You're gonna need it...")

Another problem, caused by this, is that there are few LaTeX hackers
around who can help with the creation of new macros (e.g. for keyword
arguments).

On the plus side, there is truth in the old saying "don't fix it if it
ain't broken."  I personally have access to a working LaTeX
installation, the latex2html converter produces adequate HTML (I still
need to work on the translation for a few of the environments
introduced by myformat.sty, but that shouldn't be too hard), and I
haven't heard too many complaints yet from people who would like to
contribute documentation but don't know LaTeX -- they pick it up
pretty easily from the template I provide.

*** The real problem seems to be how to get people to contribute at all! ***

If using SGML or XML would make more people eager to contribute, I
might be convinced; but somehow I doubt it.  At the moment, both the
learning curve and the installation effort for SGML or XML tools
still appear to be steeper than for LaTeX.

There has been some debate on SGML vs. XML.  It seems that SGML can be
made easy to type, at the cost of making it much harder to parse
correctly.  XML appears to be designed mostly as a transport format
(one page with XML info I found made the explicit point that being
easy to type was *not* a design criterion).  Anyway, once a decision
to use either is made, conversion between the two is probably easy,
especially since XML is a true subset of SGML.

Finally, TIM has been brought up.  It's a bit easier to type and more
pleasing to my eye than shorthand SGML (e.g. SGML <title>whatever</>
vs. TIM @title{whatever}) and it's a lot easier to parse.  It uses
structural markup and has a simple macro language to add new
structural elements.  This makes it relatively easy to convert to
SGML, as long as the TIM authors adhere to reasonable structuring
constraints (i.e. don't abuse constructs for different purposes).
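
As a rough illustration of that conversion (a sketch only, not TIM's
real parser): assuming every element is written @name{body}, as in the
@title{whatever} example above, and that bodies never contain literal
braces, a few lines of Python can rewrite elements from the inside
out.  The name tim_to_sgml is made up for this example, and characters
such as < and & that SGML would need escaped are ignored.

```python
import re

# Matches an innermost TIM-style element such as @title{whatever}.
# Assumption: names are alphanumeric and the body contains no braces.
_ELEMENT = re.compile(r"@([A-Za-z][A-Za-z0-9]*)\{([^{}]*)\}")


def tim_to_sgml(text):
    """Rewrite @name{body} as <name>body</name>, innermost elements first."""
    previous = None
    while previous != text:
        previous = text
        # Each pass converts the elements whose bodies hold no braces;
        # that exposes their parents to the next pass.
        text = _ELEMENT.sub(r"<\1>\2</\1>", text)
    return text


print(tim_to_sgml("@section{@title{whatever} Some text.}"))
```

Because the innermost elements are converted first, nested elements
come out properly nested.  Input that abuses a construct for a
different purpose would still convert, but to meaningless SGML, which
is why the structuring constraints matter.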

TIM's primary weakness at the moment seems to be its toolchain, which
starts out well (the parser is written in Python) but quickly runs into
problems on non-Unix platforms: for HTML generation it uses a Perl4
script, and for PostScript it goes through texinfo and hence through
tex.  For Unix, TIM's toolchain is perfect, however, and I like the
simplicity of its approach -- it should be simple enough to rewrite
the TIM-to-HTML converter in Python (maybe using HTMLgen?).

For Windows, it just *may* be possible that Word 97 will actually
parse the HTML generated by TIM so as to make it possible to generate
PostScript on Windows platforms with commonly available tools; in any
case, a prospective TIM author on a Windows platform would only need
the HTML generating part of the toolchain for on-screen previewing.

I'd love this discussion to come to an end.  I think that we would be
in good shape with TIM, *if* we solve two outstanding problems.  One
should be easy: rewriting the TIM-to-HTML tool in Python.

The other one is much hairier: conversion of the existing LaTeX source
to TIM!  This needs to be a high quality conversion, e.g. ideally it
should maintain comments and other aspects of source formatting
(like line breaks) that don't affect the generated pages but do
affect the human reader of the source, because the output of the
conversion will be edited manually henceforth.  On the other hand,
this only needs to be done once, so a small amount of manual tweaking
is acceptable.  The old conversion script (partparse.py) which I still
have lying around somewhere is probably able to do this with some
small changes (I sure hope those changes are small, because this is
one horrible piece of code... good for a one-off job though).

Those who want SGML or XML should be able to convert TIM to their
favorite DTD using a different back end for the TIM front end.  I
would love specific feedback on the structural capabilities of TIM;
ideally, TIM should map directly onto a real SGML DTD as far as
document structure is concerned.  However, I don't want to compromise
TIM to make it possible to parse it with a generic SGML scanner; the
efforts to move HTML towards strict SGML scanner compatibility have
taught me a valuable lesson.

One final note: I looked at Perl's POD (Plain Old Documentation) for a
few seconds.  It's more limited than TIM and uses physical markup
(e.g. B<words in bold>), but has one feature that I like: a block of
indented text offset by blank lines (I believe) is automatically
interpreted as a code sample block (verbatim in LaTeX terms,
@codeexample in TIM).  This makes POD source remarkably readable.  I
presume that it would be trivial to add this to the TIM front-end.  (I
particularly like this idea because it's the same convention that I
used in the Python FAQ wizard. :-)
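
The convention is simple enough to sketch.  What follows is a guess at
the rule as I described it above, not POD's actual implementation:
blocks are separated by blank lines, and a block counts as a code
sample when every line in it starts with a space or a tab.  The name
split_blocks is hypothetical; a real TIM front-end would presumably
turn the code blocks into @codeexample, while this only labels them.

```python
def split_blocks(source):
    """Split text into ("text", body) and ("code", body) blocks.

    Blocks are separated by blank lines; a block is labelled "code"
    when every one of its lines begins with a space or a tab.
    """
    blocks = []
    current = []
    # The extra "" acts as a final blank line that flushes the last block.
    for line in source.splitlines() + [""]:
        if line.strip():
            current.append(line)
        elif current:
            is_code = all(item[0] in " \t" for item in current)
            kind = "code" if is_code else "text"
            blocks.append((kind, "\n".join(current)))
            current = []
    return blocks


sample = "Call it like this:\n\n    x = spam(1)\n    print(x)\n\nThat is all.\n"
for kind, body in split_blocks(sample):
    print(kind, repr(body))
```

One limitation of this sketch: a code sample that itself contains a
blank line comes out as two adjacent code blocks, so a real front-end
would want to merge those.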

--Guido van Rossum (home page: http://www.python.org/~guido/)
