[Doc-SIG] XML Conversion Update

Thu, 26 Aug 1999 17:14:29 -0400 (EDT)

  Last week I promised on the Python list to describe the current
status of the conversion to SGML/XML.  Here it is!

  I'm currently able to parse all the LaTeX markup and generate either 
XML or SGML.  The structure of the output is very similar to the input 
structure, but a number of minor improvements are made.  The
improvements are mostly very localized; they keep the markup from
getting too complex and disjointed, and eliminate some bogosities.
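
  To give a flavor of the kind of localized mapping involved, here is
a minimal sketch (not the actual conversion code) that turns simple
one-argument macros into XML elements.  The macro-to-element table,
the name latex_to_xml, and the regular expression are all assumptions
made for illustration; real LaTeX (nested braces, environments,
optional arguments) needs a proper parser.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from LaTeX macro names to XML element names.
MACRO_TO_ELEMENT = {
    "module": "module",
    "function": "function",
    "code": "code",
}

# Matches a one-argument macro with no nested braces, e.g. \module{os}.
MACRO_RE = re.compile(r"\\([A-Za-z]+)\{([^{}]*)\}")


def latex_to_xml(text):
    """Convert simple one-argument macros in `text` to XML elements."""
    parts = []
    pos = 0
    for match in MACRO_RE.finditer(text):
        # Plain text before the macro: escape &, <, and > for XML.
        parts.append(escape(text[pos:match.start()]))
        name, arg = match.group(1), match.group(2)
        element = MACRO_TO_ELEMENT.get(name)
        if element is None:
            # Unknown macro: pass the original text through (escaped).
            parts.append(escape(match.group(0)))
        else:
            parts.append("<%s>%s</%s>" % (element, escape(arg), element))
        pos = match.end()
    # Whatever follows the last macro.
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

  Calling latex_to_xml(r"See \module{os}.") yields
"See <module>os</module>."; macros missing from the table are left
alone so they can be spotted and handled later.
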
  I am not at all decided on a DTD to use.  I see three options:

  1.  DocBook -- this has been developed and heavily use-tested by a
      number of corporate users, and is known to be good for technical 
      documentation.  There are tools and stylesheets available to
      convert from DocBook to HTML and printed formats.  We'd probably 
      need to specialize it, but it's designed for that.  Konrad
      Hinsen has already developed one customization that he's using
      to document Python modules, and there's an initiative to create
      a common extension for documenting OO constructs.  I've asked
      Konrad for some sample documentation so I can see how it
      actually works out.  My concern with DocBook is that the markup
      may be a bit on the "heavy" side; I don't want the document
      source to be so markup-heavy that I'm the only one willing to
      work on it.

  2.  Create something similar to what we had in LaTeX, but with fewer
      warts.  This is appealing because the conversion would be done
      sooner, and it would be the easiest to adopt for people already
      familiar with the current markup.  However, new stylesheets
      would be needed, delaying the usefulness of the result.

  3.  Create something entirely new and specific to Python.  Clearly,
      this offers a lot of work to all the volunteers.  We'd need
      requirements analysis, DTD design, stylesheets, and probably
      lots of things I haven't thought of.  However, it also means we
      can limit the weight of the markup in the source, which might be 
      a major advantage in getting people to use it.  But *everyone*
      would have to learn it (well, everyone that writes documentation
      at any rate).  This offers a great deal of opportunity to "get
      it right" for Python, but also a lot of rope.  (You know what
      rope is used for, right?)

  I'd like to see some discussion on what should be done and what
needs to be done.  From where I sit, the most important thing is to
make sure we can maintain a high level of semantic markup (hopefully
making further improvements over what we already have), with
generation of hypertext (HTML, info, whatever) being the next most
important thing.  Typeset documents are a requirement, but aren't as
high up the list.
  I'm not terribly concerned about how XML/SGML-->foo conversion
processes are implemented, with the caveat that I need to be able to
understand them without climbing a massive learning curve.  Clearly,
Python code is a major option for tools (surprised?), but I can easily 
deal with using Java tools (with or without JPython), DSSSL processors 
(just don't expect me to maintain Jade/OpenJade!), XSL, CSS, and
whatnot.  I'd like to get away from having any Perl scripts involved,
not because I think Perl is Evil, but because I'm not a Perl hacker.
  (Don't get me wrong; I make no claim that Perl is not Evil! ;)
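
  As a rough illustration of how small a Python-based XML-->HTML step
can be, the following sketch walks an element tree and emits HTML.
The element names (para, function, module), the mapping table, and
the helper names to_html and convert are hypothetical; nothing here
reflects a chosen DTD.  It assumes the standard library's
xml.etree.ElementTree and html modules.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping: semantic element -> (HTML tag, CSS class or None).
ELEMENT_TO_HTML = {
    "para": ("p", None),
    "function": ("code", "function"),
    "module": ("code", "module"),
}


def to_html(node):
    """Recursively render one element and its children as HTML."""
    # Unmapped elements fall back to a <span> that keeps the semantic
    # name as a class, so the information is not lost.
    tag, css_class = ELEMENT_TO_HTML.get(node.tag, ("span", node.tag))
    attrs = ' class="%s"' % css_class if css_class else ""
    parts = ["<%s%s>" % (tag, attrs), escape(node.text or "")]
    for child in node:
        parts.append(to_html(child))
        # Text that follows the child but belongs to this element.
        parts.append(escape(child.tail or ""))
    parts.append("</%s>" % tag)
    return "".join(parts)


def convert(xml_source):
    """Parse an XML string and return the HTML rendering of its root."""
    return to_html(ET.fromstring(xml_source))
```

  The semantic element name survives as a class attribute, so the
hypertext output keeps some of the semantic markup for a stylesheet
to use.
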
  Comments, suggestions, volunteers?

  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives