[Doc-SIG] Docstring grammar: a very revised proposal

Fri, 21 Jan 2000 23:33:36 -0800

Over the last week, I've been incorporating all of the comments that
various folks made in the DOC-SIG about the docstring grammar
proposal, and writing code to provide a reference implementation.  It's
been going well, and I was polishing it up to post.  In preparation
for a discussion at IPC8's DevDay, and anticipating (perhaps
mistakenly) some resistance to "yet another structured text format"
from Jim Fulton, I went to look at StructuredText to find out what
aspects of it I had problems with in the past.  Just as a point of
historical interest, I have long been a fan of the idea behind
StructuredText, and have used a fairly old version of it to generate
much of my website automatically.  I stopped doing so because I found
it lacking for the things I was trying to get it to do, unintuitive at
times, and too hard to modify for my (somewhat exotic) needs.

I just looked again at StructuredText's rules (not code) within the
context of "Python code documentation" exclusively (a much narrower
domain than website management), and found it better suited than I
remembered, with three kinds of problems:

     - StructuredText is too generic -- in markup terms, it's much like
       SGML or XML without a DTD (or with a very weak one).  So while
       it's an OK markup, it's too weak as it stands.

     - StructuredText has some markup rules which I think are wrong, in
       the sense that they fail the "naturalness" test which we set out
       to use in our alternative markup or they cause "bad" side-effects
       when a docstring is not fully marked up.

     - Missing features (not surprising -- the aim is different)

I'll describe each of these kinds of problems in detail, but first,
let's look at the rules which are followed by the *latest* version of
StructuredText.py, which is available at the Zope CVS site.  I include
the docstring for StructuredText.py here:

        Structured Text Manipulation

        Parse a structured text string into a form that can be used with
        structured formats, like html.

        Structured text is text that uses indentation and simple
        symbology to indicate the structure of a document.

        A structured string consists of a sequence of paragraphs separated
        by one or more blank lines.  Each paragraph has a level which is
        defined as the minimum indentation of the paragraph.  A paragraph
        is a sub-paragraph of another paragraph if the other paragraph is
        the last preceding paragraph that has a lower level.

        Special symbology is used to indicate special constructs:

        - A single-line paragraph whose immediately succeeding paragraphs
          are lower level is treated as a header.

        - A paragraph that begins with a '-', '*', or 'o' is treated as an
          unordered list (bullet) element.

        - A paragraph that begins with a sequence of digits followed by a
          white-space character is treated as an ordered list element.

        - A paragraph that begins with a sequence of sequences, where each
          sequence is a sequence of digits or a sequence of letters followed
          by a period, is treated as an ordered list element.

        - A paragraph with a first line that contains some text, followed by
          some white-space and '--' is treated as a descriptive list
          element. The leading text is treated as the element title.

        - Sub-paragraphs of a paragraph that ends in the word 'example' or
          the word 'examples', or '::' is treated as example code and is
          output as is.

        - Text enclosed single quotes (with white-space to the left of the
          first quote and whitespace or punctuation to the right of the
          second quote) is treated as example code.

        - Text surrounded by '*' characters (with white-space to the left of
          the first '*' and whitespace or punctuation to the right of the
          second '*') is emphasized.

        - Text surrounded by '**' characters (with white-space to the left
          of the first '**' and whitespace or punctuation to the right of
          the second '**') is made strong.

        - Text surrounded by '_' underscore characters (with whitespace to
          the left and whitespace or punctuation to the right) is made
          underlined.

        - Text enclosed by double quotes followed by a colon, a URL, and
          concluded by punctuation plus white space, *or* just white space,
          is treated as a hyper link. For example:

            "Zope":http://www.zope.org/ is ...

          Is interpreted as '<a href="http://www.zope.org/">Zope</a> is ....'
          Note: This works for relative as well as absolute URLs.

        - Text enclosed by double quotes followed by a comma, one or more
          spaces, an absolute URL and concluded by punctuation plus white
          space, or just white space, is treated as a hyper link. For
          example:

            "mail me", mailto:amos@digicool.com.

          Is interpreted as '<a href="mailto:amos@digicool.com">mail me</a>.'

        - Text enclosed in brackets which consists only of letters, digits,
          underscores and dashes is treated as hyper links within the
          document.  For example:

            As demonstrated by Smith [12] this technique is quite effective.

          Is interpreted as '... by Smith <a href="#12">[12]</a> this ...'.
          Together with the next rule this allows easy coding of references
          or end notes.

        - Text enclosed in brackets which is preceded by the start of a
          line, two periods and a space is treated as a named link. For
          example:

            .. [12] "Effective Techniques" Smith, Joe ...

          Is interpreted as '<a name="12">[12]</a> "Effective Techniques" ...'.
          Together with the previous rule this allows easy coding of
          references or end notes.

        - A paragraph that has blocks of text enclosed in '||' is treated as
          a table. The text blocks correspond to table cells and table rows
          are denoted by newlines. By default the cells are center aligned.
          A cell can span more than one column by preceding a block of text
          with an equivalent number of cell separators '||'. Newlines and
          '|' cannot be a part of the cell text. For example:

              |||| **Ingredients** ||
              || *Name* || *Amount* ||
              ||Spam||10||
              ||Eggs||3||

          is interpreted as::

            <TABLE BORDER=1 CELLPADDING=2>
             <TR>
              <TD ALIGN=CENTER COLSPAN=2> <strong>Ingredients</strong> </TD>
             </TR>
             <TR>
              <TD ALIGN=CENTER COLSPAN=1> <em>Name</em> </TD>
              <TD ALIGN=CENTER COLSPAN=1> <em>Amount</em> </TD>
             </TR>
             <TR>
              <TD ALIGN=CENTER COLSPAN=1>Spam</TD>
              <TD ALIGN=CENTER COLSPAN=1>10</TD>
             </TR>
             <TR>
              <TD ALIGN=CENTER COLSPAN=1>Eggs</TD>
              <TD ALIGN=CENTER COLSPAN=1>3</TD>
             </TR>
            </TABLE>

-------
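
To make the paragraph/level rule at the top of that docstring concrete,
here is a minimal sketch of it in Python.  It is my own illustration
under stated assumptions, not code taken from StructuredText.py: the
function names and the sample text are hypothetical, and tabs are simply
expanded before indentation is measured.

```python
import re


def paragraphs_with_levels(text):
    """Split text on blank lines; a paragraph's level is its minimum indentation."""
    result = []
    for block in re.split(r"\n\s*\n", text.expandtabs().strip("\n")):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue
        level = min(len(ln) - len(ln.lstrip()) for ln in lines)
        result.append((level, " ".join(ln.strip() for ln in lines)))
    return result


def parents(paras):
    """For each paragraph, the index of the last preceding paragraph with a
    lower level (its parent), or None for a top-level paragraph."""
    out = []
    for i, (level, _) in enumerate(paras):
        parent = None
        for j in range(i - 1, -1, -1):
            if paras[j][0] < level:
                parent = j
                break
        out.append(parent)
    return out


# Hypothetical sample: a title, two children, a grandchild, then top level.
SAMPLE = (
    "Title\n\n"
    "    First child\n    continues here\n\n"
    "    Second child\n\n"
    "        Grandchild\n\n"
    "Back at top\n"
)
PARAS = paragraphs_with_levels(SAMPLE)
PARENTS = parents(PARAS)
```

With this sample, "Title" ends up as the parent of both children, and
"Second child" as the parent of "Grandchild".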

This markup has nice features:

    - it's very similar to the core of what we were discussing (no big
      surprise, since my proposal was mostly stolen from earlier
      StructuredText =).

    - The (new to me) mechanism for dealing with references to URLs and to
      other bits of text is quite elegant.

Ok, now onto the problems with the above markup rules, IMO of course.

StructuredText as XML without a DTD:

    Unlike the grammar which was discussed on the doc-sig, StructuredText
    does not allow us to specify special syntax for special kinds of text,
    such as: signature blocks/tooltip descriptions, doctest.py code,
    and keyworded paragraphs.  This is a missing feature rather than a flaw,
    and could maybe be fixed by adding a postprocessor which would be
    Python-internal-doc specific.

StructuredText as a markup with some flaws for inline doc:

    The use of single quotes to mark up inline code (as in 'x') can be
    surprising.  Many current docstrings use 'x' to refer to the
    *string* containing the character x, not the variable x.  In
    StructuredText, the quotes would disappear in the rendering. With
    practice, the current scheme could be used but users would have to
    learn to write '"x"' to have their intent carry through to the
    renderer.  This is probably my biggest problem with StructuredText
    because I don't know how to fix it while maintaining
    compatibility. Related questions for Jim or Ken are 1) How does
    StructuredText parse ''?  2) How can one have a single quote in
    verbatim text?
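
The single-quote concern can be shown with a small sketch.  The regular
expression below is only my approximation of the rule as worded in the
docstring (it also accepts a quote at the very start of the string); it
is an assumption, not the pattern StructuredText.py actually uses, and
the `<code>` output tag is likewise just for illustration.

```python
import re

# Hypothetical approximation of the "single quotes mean inline code" rule:
# an opening quote preceded by whitespace (or the start of the string), and
# a closing quote followed by whitespace, punctuation, or the end.
INLINE_CODE = re.compile(r"(?<!\S)'([^'\n]+)'(?=[\s.,;:!?)]|$)")


def render_inline_code(text):
    """Replace 'x' with <code>x</code>; the quotes themselves are dropped."""
    return INLINE_CODE.sub(r"<code>\1</code>", text)


# The author meant the *string* 'x', but the quotes vanish in the output.
PLAIN = render_inline_code("Pass 'x' to select the x axis.")

# Writing '"x"' is the workaround needed to keep quotes in the rendering.
QUOTED = render_inline_code("""Pass '"x"' to select the x axis.""")
```

Under these assumptions, the first call yields `<code>x</code>` with no
quotes at all, while the second keeps the double quotes inside the code
element.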

    The tagging of underlined text with _'s is suboptimal.  Underlines
    shouldn't be used from a typographic perspective (underlines were
    designed to be used in manuscripts to communicate to the
    typesetter that the text should be italicized -- no well-typeset
    book ever uses underlines), and the markup conflicts with
    double-underscored Python variable names (__init__ and the like),
    which would get truncated and underlined when that effect is not
    desired.  Note
    that while *complete* markup would prevent that truncation
    ('__init__'), I think of docstring markups much like I think of
    type annotations -- they should be optional and above all do no
    harm. In this case the underline markup does harm.
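
Here is a sketch of how the truncation can come about.  Again the
pattern is my own approximation of the underline rule as the docstring
words it, not the actual StructuredText.py expression, and `<u>` is just
a stand-in for whatever the renderer emits.

```python
import re

# Hypothetical approximation of the underline rule: '_' preceded by
# whitespace opens, '_' followed by whitespace or punctuation closes.
UNDERLINE = re.compile(r"(?<!\S)_([^\n]+?)_(?=[\s.,;:!?)]|$)")


def render_underline(text):
    """Replace _text_ with <u>text</u>, consuming the outer underscores."""
    return UNDERLINE.sub(r"<u>\1</u>", text)


# An unmarked docstring sentence that merely mentions a special method.
RESULT = render_underline("Override __init__ to set defaults.")
```

With this pattern the outermost underscores are eaten as markup, so the
reader sees an underlined `_init_` instead of `__init__`.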

    The requirement that a paragraph end with the word example or
    examples or :: goes against my natural style, as I often do not
    want such a word or punctuation before a "displayed" paragraph.
    Furthermore, the spec currently doesn't say how the renderer is
    supposed to process the :: -- is it displayed as two colons, one,
    or none?  If the two colons are not displayed by the renderer,
    then my objection is diminished, although I would have preferred a
    markup which is local to the paragraph which is affected, not the
    previous one (cut and paste errors follow too easily). In some
    versions in the past at least, both colons were displayed.  I'll
    leave that as an open question, as additional markup could provide
    an alternative which would suit me.

Missing features:

    The definition of references is well-designed for referencing
    URLs.  The docstring proposal needs to address referencing other
    code elements (methods of current class, other classes, other
    methods of other classes, builtin modules, imported modules, etc.)
    This is also more of a missing feature than a real flaw, and
    defining the namespaces for lookup of 'reference targets' would
    probably go most of the way towards fixing the spec for our needs.
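
As a sketch of what "defining the namespaces for lookup" might mean, the
function below tries a reference name against the current class, then a
module namespace, then the builtins.  The function name, its signature,
and above all the lookup order are assumptions made for illustration;
nothing here has been agreed on in the proposal.

```python
import builtins


def resolve_reference(name, cls=None, module_namespace=None):
    """Return (where, obj) for the first namespace that defines `name`,
    or None.  Dotted names such as 'Stack.push' are followed by getattr."""
    head, *rest = name.split(".")
    candidates = []
    if cls is not None and hasattr(cls, head):
        candidates.append(("class", getattr(cls, head)))
    if module_namespace is not None and head in module_namespace:
        candidates.append(("module", module_namespace[head]))
    if hasattr(builtins, head):
        candidates.append(("builtins", getattr(builtins, head)))
    for where, obj in candidates:
        try:
            for attr in rest:
                obj = getattr(obj, attr)
        except AttributeError:
            continue  # this namespace has the head but not the full path
        return where, obj
    return None


class Stack:
    """Hypothetical class used only to exercise the lookup."""

    def push(self, item):
        pass


IN_CLASS = resolve_reference("push", cls=Stack)
IN_MODULE = resolve_reference("Stack.push", module_namespace={"Stack": Stack})
IN_BUILTINS = resolve_reference("len")
MISSING = resolve_reference("no_such_name")
```

A renderer could then turn each resolved reference into a link and
report the unresolved ones.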

    The DOC-SIG folks strongly wanted to be able to have list items
    not require blank lines in between them.  This is not hard to do
    from scratch (I've got code to do it).  I suspect it could be
    added to StructuredText as a postprocessing step.  (From
    experience, such list items should be required to be indented
    relative to the previous line, to avoid spurious bulletization.)
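
The indentation requirement can be sketched as follows.  This is not the
code mentioned above, just a small illustration under my own
assumptions: a line starting with '-', '*' or 'o' counts as a list item
only if it is indented more than the previous non-blank line, follows a
blank line, or lines up with a list that is already open.  The names and
the sample text are hypothetical.

```python
import re

BULLET = re.compile(r"^\s*[-*o]\s+(.*)$")


def find_list_items(text):
    """Return (line_number, item_text) for lines accepted as list items."""
    items = []
    open_indents = []  # indentation of each list currently open
    prev_indent = -1   # -1: start of text, or just after a blank line
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            prev_indent = -1
            continue
        indent = len(line) - len(line.lstrip())
        match = BULLET.match(line)
        # Lists indented deeper than this line are finished.
        while open_indents and open_indents[-1] > indent:
            open_indents.pop()
        is_sibling = bool(open_indents) and open_indents[-1] == indent
        is_indented = indent > prev_indent
        if match and (is_sibling or is_indented):
            items.append((number, match.group(1)))
            if not is_sibling:
                open_indents.append(indent)
        elif is_sibling:
            # Plain text at the list's own indentation closes that list.
            open_indents.pop()
        prev_indent = indent
    return items


SAMPLE = (
    "Options:\n"
    "  - verbose output\n"
    "  - quiet output\n"
    "    continued here\n"
    "The range is 3\n"
    "- 5 at most.\n"
)
ITEMS = find_list_items(SAMPLE)
```

The last line of the sample is a wrapped sentence that happens to start
with a dash; because it is not indented relative to the line before it,
it is left alone rather than turned into a bullet.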

    I think many folks liked the idea of "tagged" paragraphs, as in:

        Author: Guido van Rossum

    and

        Release-History:
            0.1: June 1920
            0.2: July 1919

    etc.  Again, postprocessing smarts for this could be added while
    keeping backwards compatibility.  Many of the features related to
    these tagged paragraphs (internationalization of keywords, etc.)
    could be dealt with in the postprocessor.
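
A postprocessing step for tagged paragraphs might look roughly like
this.  The keyword table, the function name, and the return shape are
all assumptions for the sake of the sketch; a table like `KNOWN_TAGS` is
also where internationalized keywords could be plugged in.

```python
import re
import textwrap

# Hypothetical keyword table; the real set of tags is still undecided.
KNOWN_TAGS = {"Author", "Release-History"}

TAG_LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):\s*(.*)$")


def parse_tagged(paragraph, known_tags=KNOWN_TAGS):
    """Return (tag, value) for a tagged paragraph, else None.

    The value is a string for the one-line form ('Author: ...') and a
    list of stripped lines for the block form ('Release-History:').
    """
    lines = [ln for ln in textwrap.dedent(paragraph).splitlines() if ln.strip()]
    if not lines:
        return None
    match = TAG_LINE.match(lines[0])
    if not match or match.group(1) not in known_tags:
        return None
    tag, rest = match.groups()
    if len(lines) == 1:
        return (tag, rest) if rest else None
    if rest:
        return None  # a value on the tag line plus a block is not handled
    return tag, [ln.strip() for ln in lines[1:]]


AUTHOR = parse_tagged("Author: Guido van Rossum")
HISTORY = parse_tagged("Release-History:\n    0.1: June 1920\n    0.2: July 1919")
NOT_A_TAG = parse_tagged("Note: this is ordinary prose")
```

Paragraphs that do not start with a known keyword are returned as None,
so they would fall through to the ordinary markup rules untouched.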

Technical points:

    StructuredText.py uses the regex module, which is deprecated.

    StructuredText.py currently rarely produces an exception.  The
    output may not be what you expected, but it will produce output.
    For a rigorous docstring markup, I'd expect to see much stricter
    rules enforced, which raises technical questions (I suspect, for
    example, that line numbers are not maintained as part of the
    StructuredText.py parsing process).  This requirement would
    probably have the most impact on the current source.
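
To illustrate why line numbers matter for strict checking, here is a
sketch in which each paragraph remembers the line it started on, so that
a violation can be reported with a location.  The `MarkupError` class
and the particular rule checked (an odd number of '**' markers in a
paragraph) are hypothetical examples, not part of StructuredText.py or
of any agreed spec.

```python
class MarkupError(ValueError):
    """Hypothetical error type that carries the offending line number."""

    def __init__(self, message, lineno):
        super().__init__(f"line {lineno}: {message}")
        self.lineno = lineno


def numbered_paragraphs(text):
    """Yield (first_line_number, lines) for blank-line separated paragraphs."""
    start, current = None, []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not current:
                start = number
            current.append(line)
        elif current:
            yield start, current
            current = []
    if current:
        yield start, current


def check_strong_markers(text):
    """Raise MarkupError if a paragraph has an unbalanced '**' marker."""
    for start, lines in numbered_paragraphs(text):
        if " ".join(lines).count("**") % 2:
            raise MarkupError("unbalanced '**' marker in paragraph", start)


def first_error(text):
    """Return the line number of the first violation, or None if clean."""
    try:
        check_strong_markers(text)
    except MarkupError as exc:
        return exc.lineno
    return None


SAMPLE = "Fine **here**.\n\nSecond paragraph\nwith a **dangling marker.\n"
ERROR_LINE = first_error(SAMPLE)
CLEAN = first_error("All **balanced** here.\n")
```

The point of the sketch is only that the parser has to carry positions
along from the start; bolting them on after the text has been split and
rewritten is much harder.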

Assuming this analysis is correct, I am inclined to scrap my original
proposal, shelve my current prototype code, and work instead with Jim
Fulton and whoever else to discuss whether and how to modify
StructuredText's format and code to extend it to the needs which the
DOC-SIG expressed.  The value in this approach is:

    - StructuredText.py (the code) exists and works. (prior art)

    - There is a fair bit of code which uses StructuredText's markup.
      (user base)

    - Much of the work would be postprocessing of StructuredText.py
      analysis. (modularity)

    - The features of the current markup which are deemed problematic could
      be disabled with a runtime switch if the StructuredText community
      disagrees with my assessment. (configurability)

An alternative course of action is to take StructuredText.py and
modify and rename it (aka code fork).  This is of course less desirable,
but would be the efficient way of dealing with either irreconcilable
differences of opinion or differing needs (it is possible that the
Zope folks do not want to see their code burdened with e.g. line
number maintenance code because of performance or other reasons).
We also need to know whether the Zope folks would *want* to
push StructuredText.py into the Python standard library, which I have
always assumed to be a goal for whatever tool we come up with.

A final alternative is for me to revise my current (unpublished)
manuscript to incorporate some of the things I like about StructuredText
which we didn't have (the reference naming scheme mostly), and keep a
parallel track in spec and code.

Just as a point of note: I do maintain that keeping StructuredText as it
currently is without postprocessing is inadequate for intelligent
docstring markup, and do not consider that a solution to the problem at
hand, although it's a fine everyday strategy in the absence of a
solution.

I would like to discuss which approach should be followed at the
DevDay session.  Feel free to email or post feedback between now and
the conference, whether or not you'll be there.  Assuming a discussion
does take place, I'll try to post a summary post-conference.

Cheers,

David Ascher