[Doc-SIG] docstring grammar

Edward Welbourne <eddyw@lsl.co.uk>
Tue, 30 Nov 1999 12:44:47 +0000


David Ascher wrote:
> I propose that part of the definition of a keyword is (along with any
> special parsing rules) whether it can be duplicated in a docstring.

FAPP we can approach both this and the `context-sensitive' stuff from
the same point of view as SGML: precisely because

Blah:
    something legitimate
    in a Blah block

maps directly to

<BLAH>
something legitimate
within a BLAH
</BLAH>

all the kinds of rule that a DTD could have imposed on BLAH are
sensible things to impose on Blah.  In particular, rather than `whether
it can be duplicated in a docstring', we have a nested tree structure in
our hands, so we can ask whether it can be duplicated as a child of its
parent.  Suppose Date to be unique:

Author: David Ascher
Release:
    Date: 1999/Nov/28
    Name: proto-post-gendoc:0.2
    Media: e-mail
Bugs:
    Report:
        Date: 1999/Nov/29

        As initially specified, the denotation for italic conflicts with
        python identifiers where these genuinely start and end in
        underscore.

        Status: resolved, adopting gendoc's approach

    Report:
        Date: ...

in which Date is `unique' but shows up many times in one doc-string.  I
would suggest that a tag is either unique in all contexts that allow for
it, or in none (so we don't have ickiness in which *some* tags allow
several Date subordinates - that kind of stuff makes it harder for folk
to remember what's unique and what isn't).  The right layer

A note on tags: we seem to be headed for `python identifier followed by
a colon'.  I'd like to argue for RFC 822 headers - that is,
specifically, to allow hyphens, so as to allow

Bugs:
    Reports-to: doc-sig@python.org
    Report: ... as above ...

and, indeed, to change Bugs: to Known-bugs:

Of course we could use _, but hyphen comes more naturally to text and
the parser for our keywords (unlike that for python identifiers) doesn't
have to worry about subtraction as `something we might be doing here' to
confuse with recognising the keyword.
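
As a rough illustration of the hyphenated-keyword idea, here is a
minimal sketch of a recogniser for such lines.  The pattern and the
name split_keyword are assumptions made for this example only: it takes
a keyword to be a letter followed by letters or digits, with single
hyphens allowed between runs, then a colon.  RFC 822 itself permits a
wider character set in field names.

```python
import re

# Hypothetical pattern for illustration: optional indent, a tag made of
# alphanumeric runs joined by single hyphens, a colon, then either
# end-of-line or whitespace followed by an inline value.
KEYWORD_LINE = re.compile(
    r"^(?P<indent>[ \t]*)"
    r"(?P<tag>[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*):"
    r"(?:[ \t]+(?P<value>.*))?$"
)


def split_keyword(line):
    """Return (tag, value) if the line opens a keyword block, else None."""
    match = KEYWORD_LINE.match(line)
    if match is None:
        return None
    return match.group("tag"), (match.group("value") or "").strip()
```

Because the tag must start the line and may not contain spaces, text
such as "a - b: c" is not mistaken for a keyword, which is the point
about not having to worry about subtraction.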


For the sake of a coarse reprise of where I think we are:

Within docstrings, paragraphs, `text fields', descriptive and bulleted
lists are marked up using pretty much what gendoc used, though we seem
to be making some tweaks.  The main addition of David's proposal is a
structured data format entirely analogous to a *ML's begin-end
structure, but transformed to indent/dedent format - in exactly the same
way that one transforms the begin-end structure of C or Pascal into
python code.  This gets us all the desiderata that XML would provide,
but it does it in a pythonic format.
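
The indent/dedent to begin/end transformation can be sketched as
follows.  This is only an assumption about how such a parser might
look, not the proposal's actual grammar: the function name
parse_blocks, the root tag "docstring" and the "para" element for plain
text are all invented for the example, and lists and inline markup are
ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword syntax: identifier-like tag (hyphens allowed), colon.
KEYWORD = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def parse_blocks(text, root_tag="docstring"):
    """Turn indented keyword blocks into an ElementTree element."""
    root = ET.Element(root_tag)
    stack = [(-1, root)]       # (indent of the opening line, element)
    open_para = None           # paragraph currently collecting text lines
    for raw in text.expandtabs().splitlines():
        line = raw.strip()
        if not line:
            open_para = None   # a blank line ends the current paragraph
            continue
        indent = len(raw) - len(raw.lstrip())
        # A dedent closes every block opened at this depth or deeper.
        while stack[-1][0] >= indent:
            stack.pop()
            open_para = None
        parent = stack[-1][1]
        match = KEYWORD.match(line)
        if match:
            elem = ET.SubElement(parent, match.group(1))
            if match.group(2):
                elem.text = match.group(2)
            stack.append((indent, elem))
            open_para = None
        elif open_para is not None:
            open_para.text += "\n" + line
        else:
            open_para = ET.SubElement(parent, "para")
            open_para.text = line
    return root
```

Passing the result to ET.tostring(root, encoding="unicode") gives the
begin-end form of the same structure, so the Blah block shown earlier
comes out wrapped in Blah start and end tags.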

The typical block allows (depending on the keyword which introduced it)
an assortment of keywords to be used to introduce sub-blocks; it may
also allow paragraphs and/or lists within it.  The docstring is a block
which is willing to hold all `outer' structural groups (i.e. top-level
keywords, with their blocks, and paragraphs).  A paragraph is a block
which (possibly along with the blocks started by some keywords) may have
sub-blocks which are list items.

We can effectively write the rules for all this as a DTD and parse it
into a form which can be manipulated *as if* it had been obtained by
parsing a lump of XML - in particular, it should be trivial to perform
XSL-ish tree transformations to convert it to whatever DTD The Manual
wants as its input; while leaving ample scope for the inventive
toolwright to perform sophisticated information massaging on docstrings,
and not obliging us to use all that ugly XML taggery in the source.
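
One DTD-like rule discussed above, that a unique tag may not be repeated
as a child of the same parent, could be checked on the parsed tree
roughly like this.  The set UNIQUE_TAGS and the function
duplicate_children are hypothetical names for the sketch, and the tree
is assumed to be an xml.etree.ElementTree element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed for illustration: which tags count as unique among siblings.
UNIQUE_TAGS = frozenset({"Date", "Name"})


def duplicate_children(root, unique_tags=UNIQUE_TAGS):
    """Return (parent_tag, child_tag) pairs where a unique tag repeats
    among the direct children of a single parent."""
    problems = []
    for parent in root.iter():
        counts = Counter(child.tag for child in parent)
        for tag, count in counts.items():
            if tag in unique_tags and count > 1:
                problems.append((parent.tag, tag))
    return problems
```

Under this rule a Date in Release and a Date in each Report are all
acceptable, since no two of them share a parent; two Date entries inside
one Report would be reported.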


We need a moderately short list (of order a dozen) of `top-level' tags:
subordinate to each we may introduce a few others (context sensitivity)
but simplicity demands vocabulary restraint and re-use.  The top level
seems to run to:

In all docstrings:
   Author(s), Release, Contributors
   Example(s), Test-script, Code
   Warning

In docstrings of callables:
   Argument(s), Return, Raises

In docstrings of classes:
   Supports/Implements/Mimics... (one synonym)
   Subclassing (for folk using this class as a base - what to override)
   Attributes, Methods (each supporting Private and Public as subordinates)

In docstrings of modules:
   Contents

That makes 7 universally-applicable keywords, plus (up to) the rest of a
dozen in each of the specific contexts for docstrings.  I would reckon
we can keep to about another dozen keywords spread around as
subordinates of the above (Date, Private, Public, Expect (for
Test-script), Required & Optional (for arguments), ...).
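
For illustration, the top-level vocabulary could be held in a small
table keyed by the kind of object documented.  The spellings are
assumptions: singular forms are used throughout, and "Supports" stands
in for whichever synonym is eventually chosen.

```python
# Keywords allowed at the top level of every docstring.
UNIVERSAL = (
    "Author", "Release", "Contributors",
    "Example", "Test-script", "Code",
    "Warning",
)

# Extra top-level keywords by kind of documented object (hypothetical keys).
BY_KIND = {
    "callable": ("Argument", "Return", "Raises"),
    "class": ("Supports", "Subclassing", "Attributes", "Methods"),
    "module": ("Contents",),
}


def top_level_keywords(kind):
    """Return the top-level keywords permitted for this kind of docstring."""
    return UNIVERSAL + BY_KIND.get(kind, ())
```

A parser could consult top_level_keywords("class") and similar calls to
decide which keywords may open a block directly under the docstring.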


On test-scripts (in the manner of Tim Peters) we may not need a
Test-script keyword at all: simply using >>> is how the tool recognises
it, and there's nothing to stop the docstring parser recognising this as
a special indent mark that transforms to target XML *as if* it had come
from a block introduced by Test-script:.
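
A minimal sketch of that recognition step, assuming a run of lines that
starts with >>> and continues to the next blank line is to be treated as
one Test-script block.  The function name split_test_scripts and the
chunk labels are invented for the example.

```python
def split_test_scripts(text):
    """Split text into ("Test-script", lines) and ("text", lines) chunks.

    A chunk starts at a '>>>' line and runs to the next blank line, so it
    takes in continuation lines and expected output as well.
    """
    chunks = []
    current_kind, current = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">>>"):
            kind = "Test-script"
        elif not stripped:
            kind = None                  # a blank line closes any open chunk
        elif current_kind == "Test-script":
            kind = "Test-script"         # continuation or expected output
        else:
            kind = "text"
        if kind != current_kind and current:
            chunks.append((current_kind, current))
            current = []
        current_kind = kind
        if kind is not None:
            current.append(stripped)
    if current:
        chunks.append((current_kind, current))
    return chunks
```

Each "Test-script" chunk can then be emitted into the target tree just
as if it had been introduced by an explicit Test-script: keyword.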

	Eddy.