[Doc-SIG] Structuring rules

Edward D. Loper edloper@gradient.cis.upenn.edu
Sun, 08 Apr 2001 17:46:49 EDT


I've been working on designing a docstring markup language, based
loosely on ST and a few other sources, and I wanted to see what
people think of the structuring rules so far.  Note that these
rules only cover how to indicate the *structure* of a docstring,
not coloring (things like emph and inline code).

There are 9 structural blocks:
  - basic blocks: these blocks do not contain other blocks
    - paragraph: a paragraph of text.  Paragraphs are the only place
      where coloring (emph, etc.) can occur.
    - literal block: a block of unprocessed text, which will be 
      displayed as-is.
    - doctest block: a block containing python code, which can be
      used by doctest. 
    - heading: a single line of text, providing the heading for a 
      section.
  - hierarchical blocks: these blocks contain other blocks
    - list item: a single item of a list.  List items can contain
      paragraphs, literal blocks, doctest blocks, and lists.
    - list: a list.  Lists contain one or more list items.
    - section: a section or subsection of text.  Contains a heading
      followed by paragraphs, literal blocks, doctest blocks, lists,
      and sections.
    - field: a semantically tagged section of text.  It is used to
      describe specific aspects of an object, like the return value,
      a parameter to a function, or the authors of a module.  
      Contains paragraphs, literal blocks, doctest blocks, and lists.
    - top: the top level.  Contains paragraphs, literal blocks,
      doctest blocks, lists, sections, and fields.
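
To make the containment relationships concrete, here is a minimal
sketch that transcribes the list above into a lookup table.  The names
(BASIC, CONTAINS, may_contain) are my own, purely for illustration;
nothing here is part of the proposal itself.

```python
# Hypothetical names; the table is transcribed from the block list above.
BASIC = {"paragraph", "literal block", "doctest block", "heading"}

# Blocks that list items, fields, sections, and top all accept.
_COMMON = {"paragraph", "literal block", "doctest block", "list"}

CONTAINS = {
    "list": {"list item"},
    "list item": set(_COMMON),
    "section": {"heading", "section"} | _COMMON,
    "field": set(_COMMON),
    "top": {"section", "field"} | _COMMON,
}


def may_contain(parent, child):
    """Return True if a `parent` block may directly contain a `child` block."""
    # Basic blocks have no entry in CONTAINS, so they contain nothing.
    return child in CONTAINS.get(parent, set())
```

For example, may_contain("top", "field") is true, while
may_contain("list item", "section") is false.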

In case you're not familiar with these blocks, here's an example::

  This is a one-line paragraph.

  This is a multi-line paragraph.  Paragraphs are usually
  separated by blank lines.

      - This is a list.
      - Lists consist of list items.  List items may span
        multiple lines.

        List items may contain multiple paragraphs.

  Blocks
  ======

  That was a top-level heading.  Here's a subheading:

  Literal Blocks
  --------------

  Literal blocks are introduced with double-colons, like this::

       Literal /
              / Block

  They end on the first line whose indentation is equal to or less
  than the indentation of the paragraph that introduced them.

  Doctest Blocks
  --------------

  Doctest blocks start with '>>> '.  Here's a doctest block:

    >>> print(1+2)
    3

  author: This is a field.  This particular field should be used 
          to describe the author of the object documented.
  param x: Fields can take arguments.

So, the markup language I'm defining makes a fair amount of use
of the concept of "indentation."  Instead of defining it right now,
I'll just show it by example.  I'll worry about formalizing it later::

   This paragraph has an indentation of 3, since it is 
   preceded by 3 spaces.

      >>> # This doctest block has an indentation of 6
      >>> print("   even if some of its lines are indented more")
         even if some of its lines are indented more
      >>> # Indentation of a doctest block is the indentation of
      >>> # its first line.

   The following literal block has an indentation of 4::
       Literal Block!
   That's one plus the indentation of the paragraph that introduced it.

      - This list has an indentation of six
      - That's because each list item is preceded by 6
        spaces.  This list item has an indentation of 8.

        That's because each of its paragraphs has an indentation
        of 8.

   Heading
   =======

      That heading had an indentation of 3, since it was preceded
      by 3 spaces.

      This section has an indentation of 6, since each of its
      paragraphs has an indentation of 6.

   Heading2
   ========

   This section has an indentation of 3.

   author: Field indentation works just like list item indentation.
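
As a rough sketch of how the simple cases in the example could be
computed (the function names are hypothetical, and this assumes
indentation is counted in spaces only, with no tabs):

```python
def line_indentation(line):
    """Number of spaces preceding the first non-space character."""
    return len(line) - len(line.lstrip(" "))


def doctest_indentation(lines):
    """A doctest block's indentation is that of its first line."""
    return line_indentation(lines[0])


def literal_indentation(intro_line):
    """One plus the indentation of the paragraph that introduced the block."""
    return line_indentation(intro_line) + 1
```

This only handles paragraphs, doctest blocks, and literal blocks; list
items and fields are the hard cases.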

Now we can discuss what rules to put on indentation.  These rules can
be used when parsing to figure out where blocks start/end etc.  I
propose:
  - all paragraphs must be left-justified.  i.e., the indentation
    of each line in a paragraph must be the same.
  - the indentation of a paragraph must be equal to the indentation
    of the block that contains it.
  - the indentation of a list must be greater than or equal to the
    indentation of the block that contains it.  (I might consider
    changing this to strictly greater than.)
  - the indentation of a list item must be strictly greater than
    the indentation of the list.  In other words, the following type
    of list item is not allowed::

        - a list item where the indentation of the paragraph is
        equal to the indentation of the list.

  - the indentation of a field must be strictly greater than the
    indentation of the block that contains it.  Thus, the following
    is not allowed::

        field: a field where the indentation of the field is
        equal to the indentation of the block that contains it.

  - the indentation of a section must be greater than or equal to
    the indentation of the block that contains it.
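
A minimal sketch of how a parser might check these rules, assuming it
already knows each block's kind and indentation.  RULES and
indentation_ok are hypothetical names; literal blocks and doctest
blocks are left out because the rules above don't constrain them.

```python
import operator

# For each kind of block: how its indentation must compare with the
# indentation of the block that contains it (for a list item, the list).
RULES = {
    "paragraph": operator.eq,   # equal to the containing block
    "list": operator.ge,        # greater than or equal
    "list item": operator.gt,   # strictly greater than the list
    "field": operator.gt,       # strictly greater
    "section": operator.ge,     # greater than or equal
}


def indentation_ok(kind, indent, container_indent):
    """Return True if `indent` is legal for a `kind` block in its container."""
    return RULES[kind](indent, container_indent)
```

For instance, indentation_ok("list item", 6, 6) is false, which
matches the disallowed list item shown above.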

But this leaves open the question of how to figure out the indentation 
of certain entities, such as:
  - a paragraph starting on the first line of a docstring
  - list-items with one-line paragraphs
  - list-items with one-line paragraphs followed only by sublists
  - fields with one-line paragraphs
  - fields with one-line paragraphs followed only by sublists

For now, I'll set aside the issue of dealing with the first line of a
docstring.  I see two basic options for dealing with the rest of the 
issues.
  1. The indentation of a list item is the number of characters
     before the first non-space character following the bullet.
     Thus, the following would be an invalid list item::

        - This is a list item, where the number of characters before
             the first non-space non-bullet character on the first 
             line doesn't match the indentation of the subsequent 
             lines.

     But you could say, for example::

         - List item
             - sublist item 1
             - sublist item 2
           Another paragraph in the top-level list-item.  Note that
           its indentation matches "List item"'s indentation.

     You could also say something like::

         1.  A list item that
             spans multiple lines
         [...]
         10. Another list item.  Note that the use of an extra space
             in (1) makes this line up prettily.

  2. The indentation of a list item is indeterminate unless there
     is a paragraph that constrains it.  Thus, for example, you 
     could say::

         - This is a multiline list
             item.

         - List item
               - sublist item 1
               - sublist item 2
             Another paragraph in the top list-item.
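
Approach (1) is easy to state in code.  Here is a rough sketch; the
bullet syntax is an assumption on my part ("-" or a number followed by
"."), and the function names are hypothetical:

```python
import re

# Assumed bullet forms: "-" or digits followed by ".".
_BULLET = re.compile(r"^ *(?:-|\d+\.) +(?=\S)")


def list_item_indentation(first_line):
    """Characters before the first non-space character after the bullet."""
    match = _BULLET.match(first_line)
    if match is None:
        raise ValueError("not a list item: %r" % first_line)
    return match.end()


def first_paragraph_is_valid(lines):
    """Check that the lines of a list item's first paragraph line up.

    `lines` holds only the first paragraph, starting with the bullet line.
    """
    indent = list_item_indentation(lines[0])
    return all(len(line) - len(line.lstrip(" ")) == indent
               for line in lines[1:])
```

Under this sketch, "1.  A list item that" and "10. Another list item."
both get an indentation of 4, which is why the extra space after "1."
makes them line up.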

I see 2 main problems with approach (1):
  - it doesn't work well if you try to use a non-monospaced font for
    docstrings, since it's hard to tell if it's "lined up."
  - it may not be convenient for labels::

        param x: you have to line up with
                 the first line, like this.

    You can't write this::

        param x: a multiline description
            of parameter x
        return: a multiline description of 
            the return value

I see 1 main problem with approach (2):
    - if a list item contains a one-line paragraph, then the 
      list item's indentation is indeterminate, so you can't
      figure out the indentation of a child literal block.  E.g.::

          - list item::

                What's the indentation of this literal block??
      
            Is this another paragraph in the list item, or part of
            the literal block?

Thoughts/comments?

-Edward

P.S.  Requiring paragraphs to be left-justified, and requiring
lists to be indented, gets rid of the problem of accidentally
word-wrapping a sentence that ends in "1." so that it looks like
a list item.  There's still a minor problem if we go with
approach (2), since you can't tell if the second line is a list
item or a continuation of the first line in::

    - a list item with a sentence that ends in 
      1. That's not easy for humans to parse,
      either, though. :)