[Doc-SIG] Structured Text

Tony J Ibbs (Tibs) tony@lsl.co.uk
Tue, 6 Mar 2001 10:05:01 -0000


Edward D. Loper came up with some useful questions about StructuredText,
which are his and are delimited by ">"...

> I've been going over the definitions of structured text (and its
> various flavors), trying to see if I can formalize it even more than
> Tibs did (http://homepage.ntlworld.com/tibsnjoan/STNG-format.html and
> http://homepage.ntlworld.com/tibsnjoan/docutils/STpy.html)...  And a
> number of questions came up.  I'm not sure if this is the correct
> forum for such questions..

So far as I'm concerned, it is.

(Historically, actual implementation of a tool used to founder on such
discussion, because the discussion happened *before* implementation.
However, since the implementation is now proceeding, I regard questions
as a good thing. Also, in practice, certain points only become clear
when one has to implement them!)

BTW - if you can, look back in the list archive for comments by David
Goodger, dated around November last year. They're on my "to comment on"
list (actually, they're in the bundled TGZ or ZIP files of the docutils
distro as well, now I think of it), and (from memory) had some
interesting points to make.

> 1. Does every string value have an interpretation as a Structured
>    Text?  That seems to be the case.  If so, is that a Good Thing?
>    As an example of a string that we might not want to give a value,
>    consider:
>    ||    indent level 0
>    ||
>    ||            indent level 1
>    ||
>    ||                    indent level 2
>    ||
>    ||        indent level ??
>
>    I'd really prefer not to have cases like this have "undefined
>    semantics."  It seems like we either need to specify what they
>    mean, or say that they're illegal.

As with many things about ST, the answer is "that depends" - it's really
partly an implementation issue, conditioned by the required audience. In
the case of Zope's use of ST, HTML pages have to be generated on the fly
from ST, and thus it is not acceptable to have unrecoverable errors
which prevent that (it is better to produce a possibly-slightly-wrong
rendering). In the case of something like pydoc, this is probably true
as well. In the case of a developer "testing" their docstrings, they
want to *know* about errors.

So. In all cases the example given is illegal, and the result is
formally undefined (both 'cos it's not written down anywhere, and 'cos
it *should* be undefined). However, STClassic (the "original" Zope tool,
also used by HappyDoc, for instance) will make a "best guess" about what
is meant - I can't remember any details (I only skimmed that code), but
I don't think it went to *too* great lengths. Currently, my tool will
flag an error and either give up or ignore the text (I think the
former) - this is clearly unacceptable in many cases, and is on the
"fix" list. My feeling is that the user should be able to select
"pedantic" mode, in which it complains and gives up, but in non-pedantic
mode (the default) the "nearest" (in some sense) correct indentation
should be selected.
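
As a rough sketch of the pedantic/non-pedantic idea (this is *not* the
docutils code - the function name, the list of open levels and the
tie-breaking rule are all assumptions made for illustration):

```python
def resolve_indent(indent, open_levels, pedantic=False):
    """Decide which indentation a paragraph should be treated as having.

    open_levels lists the indentations currently in use, outermost
    first, e.g. [4, 12, 20] for the example in question 1.
    """
    if indent in open_levels or indent > open_levels[-1]:
        # Either an existing level, or a new (deeper) one: both legal.
        return indent
    if pedantic:
        # Pedantic mode: complain and give up.
        raise ValueError("illegal indentation %d (known levels: %r)"
                         % (indent, open_levels))
    # Non-pedantic mode: pick the nearest known level. A tie goes to
    # the shallower level (an arbitrary choice for this sketch).
    return min(open_levels, key=lambda level: (abs(level - indent), level))


# The "indent level ??" paragraph sits at column 8, between 4 and 12.
print(resolve_indent(8, [4, 12, 20]))   # 4
```

What "nearest" should really mean is exactly the sort of detail that
would need deciding; the tie-break above is only one possibility.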

On the other hand, I don't actually believe that *every* string has an
ST representation - just that this is not the example (I don't off-hand
*have* an example, just a gut feeling).

Hmm - there are, though, surprising things one cannot represent as ST -
here's an awkward case::

	1. This is a list item::

	    This is some literal text (see the '::' above)

      I can't make this paragraph a "child" of the list item.

My intent is to make the third para a child of the first (so it won't
terminate the list). But I can't do that: it can either be indented so
that it becomes part of the literal text (not what I want), or it can be
indented as it is. This is so regardless of how one chooses to end the
literal text. If literal text ends "when a paragraph at the parent
paragraph's indentation is found", then the third paragraph *has* to be
at that indentation to be "seen" as a separate paragraph. If it ends
"when a paragraph with indentation less than the first line of literal
text is found", then the third paragraph *could* be indented more, but
it would be illegal by the indentation rules.

There *is* a way round it, I've just realised, which *may* not work yet
in my code, but will do so eventually:

	1. This is a list item:

	   ::

	      This is some literal text (see the '::' above,
	      which will be optimised away/become invisible)

	   And now this paragraph *is* a child of the list item.

(by following the ST rules blindly, a paragraph containing just '::'
should be legal - one then just has to decide what it *means* and how it
is rendered. I choose to take it as invisible - but I can't remember if
I've implemented it yet - currently it probably renders as a single
':'). It's a hack, but it *does* follow the rules.
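
A minimal sketch of that reading of '::', assuming the usual rule that a
trailing '::' is shown as a single ':' and announces literal text (the
function name and return shape are made up for this example, and are not
taken from any of the implementations):

```python
def literal_marker(paragraph):
    """Return (visible_text, next_is_literal) for one paragraph.

    visible_text is None when the paragraph should not be rendered.
    """
    text = paragraph.rstrip()
    if not text.endswith("::"):
        return text, False
    if text.strip() == "::":
        # A paragraph that is *only* '::' is treated as invisible,
        # but still announces the literal text that follows.
        return None, True
    # Otherwise the trailing '::' collapses to a single visible ':'.
    return text[:-1], True


print(literal_marker("1. This is a list item::"))
print(literal_marker("   ::"))
```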

(((hmm - time to refine terms.

ST == StructuredText - I use this as a generic term for the "family"
STClassic == StructuredText as implemented by Zope, and used by, for
example, HappyDoc
STNG == the new version of ST that Zope has been working on - I don't
know its current status, as I haven't been following it recently.
STpy == ST with extensions from STNG, added "Python" extensions, and
possibly other extensions as well (as documented in the STpy.html
document)

There is a widely available module to implement STClassic. STNG was
being worked on, and is available through CVS via the Zope website. STpy
is currently partially implemented by my code.)))

> 2. If it is true that every string value has an interpretation as a
>    Structured Text, does it make sense to officially "discourage"
>    certain types of strings, such as the example listed above?  It
>    might also make sense to discourage strings like:
>    ||    this
>    ||      is
>    ||  one messed up
>    || paragraph

Again, it depends if one is a document user or producer. For a document
*producer*, we should discourage (1) aggressively. For a user, we can't.

As to (2), I can't speak for the Zope codebase, but in mine it is only
the first line of a (non-literal) paragraph that counts, and so the
messed up paragraph will render predictably (and I choose to think that
is the sensible choice).

Note that it is a Good Thing to make only the first line count - it
means that people who prefer paragraph styles like::

	  I start my paragraphs with an
	indentation 'cos my teachers told
	me to...

can get away with it (well, they're mad, but they can get away with it).
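
The "only the first line counts" rule is easy to state in code. This is
a sketch under assumptions (the function name is hypothetical, and tabs
are assumed to expand to 8 columns), not a copy of what either codebase
does:

```python
def paragraph_indent(paragraph):
    """Indentation of a paragraph, judged by its first line only."""
    lines = paragraph.splitlines()
    if not lines:
        return 0
    # Assumption: tabs expand to multiples of 8 columns.
    first = lines[0].expandtabs(8)
    return len(first) - len(first.lstrip())


messed_up = "    this\n      is\n  one messed up\n paragraph"
print(paragraph_indent(messed_up))   # 4
```

Later lines can wander about as much as they like without changing the
answer.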

> 3. Which types of "code coloring" (emph, inline, etc.) can "wrap" over
>    lines, and which can't?  E.g., can I have an *emph statement that
>    continues to the next line?*

In STClassic, the '..' markup (literal) cannot contain line boundaries.
Emphasis and strong can. Descriptive list item titles can't.

In my code (at the moment) in non-literal paragraphs all line boundaries
have disappeared before the colourising code gets at the text, so all
markup can span line boundaries (descriptive list items still can't, but
that's a different issue). It was trivial to choose either option in my
code, and I prefer the ability to ignore line ends - I like to use (more
or less) ST in my emails, but I'm using Outlook <fx:spit> which doesn't
make it easy to tell where it will wrap lines, which makes it hard to
use STClassic type literals. Note that a newline-plus-leading-whitespace
is exactly equivalent to a single space.
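
To illustrate (a sketch only - the helper name and the emphasis pattern
are assumptions, much simpler than any real markup RE), removing line
boundaries first means an ordinary single-line pattern will happily
match markup that was wrapped:

```python
import re


def join_lines(paragraph):
    """Replace each newline-plus-leading-whitespace with one space."""
    return re.sub(r"\n[ \t]*", " ", paragraph)


# Hypothetical, deliberately simple emphasis pattern.
RE_EMPHASIS = re.compile(r"\*([^*]+)\*")

wrapped = "can I have an *emph statement that\n   continues to the next line?*"
joined = join_lines(wrapped)
match = RE_EMPHASIS.search(joined)
print(match.group(1))
```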

> 4. Is there any official precedence ordering on the different
> types of "code coloring?"

Yes. It is determined by the innards of the software. In my code, there
is a list that the user can get access to and modify, which can change
the order of markup (actually, the order of markup and also the RE used
to detect each markup, and the tags generated therefrom). But I wouldn't
expect this to be documented as such for ST itself - it's an
implementation detail of a particular tool.
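
The general shape of such a list might look like the following. This is
a hypothetical sketch: the names, the (very naive) patterns and the
HTML-ish tags are invented for the example and are not the list in my
code.

```python
import re

# Each entry is (name, RE, output tag). The order of the list *is* the
# precedence, and a user could reorder or replace entries.
MARKUP = [
    ("literal",  re.compile(r"'([^']+)'"),       "code"),
    ("strong",   re.compile(r"\*\*([^*]+)\*\*"), "strong"),
    ("emphasis", re.compile(r"\*([^*]+)\*"),     "em"),
]


def colourise(text, markup=MARKUP):
    """Apply each markup pattern in turn, in list order."""
    for _name, pattern, tag in markup:
        text = pattern.sub(r"<%s>\1</%s>" % (tag, tag), text)
    return text


print(colourise("**bold** and *em*"))
# <strong>bold</strong> and <em>em</em>
```

Putting "emphasis" before "strong" gives a different (and wrong-looking)
result for the same text, which is why the order has to be pinned down
somewhere, even if only inside the tool.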

> Any rules about what
> types of code coloring can be contained in what other types?

In a system using REs to recognise markup it's, well, tricky to nest
markup. I don't *think* STClassic does. My code deliberately refrains
from doing it at the moment - it's something I intend to implement
eventually, but since it's fiddly and non-essential I'm leaving it for
now (*how* to do it is fairly obvious, but it's fiddly and the sort of
thing I have to draw a lot of pictures for, so I'd rather do more
important things first, even though I myself *like* to nest markup).

> 5. Does structural formatting or code coloring take precedence?  For
>    example, if a paragraph starts with "* foo *," will it be a normal
>    paragraph with an emphasized first element, or a list item?  (It'll
>    be much easier for me to write formal rules if structure takes
>    precedence. ;) )

Hah - the devil is in the details. Which have never been formally
documented. So the answer is, it depends.

In STClassic, I can't remember, so you'll have to experiment.

In my code, list item recognition is done in two places - first at
paragraph splitting time, and secondly when the nodes are "combed" for
various purposes. The same REs are used both times. Colourisation is
done later (when the doctree is already in shape).

Thus, a paragraph starting "* spam *" will be recognised as a list item,
and the initial "*" will be removed well before colourising happens
(mind, I'm saying that, but I haven't tried it!).
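
A toy version of that ordering, to show why the bullet wins (both REs
and the function name are assumptions for illustration; the real ones
are more involved):

```python
import re

# Hypothetical patterns: a bullet is '*', 'o' or '-' followed by space.
RE_BULLET = re.compile(r"^\s*[*o-]\s+(.*)", re.DOTALL)
RE_EMPHASIS = re.compile(r"\*([^*\s][^*]*)\*")


def parse_paragraph(text):
    """Recognise structure first, colourise what is left afterwards."""
    match = RE_BULLET.match(text)
    if match:
        kind, body = "uitem", match.group(1)   # the bullet is removed here
    else:
        kind, body = "para", text
    return kind, RE_EMPHASIS.sub(r"<em>\1</em>", body)


print(parse_paragraph("* spam *"))        # ('uitem', 'spam *')
print(parse_paragraph("*spam* is nice"))  # ('para', '<em>spam</em> is nice')
```

By the time the emphasis pattern runs, the leading "*" has already gone,
so there is nothing left for it to pair up with.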

(Hmm - like much of ST, this just tends to "fall out" of the design,
which is one of the reasons I have immense respect for the people who
came up with the original ideas - it manages to retain great economy of
markup use, whilst generally "doing the right thing" in most normal
situations)

> 6. Among the list types, which take precedence?  For example, if a
>    paragraph starts with "1. foo -- bar", is it an ordered list item
>    or a descriptive list item?

Depends on the implementation. I can't offhand remember (although it's
another thing a user of the module can fiddle with in my version). The
order *does* matter - OK, I've looked it up, and the relevant code is:

    # Note that we want the descriptive list type to come
    # before the others, since a title might be a valid
    # bullet or sequence
    RE_list = [(RE_DESCRIPTIVE,"ditem",1),
               (RE_UNORDERED,  "uitem",0),
               (RE_ORDERED,    "oitem",0)]

I'd stand by that - to make sense, descriptive comes first, and the
order of the other two doesn't matter. So your example should be
descriptive (title == "1. foo", text == "bar").
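
To see the ordering at work, here is a self-contained sketch. The three
patterns are stand-ins invented for the example (the real REs are not
shown in this message), the third element of each tuple is dropped
because its meaning isn't given here, and classify() is a hypothetical
helper:

```python
import re

# Stand-in patterns, assumed for illustration only.
RE_DESCRIPTIVE = re.compile(r"^\s*(?P<title>[^\n]+?) -- (?P<text>.*)", re.DOTALL)
RE_UNORDERED = re.compile(r"^\s*[*o-]\s+(?P<text>.*)", re.DOTALL)
RE_ORDERED = re.compile(r"^\s*\d+\.?\s+(?P<text>.*)", re.DOTALL)

RE_list = [(RE_DESCRIPTIVE, "ditem"),
           (RE_UNORDERED,   "uitem"),
           (RE_ORDERED,     "oitem")]


def classify(paragraph, re_list=RE_list):
    """Return (kind, parts) for the first pattern that matches."""
    for pattern, kind in re_list:
        match = pattern.match(paragraph)
        if match:
            return kind, match.groupdict()
    return "para", {"text": paragraph}


print(classify("1. foo -- bar"))
# ('ditem', {'title': '1. foo', 'text': 'bar'})
```

Moving RE_ORDERED to the front of the list would turn the same paragraph
into an "oitem" with text "foo -- bar".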

Do beware of the " -- " trap for descriptive list items - one wants to
be able to do::

	A list of delimiters:
	   'o'    -- starts an unordered list item
	   ' -- ' -- separates the parts of a descriptive
	             list item

and have it produce the "obvious" results (I keep hoping I've got that
bit right!)
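
One way to dodge the trap, sketched under assumptions (this is not what
my code does - the pattern and function name are invented here), is to
let the title swallow whole quoted strings so that a ' -- ' inside
quotes is never taken as the separator:

```python
import re

# The title is built from complete '...' chunks or single non-quote
# characters, so the separator can only be found outside quotes.
RE_DITEM = re.compile(
    r"^\s*(?P<title>(?:'[^']*'|[^'\n])+?)\s+--\s+(?P<text>.*)", re.DOTALL)


def split_descriptive(line):
    """Return (title, text), or None if there is no separator."""
    match = RE_DITEM.match(line)
    if match is None:
        return None
    return match.group("title"), match.group("text")


print(split_descriptive("   'o'    -- starts an unordered list item"))
print(split_descriptive("   ' -- ' -- separates the parts"))
# A naive split gets the second one wrong:
print("' -- ' -- separates the parts".split(" -- ", 1))
```

The obvious weakness is a title with an unbalanced apostrophe (as in
"don't"), which this pattern simply refuses to match.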

> 7. What is meant by saying that SGML text passes through?  SGML isn't
>    even a mark-up language, so I assume that the intent is something
>    like "XML and HTML text passes through."  But does that mean that
>    in an expression like '<TAG>a*b*</TAG>', the '*'s will be ignored?
>    That seems unreasonably difficult to implement.  What about an
>    expression like '<T a="*x*"/>'?  Does this mean I can't say things
>    like if 'x<y *and* y>z'?  Is there strong support for the
>    notion of letting "SGML" text pass through, or is it something that
>    might be dropped?  (I would certainly vote for dropping it. :) )

In STClassic, there was an embedded assumption that HTML was being
produced (since it was designed mainly for support of Zope web pages).
Thus what was meant was that anything that looked vaguely HTML-like
would be passed through untouched (so I think it just triggered on the <
and >, but I wouldn't swear to it).

In my code, and STpy, this assumption is thrown away - first of all
because one cannot assume that HTML (or some other SGML child) is the
target, and secondly because experience with LaTeX and other markup
languages shows that allowing this sort of latitude to document writers
is a Bad Thing.

Of course, this does mean that one has to start to consider things like
line breaks that STClassic would handle with an explicit '<br>' - but
that can wait...

> My eventual goal, to the extent that it's possible, is to write out a
> complete formal specification for StructuredText using something
> similar to BNF (Backus Naur Form).  (I'm pretty sure that vanilla BNF
> is not powerful enough to capture StructuredText.)

That would be good. If it helps, I should be working on a DTD for STpy
in the next week or so, which should relate, and will define some names
for things...

And, of course, given a BNF, presumably we can get someone else to write
another parser (I am resolutely sticking with REs for docutils because
(a) they're always present with Python and (b) they provide Unicode
support, but a "proper" parser driven approach would be a Nice Tool to
have out there - of course, given my druthers, I'd be using mxTextTools,
which is a Very Nice Tool, but that's *definitely* not going out in the
standard Python package!)

>  After I've done
> that, I'll start working on getting Emacs to colorize StructuredText
> strings.

Yeh! Now that would be a truly good thing to do. But not as important
as:

>  I'd also like to create a sort of test-suite set of strings
> to test how different implementations function on different
> "ambiguously defined" cases..

Oh, yes please - that's even more useful - stress cases are quite
difficult to come up with. Can I suggest searching through the docutils
sources for occurrences of the string "hmm" (case insensitive), since
that will show up some of my own comments on oddities of the form?
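
If grep isn't to hand, something like this would do the search. It is a
small sketch: the function name is made up, and looking only at '.py'
files is an assumption about where the comments live.

```python
import os
import re

RE_HMM = re.compile(r"hmm", re.IGNORECASE)


def find_hmm(root):
    """Return (path, line number, line) for each line mentioning 'hmm'."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):   # assumption: Python sources only
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, 1):
                    if RE_HMM.search(line):
                        hits.append((path, lineno, line.rstrip()))
    return hits
```

For example, find_hmm(".") run from the top of an unpacked source tree
lists every such comment with its file and line number.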

(Also big on my list of future needs is a "howto" document that has a
section on heffalump traps - things that are difficult to do in ST, and
wayward things that produce unexpected results. Your list of test cases
would also be a Big Help in such a thing.)

Tibs

--
Tony J Ibbs (Tibs)      http://www.tibsnjoan.co.uk/
"How fleeting are all human passions compared with the massive
continuity of ducks." - Dorothy L. Sayers, "Gaudy Night"
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)