[Doc-SIG] formalizing StructuredText

Mon, 19 Mar 2001 10:27:57 -0000

Edward D. Loper wrote:
> Yeah, I've been playing a bit fast and loose with terminology in my
> emails.. :)  Speaking of terminology, I want to make sure that we're
> using somewhat consistent terminology.  In particular, I think my
> use of the following terms may not coincide with what you call
> things.  What are your terms for the following?
>
>     * inline = region marked with #hashes#.

Python literal string. And something with 'quotes' is a literal string.

>     * paragraph = a text paragraph; not a list item or a heading or
>       a label

Paragraph (not distinguished normally from the other sorts, which *also*
have special names). If I had to distinguish this, I'd probably call it
a "paragraph with a blank line before it" (remember, that *might*
include the other sorts of thing, too).

>     * basic block = paragraph or list item or heading or label (or
>       table?)

Paragraph (see above)

>     * blank line = (S* NL) | (S* EOS)

blank line

>     * literal block = region following a '::'.

literal paragraph (which is a *bit* misleading, as it can include blank
lines)

and a single (non-literal) paragraph starting with '>>>' is a Python
paragraph.

>     * invalid string = string that is not given a meaning by an ST
>       variant.  (in the terms used by the STminus proposal, strings
>       that are not assigned a structure by a language).

I don't have a term for that, because docutils doesn't work like that. I
*have* started to generate paragraphs that have a "badpara" tag, so it
would be a "badpara" element (I'm following the old-fashioned ST
approach of trying to mark up what the user said and assuming they meant
it, whilst you're trying to do the formal approach - this does leave a
gap in talking).

> Tibs continued:
> > When I am talking, I have some assumptions (which, of course, may
> > not be evident):
> >
> > 1. by the time discourse occurs, all tabs have gone away
>
> Agreed.  We should probably also discard/transform any whitespace
> that isn't space or newline (e.g., form feed, carriage return).

Agreed, but something I've ignored for now (unless my code does it
without my looking - doubtful).

> > 2. blank lines are blank lines - white space in them is ignored and
> >    thrown away (lost for good)
>
> Is this true in literal blocks?

Yes - by the "trailing whitespace is removed" rule.

> Also, I'm guessing you collapse multiple consecutive blank lines
> into one.

Yes, but they get un-collapsed again within literal paragraphs (that's
quite important, and a major deficiency in STNG, if it's still not
done).

(this does not, of course, happen for *Python* literal paragraphs, as
they are defined to end at the first blank line - indeed, that (or end
of string) is *all* that ends them.)
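
A minimal sketch of the paragraph splitting described above, assuming the
input is already a list of lines. The names split_paragraphs and
is_python_paragraph are hypothetical and are not taken from docutils. It
shows runs of blank lines acting as a single separator, and a '>>>'
paragraph ending at the first blank line like any other:

```python
def split_paragraphs(lines):
    """Group lines into paragraphs; any run of blank lines is one separator."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            # First blank line after some text closes the paragraph;
            # further blank lines in the same run are simply skipped.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)  # end of string also ends a paragraph
    return paragraphs


def is_python_paragraph(paragraph):
    """A (non-literal) paragraph starting with '>>>' is a Python paragraph."""
    return paragraph[0].lstrip().startswith(">>>")
```

Un-collapsing blank lines inside a '::' literal paragraph would need the
original line positions, which this sketch deliberately does not keep.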

> > 3. trailing whitespace is thrown away
>
> Trailing whitespace for the string as a whole?  For each basic
> block?  For each line?  Is this true in literal blocks?

For each line. True in all places (you can't, in general, see them, so
there we go).

For literal blocks, newlines are preserved, but I can't see any obvious
point in preserving trailing spaces.
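
Assumptions 1 to 3 can be sketched as one small pass over the text. The
function name normalise_lines and the default tab size of 8 are
assumptions made for illustration, not something docutils defines:

```python
def normalise_lines(text, tabsize=8):
    """Expand tabs, then drop trailing whitespace from every line."""
    lines = []
    for line in text.expandtabs(tabsize).split("\n"):
        # rstrip() removes trailing spaces (and stray carriage returns or
        # form feeds), so a whitespace-only line becomes a truly blank line.
        lines.append(line.rstrip())
    return lines
```

Because the stripping is per line, newlines inside a literal block
survive while trailing spaces do not.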

> > 4. literal paragraphs retain leading whitespace following "the
> >    rules" (which say they are actually indented relative to the
> >    preceding non-literal paragraph - this makes much more sense
> >    in ST than "with respect to the left margin").
>
> Agreed.  Although how do you put something at zero indentation?
> Maybe indent from 1 space over from the preceding paragraph?

You don't. I've never wanted to (my problems with HTML normally come
from trying to do the opposite).
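
A minimal sketch of "indented relative to the preceding paragraph",
assuming the literal paragraph arrives as a list of lines and that
parent_indent is the indentation of the paragraph that introduced it.
The name literal_lines is hypothetical:

```python
def literal_lines(lines, parent_indent):
    """Re-indent a literal paragraph relative to its parent paragraph."""
    out = []
    for line in lines:
        # Remove at most parent_indent leading spaces, never real content.
        leading = len(line) - len(line.lstrip(" "))
        out.append(line[min(parent_indent, leading):])
    return out
```

Since a literal paragraph has to be indented further than its parent, at
least one leading space is always left over, which is why zero
indentation cannot be expressed.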

> So we won't use the term whitespace.  Instead, we'll use the terms
> space, newline, and blank line.

Good by me - it also requires one to say "space or newline" when that is
what one means.

> > Clearly for a string literal that does not contain a newline,
> > spaces are to be transcribed to spaces (probably - flag a rendering
> > issue as to whether they're *hard* spaces (the correct number) or
> > *unbreakable* spaces (the correct number AND no newlines)).
>
> I vote for unbreakable, but it may be possible to persuade me.

Given I've now forbidden newlines in (both types of) string literal
again, I'd also go for unbreakable (my HTML output doesn't implement
that, but who cares, it's only a testbed, and could be fixed later on).
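
For what the "unbreakable" choice might look like in HTML output, here is
a sketch only. The function name render_literal and the use of a <code>
element are assumptions, not what the testbed currently emits:

```python
import html


def render_literal(text):
    """Render a string literal so its spaces are kept and never wrapped."""
    if "\n" in text:
        raise ValueError("newlines are not allowed in string literals")
    # Escape HTML metacharacters first, then make every space unbreakable.
    return "<code>" + html.escape(text).replace(" ", "&nbsp;") + "</code>"
```

Every space becomes a non-breaking space, so the count is preserved and a
browser will not break the literal across lines.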

> > Equally clearly, if one does not allow newlines in string literals,
> > that's the end of the matter. We've done our job.
>
> Which is what I vote for. :)

And I now agree. My position wasn't strong enough to stand against
nay-saying, I felt.

> > >     "Here the *name* 'contains' markup":url
>
> Hm.. I'm confused.  So you would get::
>
>   <a href="url">Here the *name* 'contains' markup</a>
>
> ?  Or::
>
>   "Here the <I>name</I> <CODE>contains</CODE> markup":url

Well, personally I'd never emit <I>..</I> - but I had missed the
preceding '::', so was answering on a different assumption, I think...

At the moment, markup is not nested. At the moment, literals are scanned
for first. So at the moment, a URL text containing a literal string will
not be identified as such. In the future, I aim for markup to nest, and
then your example would be legal, and do the "right thing".

>  One other case to consider is::
>
>   *"I would prefer this":url* to "*this*":url

With no markup nesting, and with the current ordering, the first one is
emphasised, so not a URL usage, and the second one is a URL, and so not
emphasised.
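
The outcome just described can be reproduced with a single left-to-right
scan in which whichever markup starts first claims its text and nothing
nests. This is a sketch under that assumption; the pattern and the names
TOKEN and scan are hypothetical and are not the docutils implementation:

```python
import re

# Two alternatives, tried at each position from left to right.
TOKEN = re.compile(r'''
      \*(?P<emph>[^*]+)\*              # *emphasis*
    | "(?P<name>[^"]+)":(?P<url>\S+)   # "name":url
''', re.VERBOSE)


def scan(text):
    """Return (kind, content) pairs for the non-nested markup in text."""
    found = []
    for match in TOKEN.finditer(text):
        if match.group("emph") is not None:
            found.append(("emphasis", match.group("emph")))
        else:
            found.append(("url", match.group("name")))
    return found
```

Applied to the example, the first item comes out as emphasis whose
content still contains the untouched '"..":url' text, and the second as
a URL whose name still contains the untouched '*this*'.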

Of course, if we're aiming at 2.2, then it is quite possible nested
markup might be available by then - I'm just not prepared to "waste"
time on it now...

> > >     "This name spans multiple
> > >     lines":url
> >

Revised answer - that's definitely allowed, as newlines are explicitly
allowed in the quoted part of a URL definition. Why? Because it's not
harmful, it's a bit surprising if they're not (since they're allowed in
*other* ".." situations), and I prefer it that way (erm...).

> Still seems to me that names should be able to span newlines, though.

So I think we're agreeing.

> > >     "the following is not a url":<hi>

That's right. In this instance.

> Yes, but do we get an error because we used '":' in a silly context
> (if we're asking the parser to tell us about errors)?

In docutils (STminus is another kettle of fish) I can't see error
detection (apart from paragraph indentation and paragraph label
detection) being anything other than a bunch of heuristics, almost
certainly one or more REs, that point out *possible* problems to a user
wanting validation. So it becomes a matter of identifying the set of
patterns we want to warn about.
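
A sketch of what such a heuristic table might look like. The single
pattern, its message and the names SUSPECT and possible_problems are
illustrative assumptions, not an agreed list:

```python
import re

NOT_URL = "'\":' is not followed by anything that looks like a URL"

# Hypothetical table of (compiled pattern, warning message) pairs.
SUSPECT = [
    (re.compile(r'":(?![A-Za-z])'), NOT_URL),
]


def possible_problems(text):
    """Return (line number, message) pairs for suspicious-looking lines."""
    warnings = []
    for lineno, line in enumerate(text.split("\n"), 1):
        for pattern, message in SUSPECT:
            if pattern.search(line):
                warnings.append((lineno, message))
    return warnings
```

Nothing here rejects the text; it only reports lines a user asking for
validation might want to look at again.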

> > >     Do *quotes "have to* nest" properly with coloring?
>
> But from the point of view of formalizing things, I have two
> choices here:
>    1. say that it contains a bold region, and the quotes are just
>       rendered as quotes
>    2. say that it's undefined (i.e., an invalid string).

Undefined isn't invalid - it's undefined. At least to me, even in a
formal context, that's true (i.e., not "I don't know" but "I shan't
decide").

On the other hand, once I'm sure I've got the order of
markup/colourising correct, I'll be happy to regard it as so, and then
you could "freeze" it. But is that a good approach?

Tibs

--
Tony J Ibbs (Tibs)      http://www.tibsnjoan.co.uk/
Give a pedant an inch and they'll take 25.4mm
(once they've established you're talking a post-1959 inch, of course)
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)