[DOC-SIG] Comparing SGML DTDs

Paul Prescod papresco@technologist.com
Wed, 12 Nov 1997 13:08:31 -0500


Guido van Rossum wrote:
> I think that SGML is not fit to be typed by humans 

Hundreds of thousands of HTML page authors would be surprised to hear
you say that!

> (SGML was designed to be typed by humans in the age of punched cards.)  

SGML was standardized in *1986*. This is around the same time Sun
Microsystems was becoming a Very Large Company and Bjarne Stroustrup was
starting to flog C++. GML, the older precursor to SGML, used a totally
different syntax and even then, I don't think anyone ever used punched
cards with it. Mainframe terminals, yes. Punched cards, no.

Furthermore, SGML was very carefully designed to be typeable:

"The markup rigorously expresses the hierarchy by identifying the
beginning and end of each element in classical left list order. No
additional information is needed to interpret the structure, and it would
be possible to implement support by the simple scheme of macro
invocation discussed earlier. The price of this simplicity, though, is
that an end-tag must be present for every element.

The price would be totally unacceptable had the user to enter all of the
tags himself. He knows that the start of a paragraph, for example,
terminates the previous one, so he would be reluctant to go to the
trouble and expense of entering an explicit end-tag for every single
paragraph just to share his knowledge with the system....With SGML,
however, it is possible to omit much markup....After using tag
minimization there has been a 40% reduction in markup, since the
end-tags for three of the elements are no longer needed.

The document type definition enables SGML to minimize the user's text
entry effort without reliance on a 'smart' editing program or word
processor. This maximizes the portability of the document because it can
be understood and revised by humans using any of the millions of
existing 'dumb' keyboards."

> Latex has the same problem
> (especially the underscore is painful).  I think something else should
> be used that can be converted to SGML (or XML for all I care).  TIM,
> which has only one magic character (@, which isn't used in Python)
> fits the bill -- it did one or two years when I looked into it, and
> it's only because of inertia (and a lot of other things that needed to
> happen sooner) that I haven't started using it.
>...
> I just don't like the fact that SGML makes characters that occur
> frequently in Python source code like "<" and "/" special.  

SGML has two magic characters: "<" and "&". "/" is only magical when it
is either in a tag, or is used with that special tag <emph/minimization/
that I showed you. If you want to include a slash, you just don't use
that minimization: <code>a/b=c</>.

As for the other two special chars: SGML has three solutions for
putting Python code in a document, depending on your needs. You can make
an element with CDATA declared content like this:

<eg>
c=a<<b/d
</eg>

The only character string the contents of the EG cannot contain is "</".
If you need to include that character string, then you must do this:

<eg>
<![CDATA[
 a = j</5
]]>
</eg>
 
I can't think of a context where you would need that in Python, though.
Finally you can use entities:

This is some inline Python code <CODE/a=b&gt;c/.

Note also that SGML allows you to change *all of* the delimiter
characters, though that is a fairly drastic step (and I wouldn't usually
advise it).
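
Two of those options are mechanical enough to script. What follows is only
a minimal sketch, not part of any SGML toolkit: the function names
escape_sgml and cdata_section are made up for illustration, and the sketch
assumes the default delimiters are in use.

```python
def escape_sgml(text):
    """Replace the two magic characters with entity references."""
    # Escape "&" first; otherwise the "&" in "&lt;" would be escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def cdata_section(code):
    """Wrap a code sample in a CDATA marked section."""
    # A marked section ends at the first "]]>", so this sketch refuses
    # samples that contain that string instead of trying to split them.
    if "]]>" in code:
        raise ValueError("sample contains ']]>'")
    return "<![CDATA[\n" + code + "\n]]>"


print(escape_sgml("if a<b & c: pass"))
print(cdata_section("a = j</5"))
```

Escaping ">" as &gt;, as in the inline example above, is harmless but the
sketch leaves it alone, since only "<" and "&" are magic.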

> Yes, I think the library reference is a separate project from the
> tutorial.  I am planning to do the tutorial in FrameMaker because it
> gives me as an author the best user interface for editing and the most
> freedom to create nice layout, and because it is essentially a
> one-author document it's no problem that not everybody can afford
> FrameMaker (as long as I can generate HTML and PostScript, which I can
> -- and there's even a version of Frame that can generate SGML although
> I don't have it).  (Now that I've got a PC at home I may switch to MS
> Word too -- that's surely democratic :-)

There isn't a version of Frame that can generate SGML. There is a
version of Frame that can edit SGML. There is a subtle but important
difference. Once you start out in Frame *not* using Frame+SGML, there is
nothing that constrains you to using structures that have meaning in a
particular SGML DTD (including HTML). FrameMaker therefore cannot infer
structure from your "nice layout".

I will be very curious to see how good the HTML output is, and how much
"freedom" Frame offers you without totally destroying the consistency of
your HTML output. If you use hot-pink on green to represent important
notes, how is that going to be represented in a document that makes
sense to Lynx? How will you know which FrameMaker features translate
properly into HTML and which do not? Trial and error?

Personally, I think you would be better off using Frame+SGML right off
the bat, because then you will have total control over the output, but I
will be curious to see what you get out of ordinary FrameMaker anyhow --
converting arbitrary MIF to HTML is sort of an AI project and I like to
see what the state of the art in AI is. :)
 
> Also the
> fact that SGML parsers that support the full syntax are either costly
> in money or in resources (few sites that I know have an SGML parser
> installed already; sgmllib.py doesn't cut it).  

I don't see how James Clark's SGML parser is expensive in either money
or resources. On Windows, it takes up about 3.5 MB with the Jade SGML
conversion tool, the OLE automation library, and 3 other related SGML
tools. It is trivial to install and compile. It is actually distributed
fairly widely as an HTML checker.

> TIM, on the other
> hand, was *designed* to be trivial to parse, so you can quickly write
> a small Python script that converts it to any format you like.

Great. But using Jade, I can convert to 4 formats (RTF, MIF, TeX,
PostScript) with a single "small script" (not Python, alas). If I do
want to use Python, my script will be just as simple, but will depend on
nsgmls. And as more formats arise, they will similarly be supported. But
more important -- I shouldn't have to write the small script at all,
because it has already been written.
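
To give an idea of what the Python side could look like, here is a rough
sketch that reads the line-oriented output nsgmls produces. It assumes, as
I understand the format, that each line starts with a one-character code:
"(" for a start-tag, ")" for an end-tag and "-" for data. The names
parse_esis and outline are invented for this example, the sample lines are
written by hand rather than captured from a real run, and escapes inside
data lines are left undecoded.

```python
def parse_esis(lines):
    """Yield (kind, value) pairs from nsgmls-style output lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            yield ("start", rest)
        elif code == ")":
            yield ("end", rest)
        elif code == "-":
            # Data is passed through as-is; escapes are not decoded here.
            yield ("data", rest)
        # Every other record type is ignored in this sketch.


def outline(events):
    """Return the element names, indented by nesting depth."""
    depth = 0
    result = []
    for kind, value in events:
        if kind == "start":
            result.append("  " * depth + value)
            depth += 1
        elif kind == "end":
            depth -= 1
    return result


# Hand-written sample input, standing in for real parser output.
sample = ["(DOC", "(EG", "-c=a<<b/d", ")EG", ")DOC", "C"]
print("\n".join(outline(parse_esis(sample))))
```

In practice the lines would come from a pipe to nsgmls rather than a list,
but the parsing loop stays the same.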

How does TIM enforce the proper organization of document macros? Will it
complain if I put an @messageDef{} inside of an @argDef{}? Doesn't this
type of enforcement seem useful in a situation where many people around
the world are working on a document?
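
To make the question concrete, here is a toy sketch of the kind of check I
mean. It is nowhere near what a DTD content model gives you (no ordering,
no occurrence counts), and everything in it is an assumption made for
illustration: the "doc" root, the ALLOWED_CHILDREN table and the
check_nesting function are hypothetical, not part of TIM or of any SGML
tool.

```python
# Hypothetical rules: which elements may appear directly inside which.
ALLOWED_CHILDREN = {
    "doc": {"messageDef"},
    "messageDef": {"argDef"},
    "argDef": set(),
}


def check_nesting(node, errors=None):
    """Collect nesting errors for a tree of (name, children) tuples."""
    if errors is None:
        errors = []
    name, children = node
    allowed = ALLOWED_CHILDREN.get(name, set())
    for child in children:
        child_name = child[0]
        if child_name not in allowed:
            errors.append(f"{child_name} is not allowed inside {name}")
        # Keep descending so every problem is reported, not just the first.
        check_nesting(child, errors)
    return errors


good = ("doc", [("messageDef", [("argDef", [])])])
bad = ("doc", [("messageDef", [("argDef", [("messageDef", [])])])])
print(check_nesting(good))
print(check_nesting(bad))
```

With an SGML parser the table above is simply the DTD, and the check comes
for free every time anyone parses the document.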

 Paul Prescod

_______________
DOC-SIG  - SIG for the Python Documentation Project

send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org
_______________