[DOC-SIG] What I don't like about SGML

Guido van Rossum guido@CNRI.Reston.Va.US
Sun, 16 Nov 1997 10:54:00 -0500


Here's the background of my dislike for SGML.  To confine this
highly flammable material :-), I'm spawning another thread.

First, while SGML may have been standardized in the swinging '80s, it
definitely has its roots in the '70s -- it takes many years to become
an international standard (look at C++!), and it started its life, as
"GML", long before standardization started.  Undoubtedly some of the
worst features in SGML were designed to be backwards compatible
(again, very much like C++...).

I am well aware that HTML has been SGML conformant since HTML 2.0, and
this is precisely the reason for my concern.

99.9% of the time, HTML is parsed by relatively simple handwritten
parsers, not by generic SGML scanners.  There are lots of programs out
there that have to parse HTML -- preprocessors, web browsers, web
spiders, etc.  Why don't these just link to an existing SGML scanner?
Because SGML scanners are *huge*.  They need to be big to scan generic
SGML, which is a very complex language.  But most of this power isn't
needed to scan HTML, so people roll their own parser.

Before HTML had a version number, I wrote an HTML scanner in Python.
It was very simple.  Look for < or </ followed by a letter, then scan
up to a > character, etc.  HTML was simple to scan by design: Tim
Berners-Lee wanted HTML and HTTP to be so simple that almost anybody
could write programs that would immediately interoperate with the rest
of the web as it then existed.  There is no doubt that this is the
reason that the web took off at all.
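
To make the "very simple" scanner concrete, here is a minimal sketch
of the approach just described.  It is not the original code: the
names TAG and scan are hypothetical, and the use of the re module is
an assumption made for brevity.

```python
import re

# Assumed pattern: "<" or "</" followed by a letter, then everything
# up to the next ">".  A "<" not followed by a letter is plain text.
TAG = re.compile(r"<(/?)([A-Za-z][^>]*)>")

def scan(text):
    """Yield ('text', s), ('start', s) or ('end', s) tokens."""
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Everything between two tags is passed through as text.
            yield ("text", text[pos:m.start()])
        kind = "end" if m.group(1) else "start"
        yield (kind, m.group(2))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])
```

Note that this sketch knows nothing about '&', comments, <!...> or
<?...>, which is the point: none of those existed yet.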

But Berners-Lee made one mistake: he made HTML look a bit like SGML
(which he had seen once or twice from a distance :-).  Almost
immediately HTML was targeted by the SGML lobby for full compliance.
Here's what was added; all of this made my parser much more
complicated than I think it ought to be (look at how complicated
sgmllib.py is).  Note that most of what was added doesn't add
functionality.  In one or two cases it even takes away functionality!
It just complicates the scanning process in order to be compatible
with the extremely complicated scanning rules designed for SGML on
punched cards in the 70s.

    - A second special character '&' for entity references (original HTML
    used <lt> to escape "<").

    - Character references like &#32; or &#SPACE;.

    - Comments in the form of <!--.....-->, truly the most atrocious
    comment convention invented (and I believe it's worse -- officially,
    "--" may not occur inside a comment but "-- --" may, or something like
    that; but who cares, as almost no handwritten parser seems to get this
    right).

    - Special stuff to be ignored, starting with <!...>, where it is
    tricky to determine what the end is (since sometimes "<" or ">" may
    occur inside).

    - Special stuff to be ignored, starting with <?...>.

    - Short tags, <word/.../, which are still mostly outlawed for
    reasons of compatibility with older HTML processors, but which have
    to be recognized if you want to claim the elusive "full compliance".

    - It is not possible to turn off processing completely.  There used to
    be an HTML tag <LISTING> (?) which switched to literal copying of the
    text until </LISTING> was found.  This is impossible to do in SGML --
    the best you can do is to switch to literal mode until </ followed by
    a letter is seen, and you can't turn off &ref; processing either.
    Of course, with a handwritten parser it is no problem to switch to a
    mode that scans for </LISTING> exclusively...

    - Why do I have to put quotes around the URL in <A
    HREF="http://www.python.org"> ???

    - Other restrictions on what you can do with attributes; apparently
    there's a semantic rule that says that if two unrelated tags have an
    attribute with the same name, it must have the same "type".

    - A content model, which nobody asked for, and which few people check
    for, but which still allows HTML purists to tell you that your HTML
    page is "non-conformant" when you place an <H4> heading inside a <LI>
    list item (okay, so I made that up).

    - Probably a few other things that nobody asked for, such as the
    DTD declaration and SGML's approach to character sets (which is
    probably broken -- I believe there is a way to switch character
    sets in mid-stream...).
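
As an illustration of the <LISTING> item above, here is a minimal
sketch of the literal mode a handwritten parser can offer.  The
function name split_listing is hypothetical, and allowing whitespace
before the closing ">" and ignoring case are assumptions, not rules
taken from any specification.

```python
import re

def split_listing(text, tag="LISTING"):
    """Split text at the first closing tag for `tag`.

    Returns (literal, rest).  Nothing inside `literal` is
    interpreted: other tags, "</" and "&" are all copied verbatim.
    """
    close = re.compile(r"</" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    m = close.search(text)
    if m is None:
        # No closing tag: treat the remainder as literal text.
        return text, ""
    return text[:m.start()], text[m.end():]
```

For example, split_listing("if a</b && x</LISTING>after") returns
("if a</b && x", "after"): the "</b" and the "&&" survive untouched,
which an SGML-conformant literal mode would not permit.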

Of course, SGML aficionados will claim that all this was necessary so
that HTML could be processed with SGML, the most powerful and flexible
text processing mechanism available.  However, 99% of all HTML written
will never be processed by SGML; it is intended for throw-away
content.  Serious SGML users have two other recourses available to
them:

(1) Write everything in SGML and generate HTML from that; I believe
Jade can do this.

(2) Write a simple HTML scanner and use it to convert the HTML to
SGML, by hook or by crook.  I believe this is being done too.

So my claim remains that the requirement of SGML conformance is, 99%
of the time, just a nuisance for parser writers.  Of course I'm biased, since
I'm a parser writer myself...  So see for yourself what you think of
this argument.

--Guido van Rossum (home page: http://www.python.org/~guido/)

_______________
DOC-SIG  - SIG for the Python Documentation Project

send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org
_______________