[Python-Dev] OT: XML

Ka-Ping Yee ping@lfw.org
Sun, 16 Apr 2000 18:54:41 -0700 (PDT)


I'll begin with my conclusion, so you can get the high-order bit
and skip the rest of the message if you like:

    XML is useful, but it's not a language.


On Thu, 13 Apr 2000, Paul Prescod wrote:
> 
> What definition of "language" are you using? And while you're at it,
> what definition of "semantics" are you using?
> 
> As I recall, a string is an ordered list of symbols and a language is an
> unordered set of strings.

I use the word "language" to mean an expression medium that carries
semantics at a usefully high level.  The computer-science definition
you gave would include, say, the set of all pairs of integers
(not a language to most people), but not include classical music [1]
(indeed a language to many people).

I admit that the boundary of "enough" semantics to qualify as a
language is fuzzy, but some things do seem quite clearly to fall
on either side of the line for me.  For example, saying that XML
has semantics is roughly equivalent to saying that ASCII has
semantics.  Well, sure, i suppose 65 has the "semantics" of the
uppercase letter A, but that's not semantics at any level high
enough to be useful.

That is why most people would probably not normally call ASCII a
"language".  It has to express something to be a language.  Granted,
you can pick nits and say that XML has semantics as you did, but
to me that essentially amounts to calling the syntax the semantics.

> I know that Ka-Ping, despite going to a great university was in
> Engineering, not computer science

Cute.  (I'm glad i was in engineering; at least we got a little
design and software engineering background there, and i didn't
see much of that in CS, unfortunately.)

> Most XML people will happily admit that XML has no "semantics" but I
> think that's bullshit too.  The mapping from the string to the abstract
> tree data model *is the semantic content* of the XML specification.

Okay, fine.  Technically, it has semantics; they're just very
minimal semantics (so minimal that i felt quite comfortable in
saying that it has none).  But that doesn't change my point -- for
"it has no semantics and therefore doesn't qualify as a language"
just read "it has far too minimal semantics to qualify as a language".

> It makes as little sense to reject XML out of hand because it is a
> buzzword but is not innovative as it does for people to embrace it
> mystically because it is Microsoft's flavor of the week.

Before you get the wrong impression, i don't intend to reject XML
out of hand, or to suggest that people do.  It has its uses, just
as ASCII has its uses.  As a way of serializing trees, it's quite
acceptable.  I am, however, reacting to the sudden onslaught of
hype that gives people the impression that XML can do anything.
It's the attitude of "oh, all of our representation
problems will go away if we throw XML at them" that makes me cringe;
that only avoids the issue.  (I'm not saying that you are this
clueless, Paul! -- just that some people seem to be.)

As long as we recognize XML as exactly what it is, no more and no
less -- a generic mechanism for serializing trees, with associated
technologies for manipulating those trees -- there's no problem.
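
A minimal sketch of "XML as a generic mechanism for serializing
trees", assuming Python 3 and the standard xml.etree.ElementTree
module.  The (tag, children) tuple shape and the helper names
to_element and from_element are hypothetical, chosen only for
illustration.  The tree survives the round trip intact, but nothing
in the process says what "menu" or "spam" means.

```python
import xml.etree.ElementTree as ET

def to_element(node):
    """Convert a (tag, [children]) tuple into an Element (hypothetical shape)."""
    tag, children = node
    elem = ET.Element(tag)
    for child in children:
        elem.append(to_element(child))
    return elem

def from_element(elem):
    """Convert an Element back into the same (tag, [children]) tuple shape."""
    return (elem.tag, [from_element(child) for child in elem])

# An arbitrary tree; the names carry no meaning as far as XML is concerned.
tree = ("menu", [("spam", []), ("eggs", [("boiled", [])])])

# Serialize the tree to XML text, then parse it back.
text = ET.tostring(to_element(tree), encoding="unicode")
round_tripped = from_element(ET.fromstring(text))

print(text)
print(round_tripped == tree)
```

The serialization preserves structure and nothing else; any meaning
has to be supplied by an agreement outside the XML itself.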

> By the way, what data model or text encoding is NOT isomorphic to Lisp
> S-expressions? Isn't Python code isomorphic to Lisp s-expressions?

No!  You can run Python code.  The code itself, of course, can
be interpreted as a stream of bytes, or arranged into a tree of
LISP s-expressions.  But if s-expressions were *all* that
constituted Python, Python would be pretty useless indeed!  The
entity we call Python includes real content: the rules for
deriving the expected behaviour of a Python program from its
parse tree, as variously specified in the reference manual,
the library manual, and in our heads.
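
As a rough illustration of the difference between the tree and the
behaviour, here is a sketch assuming a modern Python 3 and its
standard ast module (neither is mentioned in the discussion above).
Parsing gives only a tree of nodes; it is the interpreter's rules,
applied when the tree is compiled and run, that make anything
happen.

```python
import ast

source = "a = 3"

# Step 1: syntax only.  The source becomes a tree of nodes.
tree = ast.parse(source)
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)

# Step 2: semantics.  Running the tree applies Python's rules,
# which bind the name 'a' to the integer 3 in the given namespace.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
print(namespace["a"])
```

The tree in step 1 could equally be written out as s-expressions or
XML; step 2 is the part that no serialization format supplies.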

LISP itself is a great deal more than just s-expressions.  The
language system specifies the behaviour you expect from a given
piece of LISP code, and *that* is the part i call semantics.


"real" semantics:         Python   LISP   English    MIDI

minimal or no semantics:  ASCII    lists  alphabet   bytes


The things in the top row are generally referred to as
"languages"; the things in the bottom row are not.  Although
each thing in the top row is constructed from its corresponding
thing in the bottom row, the difference between the two is what
i am calling "semantics".  If the top row says A and the bottom
row says B, you can look at the B-type things that constitute
the A and say, "if you see this particular B, it means foo".

XML belongs in the bottom row, not the top row.

Python: "If you see 'a = 3' in a function, it means you take
the integer object 3 and bind it to the name 'a' in the local
namespace."

XML: "If you see the tag <spam eggs="boiled">, it means...
well... uh, nothing.  Sorry.  But you do get to decide that
'spam' and 'eggs' and 'boiled' mean whatever you want."
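
The contrast can be sketched in a few lines, assuming Python 3 and
the standard xml.etree.ElementTree module.  The function name f is
hypothetical, and the tag is written with a quoted attribute value
and as a self-closing element so that the parser accepts it.

```python
import xml.etree.ElementTree as ET

def f():
    a = 3            # Python's rules: bind the name 'a' to 3 locally
    return locals()  # snapshot of the function's local namespace

bindings = f()
print(bindings)

# XML's rules stop at structure: a tag name and an attribute mapping.
elem = ET.fromstring('<spam eggs="boiled"/>')
tree_data = (elem.tag, elem.attrib)
print(tree_data)
```

Python defines what the assignment does; the parser only reports
that there is an element called "spam" with an attribute "eggs"
whose value is "boiled", and leaves the interpretation to you.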

That is why i am unhappy with XML being referred to as a
"language": it is a misleading label that encourages people
to make the mistake of imagining that XML has more semantic
power than it really does.

Why is this a fatal mistake?  Because using XML will no more
solve your information interchange problems than writing
Japanese using the Roman alphabet will suddenly cause English
speakers to be able to read Japanese novels.  It may *help*,
but there's a lot more to it than serialization.

Thus: XML is useful, but it's not a language.

And, since that reasonably summarizes my views on the issue,
i'll say no more on this topic on the python-dev list -- any
further blabbing i'll do in private e-mail.


-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton


[1] I anticipate an objection such as "but you can encode
a piece of classical music as accurately as you like as a
sequence of symbols."  But the music itself doesn't fit the
Chomskian definition of "language" until you add that
symbolic mapping and the rules to arrange those symbols in
sequence.  At that point the thing you've just added *is*
the language: it's the mapping from symbols to the semantics
of e.g. "and at time 5.36 seconds the first violinist will
play an A-flat at medium volume".