Xah's Edu Corner: The importance of syntax & notations.

Peter Keller psilord at merlin.cs.wisc.edu
Mon Aug 17 01:22:44 EDT 2009


In comp.lang.scheme toby <toby at telegraphics.com.au> wrote:
> In my opinion Knuth believed in the value of literate programming for
> similar reasons: To try to exploit existing cognitive training. If
> your user base is familiar with English, or mathematical notation, or
> some other lexicography, try to exploit the pre-wired associations.
> Clearly this involves some intuition.

I've dabbled in the linguistics field a little bit and this thread made
me remember a certain topic delineated in this paper:

http://complex.upf.es/~ricard/SWPRS.pdf

This paper uses a concept that I would say needs more investigation:

The mapping of syntax onto metric spaces.

In fact, the paper makes a literal statement about how physically close
two words are when mapped onto a line of text, and its language networks
build graph topologies based upon the distance between words as they
relate to each other in syntactical form.

It seems the foundations for being able to construct a metric to "measure"
syntax are somewhat available. If you squint at the math in the above
paper, you can see how it can apply to multidimensional metric spaces
and syntax in different modalities (like the comparison between
reading written words and hearing spoken words).
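
As a toy illustration of that idea (my own sketch, not the construction
the paper actually uses), you can treat a line of text as a
one-dimensional metric space indexed by character position and measure
how far apart words land:

line = "the quick brown fox jumps over the lazy dog"

# Map each word to the character index where it first appears on the line.
positions = {}
index = 0
for word in line.split(" "):
    positions.setdefault(word, index)
    index += len(word) + 1

# The "distance" between two words is just how far apart they sit.
def distance(w1, w2):
    return abs(positions[w1] - positions[w2])

print(distance("quick", "fox"))   # words that sit close on the line
print(distance("the", "dog"))     # words that sit far apart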

For example, here is the same semantic idea expressed as a regex in three
ways:

english-like:

Match the letter "a" followed by one or more letter "p"s followed by the
letter "l" then optionally the letter "e".

scheme-like:

(regex (char-class #\a) (re+ (char-class #\p)) (char-class #\l)
       (re? (char-class #\e)))

perl-like:

/ap+le?/
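
For concreteness, here is a quick check with Python's standard re module
that the perl-like form encodes the same semantics as the English
description (the test strings are just ones I picked):

import re

# "a", then one or more "p", then "l", then an optional "e".
pattern = re.compile(r"ap+le?")

for s in ["apple", "aple", "appppl", "ale", "apl"]:
    print(s, bool(pattern.fullmatch(s)))
# apple, aple, appppl, apl match; ale does not (it lacks the required "p").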

The perl-like one would be the one chosen by most programmers, but the
reasons for choosing it vary from person to person. Can we do better? Can
we give a measurement of some syntactic quantities that we can optimize
and get a predictable answer?

Most people would say the perl-like form has the least amount of
"syntactic garbage" and is "short". How do we meaningfully define those
two terms? One might say "syntactic garbage" is syntax which doesn't
relate *at all* to the actual semantic objects, and "short" might mean
elimination of redundant syntax and/or transformation of explicit syntax
into implicit syntax already available in the metric space.

What do I mean by explicit versus implicit? 

In the scheme-like form, we explicitly denote the evaluation and class
of the various semantic objects of the regex before applying the "regex"
function across the evaluated arguments. Whitespace is used only to
separate the morphemes of the scheme syntax itself. The embedding metric space
of the syntax (meaning the line of text indexed by character position)
does nothing to help or hinder the expression of the semantic objects.
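
To make the explicitness concrete, here is roughly what the scheme-like
form spells out, transliterated into a nested Python structure of my own
devising; every semantic object gets a named node, and nothing is
inferred from position on the line:

# The scheme-like form, transliterated into nested Python tuples (my own
# rendering).  Every semantic object is named explicitly; the embedding
# space (the line of text) carries no extra meaning.
regex_tree = (
    "regex",
    ("char-class", "a"),
    ("re+", ("char-class", "p")),
    ("char-class", "l"),
    ("re?", ("char-class", "e")),
)

def to_perl(node):
    """Render the explicit tree into the implicit perl-like syntax."""
    head, *rest = node
    if head == "char-class":
        return rest[0]
    if head == "regex":
        return "".join(to_perl(child) for child in rest)
    if head == "re+":
        return to_perl(rest[0]) + "+"
    if head == "re?":
        return to_perl(rest[0]) + "?"
    raise ValueError(head)

print(to_perl(regex_tree))   # -> ap+le?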

In the perl-like form, we implicitly denote the evaluation of the regex
by using a prototyping-based syntax, meaning the inherent qualities of
the embedding metric space are utilized. This specifically means that
evaluation happens left to right in reading order and semantic objects
evaluate directly to themselves and take as arguments semantic objects as
related directly in the embedding metric space. For example, the ? takes
as an argument the semantic object at location index[?]-1. Grouping
parentheses act as a VERY simple tokenization system to group multiple
objects into one syntactical datum. Given this analysis, it seems
perl-like regexes generally auto-quote themselves as an advantageous use
of the embedded metric space in which they reside (aka the line of text).
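
A minimal sketch of that index[?]-1 behaviour (a hypothetical
illustration of mine, not how any real regex engine is written): walk the
perl-like string left to right and let each postfix operator grab the
datum immediately to its left:

# Minimal sketch of the "operator takes the object at index-1" idea
# (a hypothetical illustration, not how a real regex engine parses;
# grouping parentheses are ignored here).
def parse_implicit(src):
    out = []
    for ch in src:
        if ch in "+?*":
            # A postfix operator consumes the semantic object directly to
            # its left in the embedding space.
            out.append((ch, out.pop()))
        else:
            # Ordinary characters auto-quote: they evaluate to themselves.
            out.append(ch)
    return out

print(parse_implicit("ap+le?"))
# ['a', ('+', 'p'), 'l', ('?', 'e')]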

The english-like form is the worst for explicit modeling because it
abstracts the semantic objects into a meta-space that is then referenced
by the syntax in the embedding space of the line itself.  In human
reasoning, that is what the quotes mean.

Out of the three syntax models, only the perl one has no redundant syntax
and takes up the smallest amount of the metric space into which it is
embedded--clearly seen by observation.

Consider that we have the quantities "syntactic datums" and "semantic
objects"; then one might make an equation like this:

                 semantic objects
expressiveness = ----------------
                 syntactic datums

And of course, expressiveness rises as semantic objects begin to outweigh
syntactic datums. This seems a very reasonable, although simplistic,
model to me and would be a good start in my estimation. I say simplistic
because the semantic objects are not taken in relation to each other
on the metric space in which the syntactic datums are embedded; I'll get
to this in a bit.
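
As a back-of-the-envelope check of that ratio, here is the kind of count
I have in mind; the datum tallies for the three forms are my own rough
guesses, so take them as assumptions:

# Back-of-the-envelope version of the ratio.  All three forms denote the
# same 6 semantic objects (the literals a, p, l, e plus the "one or more"
# and "optional" operators); the datum counts are my own rough tallies.
semantic_objects = 6

syntactic_datums = {
    "english-like": 21,   # word count of the English sentence
    "scheme-like": 11,    # one datum per symbol, parentheses not counted
    "perl-like": 6,       # a p + l e ?
}

for form, datums in syntactic_datums.items():
    print(form, round(semantic_objects / datums, 2))
# english-like 0.29, scheme-like 0.55, perl-like 1.0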

Now, what do we do with the above equation? Well, we define what we can,
and then optimize the hell out of it. "Semantic objects" is probably a
fixed quantity for a computer language: there are only so many operators
and grouping constructs, and usually very few (often just one)
function-description semantic objects. So we are left with a free
variable, "syntactic datums", which we should minimize to be as small as
possible.

If we don't take into consideration the topological mapping of
the semantic objects into the syntactic datum space, then the
number of syntactic datums at least equals the number of semantic
objects. Expressiveness is simply one in the best case, and this would
be pretty poor for sure. It could be worse--just add redundant syntactic
datums, and now it is worse!

If we take into consideration the mapping of the semantic objects onto the
metric space, then maybe this becomes the new equation:

                                   semantic objects
expressiveness = -----------------------------------------------------------
                 syntactic datums * distance(index[obj1], ..., index[objn])

The distance() function in this new model is computed relative to the
centroid of the syntactic datums which represent the semantic
objects. (Of course, there are other
models of expressiveness which would need to be explored based upon this
incremental idea I'm presenting.)

Larger distances between semantic objects, or larger syntactic constructs,
would mean less expressiveness.  This is interesting, because it can give
a rough number that sorts the three syntactic models I provided for the
regex and it would follow conventional wisdom. This is a descriptivist
model of syntax measurement.
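
Here is a rough sketch of that measurement in Python. The reading of
distance() (mean offset of each semantic object's index from their common
centroid) and the index positions themselves are my own assumptions, so
treat the numbers as illustrative only:

# Sketch of the distance-weighted measure.  distance() is read here as the
# mean offset of each semantic object's index from their common centroid,
# and the index positions below are my own rough assignments, one per
# semantic object, for each of the three forms.
def distance(indices):
    centroid = sum(indices) / len(indices)
    return sum(abs(i - centroid) for i in indices) / len(indices)

forms = {
    # form: (indices of the 6 semantic objects, syntactic datum count)
    "english-like": ([18, 33, 53, 81, 90, 108], 21),
    "scheme-like":  ([19, 25, 41, 59, 65, 81], 11),
    "perl-like":    ([0, 1, 2, 3, 4, 5], 6),
}

for form, (indices, datums) in forms.items():
    expressiveness = len(indices) / (datums * distance(indices))
    print(form, round(expressiveness, 4))
# perl-like scores highest: its semantic objects sit close together and it
# carries the fewest syntactic datums.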

Obviously, the fun question is: can we search an optimization tree to
find an optimal set of syntactic datums to represent a known and finite
set of semantic objects, taking into consideration the features of the metric
space into which the syntactic datums are embedded?

Maybe for another day.... 

Later,
-pete
