Is Python the Esperanto of programming languages?

Alex Martelli aleax at aleax.it
Sat Mar 22 04:37:36 EST 2003


Steven Taschuk wrote:
   ...
> However, there's always *some* noise.  I think it plausible that
> there's enough noise normally (not so much in transmission as in
> utterance and interpretation) to require some degree of redundancy
> for error detection and correction.  How much, I don't know; it
> might well be less than any existing natural language actually has.

Heh -- here you're talking right to the kind of research I was
doing in IBM in the mid-80's, for the very specific purpose of
decreasing the error rate of a real-time dictation-taking system.

The approach was strictly Bayesian: given that our "ear model"
has detected the stream-of-phones S, we want to transcribe a
stream-of-words W such that P(W|S) is maximal.  BUT, we know that:

    P(W|S) = P(S|W) * P(W) / P(S)

and P(S) (the a priori probability of a stream of phones) is
irrelevant since we only care about the W giving maximum P()
for a GIVEN stream of phones.

So what we do care about are two terms:

    P(S|W) -- the probability that speaking a certain stream
              of words causes us to detect a certain stream of
              phones, our "ear model"

AND

    P(W)   -- the a priori probability that whoever is dictating 
              will speak a stream of words, a "language model"

The ear model is always affected by some kind of noise.  The
language model (particularly for Italian) is the part I worked on.
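As a toy illustration of the decoding rule (a hedged sketch with made-up
probabilities, not the IBM system): for each candidate word stream W we
multiply the ear-model term P(S|W) by the language-model term P(W) and
take the maximum; P(S) is the same for every candidate and drops out:

```python
# Hypothetical candidate word streams with made-up model scores.
# ear_model[w]:      P(S|W=w), probability the detected phone stream S
#                    arises from speaking w (the "ear model").
# language_model[w]: P(W=w), the a priori probability of w itself.
ear_model = {
    "recognize speech": 0.30,
    "wreck a nice beach": 0.45,  # acoustically closer, so higher P(S|W)
}
language_model = {
    "recognize speech": 0.010,
    "wreck a nice beach": 0.0001,  # far less likely a priori
}

def decode(candidates):
    """Return the W maximizing P(S|W) * P(W); P(S) is constant and drops out."""
    return max(candidates, key=lambda w: ear_model[w] * language_model[w])

print(decode(ear_model))  # -> "recognize speech"
```

Note how the language model overrules the acoustically better match --
which is exactly why the quality of P(W) matters so much.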

The key result: P(W) is far less affected by grammar than one
might think.  Most of the redundancies we were able to detect
and exploit are in semantics, pragmatics, and the usage of
formulaic language, idioms, and the like.  I don't know what's
happened in the field over the last 15+ years, but I would be
astonished if it turned out that grammar redundancies do instead
play an important role in this -- that would contradict a huge 
mass of results by our research groups, and by the time I left 
the field other groups were starting to work along very similar
lines (a purely statistical approach to language modeling) and
confirming the general lines of our early results.
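A purely statistical language model of the kind described can be sketched
(very roughly -- a toy nine-word corpus standing in for the real hundred
million, and plain maximum-likelihood bigram estimates rather than anything
IBM actually shipped) like this:

```python
from collections import Counter

# Toy corpus; the estimator is the standard maximum-likelihood
# bigram model: P(next | word) = count(word, next) / count(word).
corpus = "the cat sat on the mat the cat ate".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # every word that has a successor

def p_next(word, nxt):
    """ML estimate of P(nxt | word) from bigram/unigram counts."""
    return bigrams[(word, nxt)] / unigrams[word]

print(p_next("the", "cat"))  # -> 0.666...: 2 of 3 "the"s precede "cat"
```

The redundancies the post mentions (formulaic language, idioms, collocations)
show up in such a model as sharply peaked conditional distributions, with no
grammar rules anywhere in sight.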

BTW, most of my analysis was done with a high-level interpreted
scripting language (REXX, as we worked with IBM mainframes) and
a few auxiliary interpreter add-ons (that's why I learned BAL,
as at the time extending REXX meant writing assembly code) and
number crunching programs (Fortran and Pascal/VS) -- we managed
to get together a corpus of about 100 million words (huge for
that era) of office documents, articles from magazines and
newspapers and news services, books and the like, thanks to the
cooperation of many IBM customers who had been using computers
for typesetting and the like for a long time -- most of the work
was "reverse engineering" all sorts of weird word-processing and
typesetting formats back into plain text to analyze...!-)


Alex




