Stackless & String-processing

Fernando Pereira pereira at research.att.com
Sat Jul 17 00:20:38 EDT 1999


In article <7mnirc$t2b$2 at cronkite.cc.uga.edu>, Graham Matthews
<graham at sloth.math.uga.edu> wrote:

> Fernando Pereira (pereira at research.att.com) wrote:
> : but, as far as I know, nothing beats
> : the automata-theoretic optimization techniques for pattern matching and
> : parsing, of which the typical modern compiler for regular expressions
> : is just the best-known instance.
> 
> In what sense are you using the words "nothing beats". I am sure that
> the automata-theoretic approach produces the fastest pattern matchers
> and parsers. So on this sense "nothing beats them".
I made the sense clear: fastest matching and parsing for regular and
deterministic CF languages. My main use of finite-state techniques these
days, though, is in speech and text processing, involving some
less-known and some new techniques for combining and optimizing weighted
finite-state transducers <http://www.research.att.com/sw/tools/fsm/>.
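To make the "automata-theoretic optimization" point concrete, here is a
minimal sketch (all names and the toy NFA are mine, not from the tools
mentioned above): a nondeterministic automaton is determinized once by
subset construction, after which every input string is matched in a
single linear scan with no backtracking. This is the idea behind a
compiled regular-expression matcher.

```python
# Toy NFA for the regular language (a|b)*abb, written as
# {state: {symbol: set_of_next_states}}; state 3 is accepting.
from collections import deque

NFA = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
    3: {},
}
NFA_START, NFA_ACCEPT = 0, {3}

def determinize(nfa, start, accepting):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa, accept = {}, set()
    queue = deque([start_set])
    while queue:
        s = queue.popleft()
        if s in dfa:
            continue
        dfa[s] = {}
        if s & accepting:
            accept.add(s)
        for sym in {sym for q in s for sym in nfa.get(q, {})}:
            t = frozenset(r for q in s for r in nfa.get(q, {}).get(sym, ()))
            dfa[s][sym] = t
            queue.append(t)
    return dfa, start_set, accept

def matches(text, dfa, start, accept):
    """Linear-time scan: one table lookup per input symbol."""
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in accept

dfa, start, accept = determinize(NFA, NFA_START, NFA_ACCEPT)
print(matches("abababb", dfa, start, accept))  # True
print(matches("abba", dfa, start, accept))     # False
```

The one-time determinization cost is what buys the guaranteed
linear-time scan; a backtracking matcher pays at match time instead,
sometimes exponentially.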
> But the class of 
> languages recognisable by automaton doesn't cover all the languages of
> interest (in fact my guess is it covers very few ... context ...).
In what sense of "very few"? Most programming languages are designed to
have an LR(k) or LL(k) backbone, and regular lexical structure (those
that aren't are cursed by their unhappy implementors). {S|HT|X}ML and
other Internet protocols obey similar or even stronger restrictions as
far as I know.

Naturally occurring sequences (text, speech, proteins, DNA, ...) do not
fit neatly into the standard formal language classes, and for those the
question is which language class can best approximate the string
distribution, which pushes one quickly to very restrictive language
classes (e.g. bounded-context regular languages, or lexicalized
stochastic CFGs) learnable automatically from data.
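As a toy instance of a restrictive, data-learnable class (my own
illustration, not a technique claimed above): a character bigram Markov
model, which is just a weighted finite-state automaton whose states are
the previous symbol, estimated from a handful of training strings.

```python
# Learn a weighted regular language (bigram model) from example strings.
from collections import defaultdict

def train_bigram(strings):
    """Count symbol-to-symbol transitions, then normalize to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        prev = "^"              # start-of-string marker
        for ch in s + "$":      # "$" marks end-of-string
            counts[prev][ch] += 1
            prev = ch
    model = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values())
        model[prev] = {ch: c / total for ch, c in nxt.items()}
    return model

def prob(model, s):
    """Probability the model assigns to string s (0.0 for unseen transitions)."""
    p, prev = 1.0, "^"
    for ch in s + "$":
        p *= model.get(prev, {}).get(ch, 0.0)
        prev = ch
    return p

m = train_bigram(["ab", "ab", "ac"])
print(prob(m, "ab"))   # 2/3: P(a|^)=1, P(b|a)=2/3, P($|b)=1
```

Real systems use far richer weighted transducers, but the shape is the
same: a restrictive automaton class whose weights are fit to data.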
> So
> in this sense lots of things "beat" the automata approach.
Opportunities for optimization translate directly into efficiency,
which is crucial in text and speech indexing, Web search, biological
sequence processing, and event log analysis, to name just a few.
> And then
> there are software engineering issues ...
Agreed. However, more traditional lexer and parser generators had
sensible answers to many of those issues; they might not be as elegant
as monadic approaches, but they paid serious attention to optimization
of the regular and CF backbones. By the way, I find it faintly amusing
to see the functional programming community rediscovering, and dressing
in categorial garb, a simple idea that was a main motivator for the
development of logic programming in the early 70s (see
<http://akpublic.research.att.com:9000/~pereira/bib.html#Pereira+Warren-80:DCGs>
for a survey). It *is* a good idea for rapid prototyping, but
interspersing pattern matching/parsing with arbitrary computations
(even purely functional ones) effectively blocks optimization, which is
required in large-scale applications.
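To illustrate the last point, here is a minimal parser-combinator
sketch (all names are mine, chosen for illustration): each parser maps
(text, pos) to (value, new_pos) or None, and `action` splices in an
arbitrary host-language function. That function is an opaque closure,
so no tool can merge, determinize, or table-compile the underlying
grammar the way a regex or LR-table compiler can.

```python
# Tiny recursive-descent combinators with embedded semantic actions.

def lit(s):
    """Match the literal string s."""
    def p(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return p

def seq(*parsers):
    """Match each parser in order, collecting their values."""
    def p(text, pos):
        values = []
        for q in parsers:
            r = q(text, pos)
            if r is None:
                return None
            v, pos = r
            values.append(v)
        return values, pos
    return p

def alt(*parsers):
    """Try each alternative in order, returning the first success."""
    def p(text, pos):
        for q in parsers:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

def action(parser, f):
    """Apply an arbitrary computation f to the parsed value.
    f is a black box to any would-be grammar optimizer."""
    def p(text, pos):
        r = parser(text, pos)
        if r is None:
            return None
        v, pos = r
        return f(v), pos
    return p

digit = alt(*(lit(c) for c in "0123456789"))
number = action(seq(digit, digit), lambda ds: int("".join(ds)))
print(number("42!", 0))  # (42, 2)
```

The `alt` over ten `lit` parsers is tried branch by branch at match
time; a compiled recognizer would collapse it into a single
character-class test, but the interleaved `action` closures prevent
that kind of whole-grammar analysis.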

-- F

More information about the Python-list mailing list