Regular Expressions

Mon Feb 12 12:17:14 EST 2007

    dbl> The source of HTMLParser and xmllib use regular expressions for
    dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
    dbl> code--sgmllib starts off with a bunch of regular expressions used
    dbl> to parse data.

I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").

If I have a simple expression:

    (7 + 3.14) * CONST

that's just a stream of bytes: "(", "7", " ", "+", ...  Lexical analysis
chunks that stream of bytes into the "words" of the language:

    LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")
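
To make that concrete, here is a minimal sketch of a regular-expression
lexer in Python.  The token names mirror the stream above; the TOKEN_SPEC
table, the SKIP and ERROR entries and the tokenize function are my own
illustration, not anything taken from HTMLParser, xmllib or sgmllib.

```python
import re

# Illustrative token table (an assumption, not from any stdlib module).
# Order matters: earlier alternatives win, and ERROR is the catch-all.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),   # 7 or 3.14
    ("IDENT",  r"[A-Za-z_]\w*"),    # CONST
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),             # whitespace, dropped below
    ("ERROR",  r"."),               # any other single character
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Chunk text into a list of (kind, value) tokens."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError("unexpected character %r at %d" % (value, m.start()))
        if kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        tokens.append((kind, value))
    return tokens


print(tokenize("(7 + 3.14) * CONST"))
```

Note that the lexer knows nothing about whether the parentheses balance or
which operator binds tighter.  It only recognizes the individual "words".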

Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes).  That representation is
application-dependent.
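
As one possible illustration, the sketch below is a tiny hand-written
recursive-descent parser that takes a token list like the one above and
builds a nested-tuple tree.  The grammar (TIMES binds tighter than PLUS),
the tuple representation and the function names are all assumptions chosen
for this example; a real application would pick whatever representation
suits it.

```python
def parse(tokens):
    """Build a nested-tuple tree from a list of (kind, value) tokens.

    Assumed grammar, for illustration only:
        expr   := term (PLUS term)*
        term   := factor (TIMES factor)*
        factor := NUMBER | IDENT | LPAREN expr RPAREN
    """
    pos = 0

    def peek():
        # Kind of the next token, or None at end of input.
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        # Consume the next token, insisting that it has the given kind.
        nonlocal pos
        if peek() != kind:
            raise SyntaxError("expected %s, found %s" % (kind, peek()))
        value = tokens[pos][1]
        pos += 1
        return value

    def expr():
        node = term()
        while peek() == "PLUS":
            take("PLUS")
            node = ("PLUS", node, term())
        return node

    def term():
        node = factor()
        while peek() == "TIMES":
            take("TIMES")
            node = ("TIMES", node, factor())
        return node

    def factor():
        if peek() == "NUMBER":
            return take("NUMBER")
        if peek() == "IDENT":
            return ("IDENT", take("IDENT"))
        take("LPAREN")
        node = expr()
        take("RPAREN")
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError("unexpected trailing %s" % peek())
    return tree


TOKENS = [("LPAREN", "("), ("NUMBER", 7), ("PLUS", "+"), ("NUMBER", 3.14),
          ("RPAREN", ")"), ("TIMES", "*"), ("IDENT", "CONST")]
print(parse(TOKENS))
# prints ('TIMES', ('PLUS', 7, 3.14), ('IDENT', 'CONST'))
```

The parentheses have disappeared from the result: their only job was to
shape the tree, which is exactly the structure the flat token stream lacks.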

Regular expressions are ideal for lexical analysis.  They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.
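
A quick way to see the limitation is nested parentheses.  The sketch below
uses a pattern I wrote for illustration (ONE_LEVEL) that allows exactly one
level of nesting inside the outer pair.  It stops matching as soon as the
input nests one level deeper, whereas a few lines of ordinary code with a
depth counter handle any depth.

```python
import re

# Illustrative pattern: an outer "(...)" whose contents may include
# "(...)" groups that themselves contain no further parentheses.
ONE_LEVEL = re.compile(r"\((?:[^()]|\([^()]*\))*\)")


def balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ")" with no open "(" before it
                return False
    return depth == 0


for sample in ("(a(b)c)", "(a(b(c)))"):
    print(sample, bool(ONE_LEVEL.fullmatch(sample)), balanced(sample))
```

You can always bolt on another level by nesting the pattern inside itself,
but each version still has a fixed maximum depth.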

Here are a couple of much better expositions on these topics:

    http://en.wikipedia.org/wiki/Lexical_analysis
    http://en.wikipedia.org/wiki/Parsing
