[Tutor] Re: parsing--is this right?

Derrick 'dman' Hudson dman@dman.ddts.net
Mon, 10 Jun 2002 20:52:54 -0500



On Mon, Jun 10, 2002 at 03:02:37PM -0700, Danny Yoo wrote:
| On Mon, 10 Jun 2002, Paul Tremblay wrote:
|
| > I have just stumbled across the concept of parsing and parsing
| > grammars, and wondered if I am using the right tool.
| >
| > I have downloaded and installed plex in order to parse an rtf
| > document. The rtf looks like this:
| >
| > {\footnote {\i an italicized word} {\i maybe another italicized
| > word} text }
| >
| > However, I am wondering if plex does things the wrong way.
|
| Plex is a tool for breaking the text into "tokens" that are easier to look
| at.  However, it's not enough.  You'll probably want to use a parsing
| technique like recursive descent, or use a specialized parser-building
| tool.  We can talk about it more if you'd like.

Isn't there some sort of lex/yacc clone for Python?

lex (or flex, if you use the GNU version) is a lexical analyzer
generator for C.  It is often used in conjunction with yacc (or bison,
if you use the GNU version) which is a "compiler compiler" (for C).
(yacc == Yet Another Compiler Compiler)

The combination of lex and yacc allows rapid development of a flexible
and robust parser for your C program.  With lex you simply specify
regex patterns for identifying the tokens (I think plex is supposed to
do the same sort of thing), and those tokens are passed into the
"compiler" (really a parser) that yacc generates.  You give yacc the
grammar of your language (a context-free grammar, or CFG, written in a
BNF-like notation) and it generates the necessary C code to recognize
that grammar from the tokens lex passes it.  It's somewhat complicated
to explain with no prior background, but it's a really neat setup.
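
To make the lexer/parser split concrete, here is a minimal sketch in
plain Python.  It is not plex, lex, or yacc: the token names, the
tokenize() and parse() helpers, and the tiny grammar are all assumptions
made up for illustration, and the parser is hand-written recursive
descent rather than generated code.  It only covers the brace/control-word
subset shown in Paul's sample, not real RTF.

```python
import re

# Token patterns, in the spirit of a lex specification.
# The token names are illustrative, not taken from any RTF tool.
TOKEN_RE = re.compile(r"""
    (?P<OPEN>\{)
  | (?P<CLOSE>\})
  | (?P<CONTROL>\\[a-zA-Z]+)
  | (?P<TEXT>[^{}\\]+)
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, value) pairs; whitespace-only text is dropped."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        pos = match.end()
        kind, value = match.lastgroup, match.group()
        if kind == "TEXT":
            value = " ".join(value.split())  # collapse runs of whitespace
            if not value:
                continue
        yield kind, value


def parse_group(tokens):
    """group := OPEN (CONTROL | TEXT | group)* CLOSE

    The OPEN token has already been consumed by the caller.
    """
    items = []
    for kind, value in tokens:
        if kind == "OPEN":
            items.append(parse_group(tokens))  # nested group: recurse
        elif kind == "CLOSE":
            return items
        else:
            items.append(value)
    raise SyntaxError("missing closing brace")


def parse(source):
    """Parse one top-level group into nested Python lists."""
    tokens = tokenize(source)
    kind, _ = next(tokens)
    if kind != "OPEN":
        raise SyntaxError("expected '{'")
    return parse_group(tokens)


sample = r"{\footnote {\i an italicized word} {\i maybe another italicized word} text }"
print(parse(sample))
```

The tokenizer alone cannot tell which closing brace matches which
opening brace; the recursive parse_group() call is what tracks the
nesting.  That is the part a tokenizer like plex leaves to you.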

In one lab, in a matter of hours, a friend and I implemented a
calculator using lex and yacc.  That calculator was quite flexible,
allowed whitespace in various ways (like real tools such as the C
compiler or Python allow) and allowed C-style comments.  Previously I
had implemented a similar tool in C++ with a larger group and we spent
weeks on it.  It had a much stricter use of whitespace because the
parser was all hand-coded and didn't use regex at all.  The difference
between hand-coding your own parser from scratch and using a generated
one is significant.
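
For a rough idea of what such a calculator involves, here is a small
sketch in Python.  It is not the lab code and does not use lex or yacc;
the names (tokenize, Parser, calculate) and the grammar are assumptions
for illustration, and the parser is hand-written recursive descent
instead of a generated one.  The point is that once regex patterns do
the tokenizing, free-form whitespace and C-style comments cost a single
"skip" pattern.

```python
import re

# Lexer rules: the SKIP rule is what makes whitespace and comments free.
TOKEN_RE = re.compile(r"""
    (?P<SKIP>\s+|/\*.*?\*/)      # whitespace and C-style comments
  | (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<OP>[-+*/()])
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Return a list of (kind, value) tokens, ending with an END marker."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        pos = match.end()
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    tokens.append(("END", ""))
    return tokens


class Parser:
    """Grammar (illustrative):
         expr   := term (('+' | '-') term)*
         term   := factor (('*' | '/') factor)*
         factor := NUMBER | '-' factor | '(' expr ')'
    """

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][1]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()[1]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()[1]
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        kind, text = self.advance()
        if kind == "NUMBER":
            return float(text)
        if text == "-":
            return -self.factor()
        if text == "(":
            value = self.expr()
            if self.advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError("unexpected token %r" % text)


def calculate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() != "":  # anything left before END is an error
        raise SyntaxError("unexpected trailing input")
    return value


print(calculate("(1 + 2) /* a comment */ * 3"))
```

With yacc you would write only the grammar in the docstring and let the
tool generate the equivalent of the Parser class.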

If I were you, I would try to find an existing tool if you can.  Look
for an RTF parser (though I don't think you're likely to find a decent
one unless you disassemble MS Word) or a parser generator like
lex/yacc.

HTH,
-D


--

The wise in heart are called discerning,
and pleasant words promote instruction.
        Proverbs 16:21

GnuPG key : http://dman.ddts.net/~dman/public_key.gpg

