Question concerning this list [WebCrawler]

Thomas Ploch Thomas.Ploch at gmx.net
Sun Dec 31 13:16:18 EST 2006


John Nagle wrote:
> 
>     Very true.  HTML is LALR(0), that is, you can parse it without
> looking ahead.  Parsers for LALR(0) languages are easy, and
> work by repeatedly getting the next character and using that to
> drive a single state machine.  The first character-level parser
> yields tokens, which are then processed by a grammar-level parser.
> Any compiler book will cover this.
> 
>     Using regular expressions for LALR(0) parsing is a vice inherited
> from Perl, in which regular expressions are easy and "get next
> character from string" is unreasonably expensive.  In Python, at least
> you can index through a string.
> 
> 				John Nagle

I take it that by LALR(0) you mean that HTML is a language generated by a
Chomsky Type-3 (regular) grammar?
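
To make the character-at-a-time approach quoted above concrete, here is a
minimal sketch of a single state machine that turns a string into tokens
without ever looking ahead. The function name tokenize and the two state
names are hypothetical, chosen only for illustration, and the input is
assumed to be a drastically simplified HTML subset: no comments, no
entities, no '>' inside quoted attribute values, no <script> handling.

```python
def tokenize(html):
    """Yield ('text', s) and ('tag', s) tokens, one character at a time.

    Illustrative only: handles a simplified HTML subset (see assumptions
    in the surrounding text), not real-world HTML.
    """
    state = "text"      # current state of the single state machine
    buf = []            # characters collected for the current token
    for ch in html:     # get the next character; never look ahead
        if state == "text":
            if ch == "<":
                if buf:
                    yield ("text", "".join(buf))
                buf = []
                state = "tag"
            else:
                buf.append(ch)
        else:           # state == "tag"
            if ch == ">":
                yield ("tag", "".join(buf))
                buf = []
                state = "text"
            else:
                buf.append(ch)
    # Flush trailing text; an unterminated tag is silently dropped here.
    if state == "text" and buf:
        yield ("text", "".join(buf))


if __name__ == "__main__":
    for token in tokenize("<p>Hello <b>world</b></p>"):
        print(token)
```

The tokens this yields (tag 'p', text 'Hello ', tag 'b', and so on) would
then be handed to a separate grammar-level parser, which is not shown.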

Thomas


