Question concerning this list [WebCrawler]

Diez B. Roggisch deets at nospam.web.de
Mon Jan 1 15:08:51 EST 2007


Thomas Ploch schrieb:
> John Nagle schrieb:
>>     Very true.  HTML is LALR(0), that is, you can parse it without
>> looking ahead.  Parsers for LALR(0) languages are easy, and
>> work by repeatedly getting the next character and using that to
>> drive a single state machine.  The first character-level parser
>> yields tokens, which are then processed by a grammar-level parser.
>> Any compiler book will cover this.
>>
>>     Using regular expressions for LALR(0) parsing is a vice inherited
>> >from Perl, in which regular expressions are easy and "get next
>> character from string" is unreasonably expensive.  In Python, at least
>> you can index through a string.
>>
>> 				John Nagle
> 
> I take it with LALR(0) you mean that HTML is a language created by a
> Chomsky-0 (regular language) Grammar?

Nope.

LALR is a context free grammar parsing technique.

Regular expressions can't express languages like

a^n b^n

but something like

<div><div></div></div>

is <div>^2</div>^2

Diez



More information about the Python-list mailing list