Question concerning this list [WebCrawler]

Sun Dec 31 12:44:06 EST 2006

Thomas Ploch wrote:
> Marc 'BlackJack' Rintsch wrote:
> 
>>In <mailman.2166.1167535289.32031.python-list at python.org>, Thomas Ploch
>>wrote:

>>>Alright, my prof said '... to process documents written in structural
>>>markup languages using regular expressions is a no-no.'

    Very true.  HTML is LALR(0), that is, you can parse it without
looking ahead.  Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine.  A first, character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.

    Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive.  In Python, at
least, you can index through a string.
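
    A minimal sketch of that character-driven approach, to make the
idea concrete.  The function name tokenize() and the two-state design
are assumptions made up for illustration, not anything from a real
library, and it handles only a tiny subset of HTML: no comments, no
'>' inside quoted attribute values, no CDATA.

```python
# Hypothetical sketch: a character-at-a-time tokenizer for a tiny HTML subset.
# It only separates "<...>" tags from the text between them.

TEXT, TAG = "text", "tag"   # the two states of the machine


def tokenize(source):
    """Yield (kind, value) pairs, where kind is TEXT or TAG."""
    state = TEXT
    buf = []
    i = 0
    while i < len(source):
        ch = source[i]              # "get next character" is just an index
        if state == TEXT:
            if ch == "<":
                if buf:             # emit any text collected so far
                    yield (TEXT, "".join(buf))
                buf = []
                state = TAG
            else:
                buf.append(ch)
        else:                       # state == TAG
            if ch == ">":
                yield (TAG, "".join(buf))
                buf = []
                state = TEXT
            else:
                buf.append(ch)
        i += 1
    # Trailing text is emitted; an unterminated tag is silently dropped.
    if buf and state == TEXT:
        yield (TEXT, "".join(buf))


if __name__ == "__main__":
    for token in tokenize("<p>Hi <b>there</b></p>"):
        print(token)
```

    The token stream this yields is what a grammar-level parser
would then consume.  For real documents, the standard library's
html.parser module is the safer choice.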

				John Nagle