ask for a RE pattern to match TABLE in html

Dan thermostat at gmail.com
Fri Jun 27 15:30:17 EDT 2008


On Jun 27, 1:32 pm, "David C. Ullrich" <dullr... at sprynet.com> wrote:
> In article
> <62f752f3-d840-42de-a414-0d56d15d7... at w4g2000prd.googlegroups.com>,
>  Jonathan Gardner <jgard... at jonathangardner.net> wrote:
>
> > On Jun 26, 3:22 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
> > > Try something like:
>
> > > re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
>
> > So you would pick up strings like "<table><tr><td><table><tr><td>foo</
> > td></tr></table>"? I doubt that is what oyster wants.
>
> I asked a question recently - nobody answered, I think
> because they assumed it was just a rhetorical question:
>
> (i) It's true, isn't it, that it's impossible for the
> formal CS notion of "regular expression" to correctly
> parse nested open/close delimiters?

Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.

>
> (ii) The regexes in languages like Python and Perl include
> features that are not part of the formal CS notion of
> "regular expression". Do they include something that
> does allow parsing nested delimiters properly?

So, I think most of the extensions fall into syntactic sugar
(certainly all the character classes \b \s \w, etc). The ability to
look at input without consuming it is more than syntactic sugar, but
my intuition is that it could be pretty easily modeled by a
nondeterministic finite state machine, which is of equivalent power to
REs. The only thing I can really think of that is completely non-
regular is the \1 \2, etc syntax to match previously match strings
exactly. But since you can't to an arbitrary number of them, I don't
think its actually context free. (I'm not prepared to give a proof
either way). Needless to say that even if you could, it would be
highly impractical to match parentheses using those.

So, yeah, to match arbitrary nested delimiters, you need a real
context free parser.

>
> --
> David C. Ullrich


-Dan



More information about the Python-list mailing list