Two RE proposals

Sat Jul 27 17:57:21 EDT 2002

hi david,

> > python already has string substitution.  if it needs better string
> > substitution, that should be solved outside the RE engine.
> 
> why? I'm not suggesting a general solution. I thought about suggesting it,
> but I figured there were probably more ! characters in general strings than
> in re strings. And, there is ample precedent for characters that have no
> special meaning outside of re strings. Oh yeah - and i'm not suggesting a
> modification to Python, i'm suggesting a modification to re-language.

you're suggesting a change to the RE language that will silently
break existing patterns.

> > besides, having library modules peek in your local namespace is
> > really bad style.
> 
> Damn - there goes inspect! I wonder what else displays bad style?
> Introspection/reflection considered harmful?

are you using inspect in production code?  seriously?

some of the problems I see with introspective interpolation include
performance issues, future portability (we want more implementations,
not less), and security issues.  and the "yet another syntax" problem,
of course.
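
to make the difference concrete, here's a rough sketch (not anyone's actual proposal). both helper names are made up for illustration. the first one peeks at the caller's locals through sys._getframe, which is a CPython implementation detail that other implementations need not provide. the second only sees the names it is handed:

```python
import sys

def interpolate_implicit(template):
    # hypothetical helper: reads the *caller's* local variables.
    # sys._getframe is a CPython detail and may be missing elsewhere.
    caller_locals = dict(sys._getframe(1).f_locals)
    return template % caller_locals

def interpolate_explicit(template, **names):
    # hypothetical helper: only the names passed in can be substituted
    return template % names

def demo():
    word = r"\w+"
    secret = "not meant for the template"
    implicit = interpolate_implicit("%(word)s,%(word)s")
    explicit = interpolate_explicit("%(word)s,%(word)s", word=word)
    # the implicit version can reach any local, wanted or not
    leaked = interpolate_implicit("%(secret)s")
    return implicit, explicit, leaked
```

both give the same pattern text for the "word" case. the implicit one also hands out "secret" to any template that asks for it, which is the security worry in a nutshell.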

> > and your proposal will break existing code.
> 
> Unsubstantiated. How can you make that assertion?

! and < and > have a meaning today.  if someone uses them, the RE
engine does things in a special way.  after your change, the RE engine
will (sometimes silently) interpret them in a different way.

breaking people's code is bad.  silently breaking code is even worse.
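
a few patterns that depend on today's behaviour, as a small sketch. the sample patterns and variable names are mine, picked only for illustration; the point is that these characters either match themselves or are already part of the "(?" extension syntax:

```python
import re

# in ordinary positions, "<", "!" and ">" simply match themselves
comment_re = re.compile(r"<!--(.*?)-->")

# after "(?P" the angle brackets delimit a group name
tag_re = re.compile(r"(?P<tag><\w+>)")

# after "(?" the "!" introduces a negative lookahead
not_bar_re = re.compile(r"foo(?!bar)")

comment_text = comment_re.match("<!-- hi -->").group(1)
tag_text = tag_re.match("<b>bold").group("tag")
```

any new meaning for these characters has to leave patterns like the above working exactly as before.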

> Aside from which, has no Python enhancement ever broken existing
> code?

there's always a cost/benefit ratio to take into account.  as it
stands, I'd say your proposal has little benefit (python already
supports string interpolation) and some unknown cost (silent
breakage, yet another way to do it, adding non-standard syntax
to multiple RE implementations).

> > the following approach works in all existing versions of Python,
> > gives you syntax highlighting in all existing Python editors, etc:
> >
> >     import re
> >
> >     def i(*args):
> >         # concatenate the pieces with no separator
> >         return "".join(map(str, args))
> >
> >     word = r"\w*"
> >     punct = r"[,.;?]"
> >     wordpunct = re.compile(i(word, punct))
> >
> >     if_ = r"if"    # "if" is a reserved word
> >     term = r"something"
> >     num = r"\d*"
> >     op = r"[-+*/]"
> >     factor = i(num, r"\s*", op, r"\s*", num)
> >     expr = i(term, factor)
> >     if_stmt = re.compile(i(if_, r"\s*\(?\s*", expr, r"\s*\)?\s*:"))
> 
> Wow, you make Skip's example seem positively eloquent in contrast.

if you have a problem with lists of variables and "literals", separated
by commas, do you ever get anything done in Python? ;-)

> > if you're doing lots of RE stuff, you can trivially extend this to
> > support RE-oriented operations:
> >
> >     if_ = literal("if")
> >     op = set("-+*/")
> >     factor = seq(num, ws, op, ws, num)
> >
> > (google for "rxb" for a complete implementation of that idea)
> 
> looked - not impressed. Might be of interest to XSchema or Relax-NG people.
> Not very re-like.

interestingly enough, the rxb approach closely mirrors what the
engine does on the inside.  someone should really do a usability
study, comparing perl-style REs with rxb/snobol etc.
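
for readers who haven't seen the style, a minimal sketch of such helpers. this is my own toy version, not rxb's actual API. "literal" and "seq" follow the names used in the quoted example, and "charset" stands in for "set" so it doesn't shadow a builtin:

```python
import re

def literal(text):
    # match the text exactly, escaping any RE metacharacters
    return re.escape(text)

def charset(chars):
    # match any single character out of chars
    return "[" + "".join(re.escape(c) for c in chars) + "]"

def seq(*parts):
    # match each part in order
    return "".join(parts)

ws = r"\s*"
num = r"\d*"
op = charset("-+*/")
factor = re.compile(seq(num, ws, op, ws, num))
```

with this, factor.match("3 * 4") succeeds, and nobody has to remember which of "-+*/" needs a backslash inside a character class.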

> > > 2. Make r"(a|b)*" mean any number of a's or b's.
> >
> > it does mean any number of a's or b's.  but no more than a
> > single a or b will end up in the group.
> 
> Huh? Actually, re rejects the pattern, or if you try hard enough, goes
> into an infinite loop.

>>> import re
>>> p = re.compile("(a|b)*")
>>> p.match("ababa").groups()
('a',)
>>> p.match("babab").groups()
('b',)
>>> p = re.compile("((?:a|b)*)")
>>> p.match("ababa").groups()
('ababa',)
>>> p.match("babab").groups()
('babab',)

does exactly what it's supposed to do.

what Python version are you using, and what do your patterns
and target strings look like?

infinite loops are usually caused by careless use of the "*" operator;
in a worst case scenario, you'll end up with what is essentially a
bunch of nested loops, each looping over "the rest of the string".
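
a small illustration of the nested-loop shape, using patterns I picked for the purpose (they're not from the thread). both accept the same strings, but on a run of a's with no trailing b the first one can retry many ways of splitting the run between the inner and outer loop, so its run time can grow very quickly with the length of the input:

```python
import re

nested = re.compile(r"(?:a+)+b")   # a loop inside a loop
flat = re.compile(r"a+b")          # same strings, one loop

def agree(text):
    # True if both patterns give the same yes/no answer
    return bool(nested.match(text)) == bool(flat.match(text))

# inputs kept short on purpose; the last one has no "b" to stop on
samples = ["ab", "aaab", "b", "aaaa", "a" * 12]
results = [agree(s) for s in samples]
```

the usual fix is to flatten the pattern so that only one loop can consume any given character.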

> > fixing the RE is done in a similar fashion: make sure the group
> > matches everything you want to put in the group:
> >
> >     r"((?:a|b)*)"
> >
> > if you want lists of matching things, use findall.
> 
> So hard to embed findall into an re pattern - what's your secret?

if your string isn't regular enough, use findall on the matched
group:

    >>> p = re.compile("((?:a|b)*)")
    >>> p.match("ababa").groups()
    ('ababa',)
    >>> re.findall("(a|b)", "ababa")
    ['a', 'b', 'a', 'b', 'a']

or if you want to play with undocumented features, use a scanner
object:

    http://effbot.org/guides/xml-scanner.htm
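
the findall-on-the-group idea from above can be wrapped up in a few lines. this is a sketch only, and the names "outer", "inner" and "items" are mine:

```python
import re

outer = re.compile(r"((?:a|b)*)")   # grabs the whole run
inner = re.compile(r"a|b")          # picks out the individual items

def items(text):
    # match the run first, then split the matched group with findall
    m = outer.match(text)
    return inner.findall(m.group(1))
```

items("ababa!") gives the five individual letters, and anything after the run is ignored because findall only ever sees the matched group.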

> What do I do if I want a better Python? Do we wait for specific people to
> make suggestions or can anyone join in?

keep pushing.  but expect pushback.

in this case, I suggest checking the string interpolation PEPs and
perhaps checking the newsgroup archives for similar ideas, and
spend a little more time thinking about how to avoid breakage.

the next step is writing a (pre) PEP and posting it to the list.  bonus
points if the PEP covers alternative solutions as well.

</F>