[Python-bugs-list] [ python-Bugs-476912 ] regex annoyance

noreply@sourceforge.net noreply@sourceforge.net
Wed, 31 Oct 2001 18:48:19 -0800


Bugs item #476912, was opened at 2001-10-31 12:17
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=476912&group_id=5470

Category: Regular Expressions
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Bill Bumgarner (bbum)
Assigned to: Fredrik Lundh (effbot)
Summary: regex annoyance

Initial Comment:
(This may be a feature request, but it is annoying 
enough that I filed it as a bug.)

Python's named subexpressions within regular 
expressions are an incredibly valuable feature. 
Combined with the ability to automatically collapse 
multiline regexes with comments, they lead to very 
readable regexes.
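
As a small illustration of the two features mentioned above, the sketch below combines named groups with the `re.VERBOSE` flag. The pattern name `WORD_PAIR` and the sample input are made up for this example and do not come from the report.

```python
import re

# Hypothetical pattern: two lowercase words separated by one whitespace char.
# re.VERBOSE lets the pattern span several lines and carry comments.
WORD_PAIR = re.compile(
    r"""
    (?P<Tok1>[a-z]+)   # first token
    \s                 # a single whitespace character
    (?P<Tok2>[a-z]+)   # second token
    """,
    re.VERBOSE,
)

match = WORD_PAIR.match("hello world")
tokens = match.groupdict() if match else {}
print(tokens)
```

The groups are read back by name through `groupdict()`, so nothing depends on their numeric position.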

However, there is an annoyance in named 
subexpressions that has bitten me several times.

Namely, suppose a particular token must be parsed 
out of the input by one of two (or more) alternative 
expressions, and the pattern cannot be written without 
giving the same named subexpression more than one 
place where it can match. In that case the named 
subexpression will only be non-None intermittently 
(depending on expression order and what was matched).

That is, given:

(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))

In this case, Tok1 and Tok2 will be None if the first 
expression matches... 

(Yes, this is a contrived example that could be 
refactored to not use multiple <Tok1>/<Tok2> 
references; however, more complex expressions 
do not always enable easy refactoring.)
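
For the contrived example, one possible refactoring is sketched below. It assumes the only difference between the two alternatives is the separator (a space or a tab), so the separator can be folded into a character class and each name appears once. The name `PAIR` and the sample inputs are hypothetical.

```python
import re

# A single alternative: the separator is the only part that differs, so it
# goes into a character class and each group name is defined exactly once.
PAIR = re.compile(r"(?P<Tok1>[a-z]+)[ \t](?P<Tok2>[a-z]+)")

space_result = PAIR.match("foo bar").groupdict()
tab_result = PAIR.match("foo\tbar").groupdict()
print(space_result, tab_result)
```

This only works because the alternatives share their whole structure; it does not help when the branches differ in more than one place.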

----------------------------------------------------------------------

>Comment By: Bill Bumgarner (bbum)
Date: 2001-10-31 18:48

Message:

While I agree that the proposed solution of raising an exception would certainly be more acceptable behavior than what is occurring now, doing away with support for multiple subexpressions with the same name would be undesirable.

In particular, named subexpressions free the developer from counting groups. Allowing the same name in more than one alternative also spares the developer a few lines of if/else statements to get the value when it might be in either expression A or expression B.

I would rather an error be raised only if two separate instances of named expression A both matched. As long as only one matches, it shouldn't matter that the name appears twice.

The goal is to be able to do this|that where this and that both define the same set of named subexpressions.  By definition, only one of this or that will match and, therefore, only one value could be had for a named expression that appears in both this and that.

(As it stands, I have numerous lines of if/else 'this or that' code that generally cause clutter. The groupdict() cannot be treated as a pure result: I often have to go through the this/that logic to normalize the groupdict into something that actually represents the results I wanted.)
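
The normalization step described here could be factored into a helper along the following lines. This is only a sketch of one workaround, not code from the report: the `_a`/`_b` suffix convention, the `PAIR` pattern, and the `normalized_groupdict` function are all assumptions made for illustration.

```python
import re

# Hypothetical convention: each alternative gets its own suffix (_a, _b) so
# that every group name is unique within the pattern.
PAIR = re.compile(
    r"(?:(?P<Tok1_a>[a-z]+) (?P<Tok2_a>[a-z]+))"
    r"|(?:(?P<Tok1_b>[a-z]+)\t(?P<Tok2_b>[a-z]+))"
)


def normalized_groupdict(match, suffixes=("_a", "_b")):
    """Merge suffixed groups into one dict keyed by the base name."""
    merged = {}
    for name, value in match.groupdict().items():
        for suffix in suffixes:
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                if value is not None:
                    # Keep the value from whichever alternative matched.
                    merged[base] = value
                else:
                    # Record the key, but never overwrite a real value.
                    merged.setdefault(base, None)
                break
    return merged


space_result = normalized_groupdict(PAIR.match("foo bar"))
tab_result = normalized_groupdict(PAIR.match("foo\tbar"))
print(space_result, tab_result)
```

The helper keeps the this/that logic in one place, at the cost of a naming convention the pattern author has to maintain by hand.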

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-10-31 18:07

Message:

Since symbolic names are names *of* integer group numbers, 
the regexp compiler should really raise an exception when 
seeing a given symbolic name defined more than once in a 
regexp.
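
For reference, the `re` module in current CPython releases behaves this way: compiling a pattern that defines the same group name twice fails with `re.error`. The snippet below is a minimal check under that assumption, and says nothing about the code base as it stood in 2001. The pattern and variable names are made up for the example.

```python
import re

# The same symbolic name is defined in both alternatives.
DUPLICATE_NAMES = r"(?P<Tok1>[a-z]+) |(?P<Tok1>[a-z]+)\t"

try:
    re.compile(DUPLICATE_NAMES)
    rejected = False
except re.error as exc:
    # The compiler refuses the pattern instead of returning None at match time.
    rejected = True
    print("compile failed:", exc)
```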

----------------------------------------------------------------------
