Regular expression bug?

Thu Feb 19 16:02:39 EST 2009

In article <mailman.281.1235073821.11746.python-list at python.org>,
 MRAB <google at mrabarnett.plus.com> wrote:

> Ron Garret wrote:
> > I'm trying to split a CamelCase string into its constituent components.  
> > This kind of works:
> > 
> > >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> > ['fo', 'a', 'az']
> > 
> > but it consumes the boundary characters.  To fix this I tried using 
> > lookahead and lookbehind patterns instead, but it doesn't work:
> > 
> > >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> > ['fooBarBaz']
> > 
> > However, it does seem to work with findall:
> > 
> > >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> > ['', '']
> > 
> > So the regular expression seems to be doing the Right Thing.  Is this a 
> > bug in re.split, or am I missing something?
> > 
> > (BTW, I tried looking at the source code for the re module, but I could 
> > not find the relevant code.  re.split calls sre_compile.compile().split, 
> > but the string 'split' does not appear in sre_compile.py.  So where does 
> > this method come from?)
> > 
> > I'm using Python 2.5.
> > 
> I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
> might be intentional, but changing it could break some existing code.

That seems unlikely.  A change would only break code that calls re.split 
with a pattern that can match the empty string, which at the moment is 
essentially a no-op: the string comes back unsplit.  It's hard to imagine 
there's a lot of code like that around.  What would be the point?
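
To make concrete what I expected re.split to do with empty matches, here 
is a rough sketch that cuts the string by hand at each position reported 
by re.finditer.  The helper name split_at_matches is made up for 
illustration; it is not part of the re module, and I am only assuming it 
mirrors the intended behaviour for patterns without capturing groups.

```python
import re

def split_at_matches(pattern, text):
    # Hypothetical helper, not part of the re module: cut `text` at every
    # match of `pattern`, including zero-width (empty) matches.
    pieces = []
    start = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[start:m.start()])  # text before this match
        start = m.end()                       # resume after the match
    pieces.append(text[start:])               # whatever is left over
    return pieces

parts = split_at_matches(r'(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
print(parts)
```

For ordinary non-empty matches the matched text is dropped, just as 
re.split does when the pattern has no groups.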

> You could do this instead:
> 
>  >>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
> ['foo', 'Bar', 'Baz']

Blech!  ;-)  But thanks for the suggestion.
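
Another way around it, sketched here under the assumption that the names 
contain only ASCII letters and no runs of capitals (something like 
'HTTPServer' would lose characters), is to match the words themselves 
with re.findall instead of splitting at the boundaries.  The function 
name camel_words is just a placeholder.

```python
import re

def camel_words(name):
    # Assumes ASCII letters only and no acronym runs such as 'HTTPServer'.
    # Each word is an optional capital followed by one or more lowercase
    # letters, so nothing has to be matched at a zero-width boundary.
    return re.findall(r'[A-Z]?[a-z]+', name)

words = camel_words('fooBarBaz')
print(words)
```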

rg