Regular expression bug?

Kurt Smith kwmsmith at gmail.com
Thu Feb 19 14:41:17 EST 2009


On Thu, Feb 19, 2009 at 12:55 PM, Ron Garret <rNOSPAMon at flownet.com> wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length boundaries.
It needs something to split on, like str.split.  Is this a bug?
Possibly.  The docs for re.split say:

    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.
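
To make the zero-length point concrete, here is a minimal sketch that
finds the boundaries with re.finditer and slices the string by hand, so
it does not depend on how re.split treats empty matches.  The helper
name split_at_boundaries is made up for illustration; it is not part of
the re module.  (Python 3.7 and later did change re.split to split on
empty matches, so this mostly matters on older interpreters.)

```python
import re

def split_at_boundaries(pattern, text):
    """Split text at every match of pattern, including empty matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        # For a zero-length match m.start() == m.end(), so no
        # characters are consumed at the boundary.
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

print(split_at_boundaries(r'(?<=[a-z])(?=[A-Z])', 'fooBarBaz'))
# ['foo', 'Bar', 'Baz']
```

The lookbehind/lookahead pair matches at offsets 3 and 6 without
consuming anything, which is why the slices come out intact.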

I can work around the problem thusly:

re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly, and it also splits on any underscore already in the
string.  I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.
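
A rough sketch of that findall idea, assuming every component is an
optional capital followed by one or more lowercase letters.  The
function name camel_components is my own invention, and the pattern is
only one guess at what "component" should mean:

```python
import re

def camel_components(name):
    """Return the CamelCase components of name (illustrative only).

    Assumes each component is an optional uppercase letter followed by
    one or more lowercase letters.
    """
    # The optional [A-Z] lets the first component start in lowercase,
    # which covers the "beginning" special case for names like fooBar.
    return re.findall(r'[A-Z]?[a-z]+', name)

print(camel_components('fooBarBaz'))
# ['foo', 'Bar', 'Baz']
```

Because findall only returns what the pattern matches, anything outside
that shape (digits, runs of capitals, punctuation) is silently dropped,
so the pattern would need widening for real-world identifiers.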

Kurt


