regex help: splitting string gets weird groups

Patrick Maupin pmaupin at gmail.com
Thu Apr 8 15:46:01 EDT 2010


On Apr 8, 1:49 pm, gry <georgeryo... at gmail.com> wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
>
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
> group?  Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.

IMO, for most purposes, for people who don't want to become re
experts, the easiest, fastest, best, most predictable way to use re is
re.split.  You can either call re.split directly, or, if you are going
to be splitting on the same pattern over and over, compile the pattern
and grab its split method.  Use a *single* capture group that covers
the *whole* pattern.  In the case of your example data:

>>> import re
>>> splitter=re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
>>> s='555tHe-rain.in#=1234'
>>> [x for x in splitter(s) if x]
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The reason for the list comprehension is that re.split always returns
the non-matching text around each match: before the first match,
between consecutive matches, and after the last one.  Those pieces are
empty strings whenever two matches are adjacent.  Sometimes an empty
string is useful (see the recent discussion in the group about
splitting digits out of a string), but if you don't care to see empty
strings, this comprehension will remove them.
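
To make the empty strings concrete, here is a small sketch of the raw,
unfiltered output.  The variable names (raw, with_gap, matches) and the
second input string are only for illustration; the pattern is the one
from the example above.

```python
import re

# Same pattern as above: one capture group around the whole alternation.
splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split

# Adjacent matches: every "between" piece is an empty string.
raw = splitter('555tHe')
print(raw)  # ['', '555', '', 'tHe', '']

# A character the pattern does not match (here a space) shows up as a
# non-empty "between" piece instead of being lost.
with_gap = splitter('ab 12')
print(with_gap)  # ['', 'ab', ' ', '12', '']

# Matches sit at the odd indices, non-matching text at the even ones.
matches = raw[1::2]
print(matches)  # ['555', 'tHe']
```

Filtering with "if x" drops the empty pieces but would keep something
like the ' ' above, which is usually what you want if you need to notice
input the pattern did not anticipate.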

The reason for a single capture group that covers the whole pattern is
that it is much easier to reason about the output.  The split will
give you all your data, in order, e.g.

>>> ''.join(splitter(s)) == s
True
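
On the original question of why '1234' shows up twice and 'tHe' not at
all: as far as I understand it, the regex is legal, but a capture group
that sits inside a repetition only remembers the *last* thing it
matched.  Group 1 is the outer group (last iteration: '1234'), and
groups 2 to 4 hold the last letters, digits and punctuation seen.  A
rough sketch follows; the names groups and pieces are mine, and
re.findall is an alternative to the re.split approach above, not a
replacement for it.

```python
import re

s = '555tHe-rain.in#=1234'

# The pattern from the question: groups nested inside a repeated group.
m = re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', s)
groups = m.groups()
# Each group keeps only its most recent match across the repetitions.
print(groups)  # ('1234', 'in', '1234', '=')

# With no capture groups at all, findall returns every whole match in order.
pieces = re.findall('[A-Za-z]+|[0-9]+|[-.#=]', s)
print(pieces)  # ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
```

One caveat with findall: characters the pattern does not match are
silently skipped, so the join-equals-original check above no longer
holds for arbitrary input.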

HTH,
Pat


