Splitting a sequence into pieces with identical elements

Tim Chase python.list at tim.thechases.com
Tue Aug 10 22:31:09 EDT 2010


On 08/10/10 20:30, MRAB wrote:
> Tim Chase wrote:
>>    r = re.compile(r'((.)\1*)')
>>    #r = re.compile(r'((\w)\1*)')
>
> That should be \2, not \1.
>
> Alternatively:
>
>       r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination
of them into one malfunctioning regexp.  My original trouble with
the 2nd one was that r.findall() (not .finditer()) was only
returning the first letter of each run, because with a single
group in the pattern findall() returns that group's text rather
than the whole match.  Wrapping it in the extra set of parens and
using "\2" returned the actual data in sub-tuples:

 >>> s = 'spppammmmegggssss'
 >>> import re
 >>> r = re.compile(r'(.)\1*')
 >>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
 >>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
 >>> r = re.compile(r'((.)\2*)')
 >>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'), 
('ggg', 'g'), ('ssss', 's')]
 >>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

Switching to .finditer() and taking m.group(0) then made both
patterns work the way I wanted.
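
For anyone wanting this as a reusable function, here is a minimal
sketch of the .finditer() approach.  The name split_runs is made up
for illustration, and adding re.DOTALL is my assumption (so that runs
of newlines are grouped too); neither comes from the earlier messages.

```python
import re

# (.)\1* matches one character followed by zero or more repeats of
# that same character.  re.DOTALL is an added assumption so that "."
# also matches newlines.
_RUN = re.compile(r'(.)\1*', re.DOTALL)

def split_runs(s):
    """Return the maximal runs of identical characters in s (hypothetical helper)."""
    # group(0) is the whole match, so the single capture group does not
    # get in the way as it does with findall().
    return [m.group(0) for m in _RUN.finditer(s)]

print(split_runs('spppammmmegggssss'))
# ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
```

This only handles strings; for an arbitrary sequence something other
than a regexp would be needed.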

Thanks for catching my mistranscription.

-tkc





