Splitting a sequence into pieces with identical elements

Tue Aug 10 22:31:09 EDT 2010

On 08/10/10 20:30, MRAB wrote:
> Tim Chase wrote:
>>    r = re.compile(r'((.)\1*)')
>>    #r = re.compile(r'((\w)\1*)')
>
> That should be \2, not \1.
>
> Alternatively:
>
>       r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination 
of them into one malfunctioning regexp.  My original trouble with 
the 2nd one was that r.findall() (not .finditer()) was only 
returning the first letter of each run, because that's all the 
single group captures.  Wrapping it in an extra set of parens and 
using "\2" returned the actual data in sub-tuples:

 >>> s = 'spppammmmegggssss'
 >>> import re
 >>> r = re.compile(r'(.)\1*')
 >>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
 >>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
 >>> r = re.compile(r'((.)\2*)')
 >>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'), 
('ggg', 'g'), ('ssss', 's')]
 >>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
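
For reference, here is a small sketch of how findall() seems to 
pick its return value from the number of groups in the pattern. 
The sample string and variable names are just for illustration:

```python
import re

s = 'aabccc'

# No groups: findall() returns the whole text of each match.
no_groups = re.findall(r'a+', s)

# One group: findall() returns only what that group captured,
# even though each match spans the entire run.
one_group = re.findall(r'(.)\1*', s)

# Two groups: findall() returns one tuple per match, so the
# outer group carries the full run and the inner one the letter.
two_groups = re.findall(r'((.)\2*)', s)

# Keeping just the first element of each tuple recovers the runs.
runs = [whole for whole, _ in two_groups]

print(no_groups)   # ['aa']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('aa', 'a'), ('b', 'b'), ('ccc', 'c')]
print(runs)        # ['aa', 'b', 'ccc']
```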

Switching to .finditer() and taking m.group(0) then made both of 
them work the way I wanted.
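
A minimal sketch of wrapping that up as a function, in case it's 
useful.  The name split_runs is made up for this example, and 
adding re.DOTALL is an assumption on my part: without it "." does 
not match a newline, so newline runs would be silently skipped.

```python
import re

# Hypothetical helper, not from the thread.  re.DOTALL is an
# assumption: it lets "." match "\n" so newline runs are kept.
_RUN = re.compile(r'(.)\1*', re.DOTALL)

def split_runs(s):
    """Return the maximal runs of identical characters in s, in order."""
    # group(0) is the whole match, regardless of how many
    # capturing groups the pattern has.
    return [m.group(0) for m in _RUN.finditer(s)]

print(split_runs('spppammmmegggssss'))
# ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
```

Since group(0) doesn't depend on the grouping, the same function 
should work unchanged with the r'((.)\2*)' form of the pattern.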

Thanks for catching my mistranscription.

-tkc