[Tutor] Help with re.sub()

Fri Mar 17 05:54:38 CET 2006

John Clark wrote:
> Hi,
> 
> I have a file that is a long list of records (roughly) in the format
> 
> objid@objdata
> 
> So, for example:
> 
> id1@data1
> id1@data2
> id1@data3
> id1@data4
> id2@data1
> ....
> 
> What I would like to do is run a regular expression against this and
> wind up with:
> 
> id1@data1@data2@data3@data4
> id2@data1

Regular expressions aren't so good at dealing with repeating data like 
this. OTOH itertools.groupby() is perfect for this:

# This represents your original data
data = '''id1@data1
id1@data2
id1@data3
id1@data4
id2@data1
id2@data5'''.splitlines()

# Convert to a list of pairs of (id, data)
data = [ line.split('@') for line in data ]

from itertools import groupby
from operator import itemgetter

# groupby() groups adjacent items that share whatever key we specify,
# so the records for each id must already be next to each other
# itemgetter(0) will pull out just the first item
# the result of groupby() is an iterator of (key, iterator of items)
for objid, items in groupby(data, itemgetter(0)):
    print('%s@%s' % (objid, '@'.join(item[1] for item in items)))
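
One caveat: groupby() only merges items that sit next to each other, so if the records for one id are scattered through the file they need to be sorted first. Here is a minimal sketch of that case. The sample records and the names lines, pairs and merged are made up for illustration, and splitting on only the first '@' is my assumption about how data containing '@' should be treated:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records where the ids are not contiguous
lines = ['id2@data1', 'id1@data1', 'id2@data5', 'id1@data2']

# Split on the first '@' only, so any '@' inside the data part is kept
pairs = [line.split('@', 1) for line in lines]

# sorted() is stable, so records keep their original order within each id
pairs = sorted(pairs, key=itemgetter(0))

# Join the data parts of each group back onto its id
merged = ['%s@%s' % (key, '@'.join(item[1] for item in items))
          for key, items in groupby(pairs, itemgetter(0))]

for line in merged:
    print(line)
```

If the original order of the ids matters, sorting may not be what you want; in that case collecting the data parts in a dict keyed by id is an alternative.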

I have a longer explanation of groupby() and itemgetter() here:
http://www.pycs.net/users/0000323/weblog/2005/12/06.html

> So, my questions are:
> (1) Is there any way to get a single regular expression to handle
> overlapping matches so that I get what I want in one call?

I doubt it though I'd be happy to be proven wrong ;)

> (2) Is there any way (without comparing the before and after strings) to
> know if a re.sub(...) call did anything?

Use re.subn() instead. It returns a tuple of the new string and the number
of substitutions made, so a count of 0 means nothing was changed.
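
A small sketch of how that might look. The sample text and the patterns are made up for illustration, not taken from your file:

```python
import re

# Hypothetical sample text in the objid@objdata format
text = 'id1@data1\nid1@data2\nid2@data1'

# re.subn() returns (new_string, number_of_substitutions_made)
new_text, count = re.subn(r'@', ':', text)
if count:
    print('made %d substitutions' % count)

# A pattern that matches nothing returns the original string and a count of 0
same_text, no_count = re.subn(r'#', ':', text)
if not no_count:
    print('nothing changed')
```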

Kent