Find offsets and lengths from re

Wed Oct 30 20:56:21 EST 2002

On Wed, 30 Oct 2002 22:29:20 +0000 (UTC), huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) wrote:

>Bengt Richter <bokr at oz.net> wrote:
>>On Tue, 29 Oct 2002 19:00:00 +0000 (UTC), huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) wrote:
>>
>>>Given a list of substrings, how to efficiently return a list of offsets and
>>>lengths of all occurances of these substrings in a piece of text?
>>>
>>>I can use string find, looping over substrings and locations.  But this is
>>>too slow for a large number of texts.
>>>
>>>The speed of re.findall appears adequate, but I have not found a way to let
>>>re return offsets.  What it returns is a list of substrings that matches.  I
>>>can use this list in a loop of string.find.  This reduces the original
>>>double loop into a single loop.   This still takes quite some time.
>>>
>>>It would be more efficient if re can return offsets directly.  Is there a
>>>way to do that?  
>>>
>>Does this help?
>>
>>http://www.python.org/doc/current/lib/match-objects.html
>
>Thanks.  This helps somewhat.  I assume you mean using finditer and loop
>over the the match objects.  This reduces the total running time of the
>program from 8m40s to 7m31s, the bulk of its time spent on finding these
>positions.  If there is a way to return positions without using a Python for
>loop, I guess the running time can be halved.   I'm willing to patch the re
>module if someone can point me to the right direction.

Did you try something like:

 >>> s = 'Example with xx and yy at several offsets yy xxxx and so forth.'
 >>> import re
 >>> rx = re.compile(r'(xx|yy|with)')
 >>> offs = [(m.start(), m.end()) for m in rx.finditer(s)]

Checking on it:

 >>> for i,j in offs: print 's[%2s:%2s] = %s' % (i, j, s[i:j])
 ...
 s[ 8:12] = with
 s[13:15] = xx
 s[20:22] = yy
 s[42:44] = yy
 s[45:47] = xx
 s[47:49] = xx
 >>> print '%s\n%s' % (s, ' 123456789'*7)
 Example with xx and yy at several offsets yy xxxx and so forth.
  123456789 123456789 123456789 123456789 123456789 123456789 123456789

And it was too slow? BTW, did you need the end offset as well as the start?

I wonder if splitting and counting yourself would go as fast?

I.e., map(len, rx.split(s)) would give you a list of the lengths of all the pieces,
and the odd ones should be the matches.

 >>> map(len,rx.split(s))
 [8, 4, 1, 2, 5, 2, 20, 2, 1, 2, 0, 2, 14]

 Or you could do

 >>> reduce(lambda offs, ss: offs+[offs[-1]+len(ss)],rx.split(s),[0])
 [0, 8, 12, 13, 15, 20, 22, 42, 44, 45, 47, 47, 49, 63]
     ^^^^^  ^^^^^^  ^^^^^^  ^^^^^^  ^^^^^^  ^^^^^^  
or maybe

 >>> reduce(lambda offs, n: offs+[offs[-1]+n],map(len,rx.split(s)),[0])
 [0, 8, 12, 13, 15, 20, 22, 42, 44, 45, 47, 47, 49, 63]

(Compare with above).

Regards,
Bengt Richter