Find offsets and lengths from re

Huaiyu Zhu huaiyu at gauss.almadan.ibm.com
Wed Oct 30 17:29:20 EST 2002


Bengt Richter <bokr at oz.net> wrote:
>On Tue, 29 Oct 2002 19:00:00 +0000 (UTC), huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) wrote:
>
>>Given a list of substrings, how to efficiently return a list of offsets and
>>lengths of all occurrences of these substrings in a piece of text?
>>
>>I can use string find, looping over substrings and locations.  But this is
>>too slow for a large number of texts.
>>
>>The speed of re.findall appears adequate, but I have not found a way to let
>>re return offsets.  What it returns is a list of the substrings that match.
>>I can use this list in a loop of string.find, which reduces the original
>>double loop to a single loop, but this still takes quite some time.
>>
>>It would be more efficient if re could return offsets directly.  Is there a
>>way to do that?
>>
>Does this help?
>
>http://www.python.org/doc/current/lib/match-objects.html

Thanks.  This helps somewhat.  I assume you mean using finditer and looping
over the match objects.  This reduces the total running time of the
program from 8m40s to 7m31s, with the bulk of that time still spent finding
these positions.  If there were a way to return the positions without a
Python for loop, I guess the running time could be halved.  I'm willing to
patch the re module if someone can point me in the right direction.
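
For reference, here is a minimal sketch of the finditer approach as I understand
it.  The function name find_offsets and the sample data are made up for
illustration, and this is not the code from my actual program.  It joins the
escaped substrings into one alternation and reads the offset and length off
each match object.  Note that it still iterates in Python (via a list
comprehension), and that re reports only non-overlapping matches.

```python
import re

def find_offsets(substrings, text):
    """Return (offset, length) pairs for non-overlapping occurrences."""
    # Drop empty strings: an empty alternative would match at every position.
    # Sort longest first so the alternation prefers the longest candidate.
    ordered = sorted({s for s in substrings if s}, key=len, reverse=True)
    if not ordered:
        return []
    # re.escape makes each substring match literally.
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # m.start() is the offset; m.end() - m.start() is the match length.
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]

# Hypothetical sample data.
print(find_offsets(["ab", "abc", "d"], "abcd ab d"))
# → [(0, 3), (3, 1), (5, 2), (8, 1)]
```

Compiling the pattern once and reusing it across many texts avoids repeating
the compilation cost, though I have not measured how much that matters here.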

Huaiyu


