Find offsets and lengths from re

Alex Martelli aleax at aleax.it
Thu Oct 31 03:13:09 EST 2002


Huaiyu Zhu wrote:
   ...
>>>Given a list of substrings, how to efficiently return a list of offsets
>>>and lengths of all occurrences of these substrings in a piece of text?
   ...
>>http://www.python.org/doc/current/lib/match-objects.html
> 
> Thanks.  This helps somewhat.  I assume you mean using finditer and looping
> over the match objects.  This reduces the total running time of the
> program from 8m40s to 7m31s, the bulk of its time spent on finding these
> positions.  If there is a way to return positions without using a Python
> for loop, I guess the running time can be halved.   I'm willing to patch
> the re module if someone can point me in the right direction.

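Avoiding a Python loop is quite easy thanks to the sub method of compiled
RE objects: pass it a callable, and it gets called once per match with the
match object as its argument.  I'm not sure it will help you enough with
performance, but try: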

import re

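class OffsetsAndLengthFinder(object):

    def __init__(self, substrings):
        # One alternation of all the substrings, escaped so they match literally.
        self.re = re.compile('|'.join(map(re.escape, substrings)))

    def find(self, astring):
        self._results = []
        # sub calls self._sub once per match; the substituted string is discarded.
        self.re.sub(self._sub, astring)
        return self._results

    def _sub(self, mo):
        # Record the (offset, length) of this match.
        self._results.append((mo.start(), mo.end()-mo.start()))
        # Return an explicit replacement string; its value is irrelevant here.
        return ''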


samplesubs = 'one two three four five six seven eight nine ten'.split()
samplestring = '''bone tensix ninetofive attworsix'''

finder = OffsetsAndLengthFinder(samplesubs)
for offset, length in finder.find(samplestring):
    # A single formatted argument prints the same under old and new print syntax.
    print('%d %d %s' % (offset, length, samplestring[offset:offset+length]))
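
For comparison, the finditer approach mentioned in the quoted message can be
written as a list comprehension rather than an explicit for statement.  This
is only a sketch: the function name find_offsets_and_lengths is made up for
illustration and is not part of the re module or of the code above, and no
claim is made here about which variant runs faster.

```python
import re

def find_offsets_and_lengths(substrings, text):
    # Hypothetical helper name, not from the original post.
    # Same alternation pattern as OffsetsAndLengthFinder above.
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    # finditer yields non-overlapping matches, left to right.
    return [(mo.start(), mo.end() - mo.start())
            for mo in pattern.finditer(text)]

samplesubs = 'one two three four five six seven eight nine ten'.split()
samplestring = 'bone tensix ninetofive attworsix'

for offset, length in find_offsets_and_lengths(samplesubs, samplestring):
    print('%d %d %s' % (offset, length, samplestring[offset:offset+length]))
```

Both this and the sub-based class report only non-overlapping occurrences,
and the alternation tries the substrings in the order they were listed, so
when one substring is a prefix of another the one listed first wins.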



Alex



