Find offsets and lengths from re

Bengt Richter bokr at oz.net
Thu Oct 31 06:10:23 EST 2002


On Thu, 31 Oct 2002 08:13:09 GMT, Alex Martelli <aleax at aleax.it> wrote:

>Huaiyu Zhu wrote:
>   ...
>>>>Given a list of substrings, how to efficiently return a list of offsets
>>>>and lengths of all occurances of these substrings in a piece of text?
>   ...
>>>http://www.python.org/doc/current/lib/match-objects.html
>> 
>> Thanks.  This helps somewhat.  I assume you mean using finditer and loop
>> over the the match objects.  This reduces the total running time of the
>> program from 8m40s to 7m31s, the bulk of its time spent on finding these
>> positions.  If there is a way to return positions without using a Python
>> for
>> loop, I guess the running time can be halved.   I'm willing to patch the
>> re module if someone can point me to the right direction.
>
>Avoiding a Python loop is quite easy thanks to the sub method of RE's --
>not sure if it will help you enough with performance, but, try:
>
>import re
>
>class OffsetsAndLengthFinder(object):
>
>    def __init__(self, substrings):
>        self.re = re.compile('|'.join(map(re.escape, substrings)))
>
>    def find(self, astring):
>        self._results = []
>        self.re.sub(self._sub, astring)
>        return self._results
>
>    def _sub(self, mo):
>        self._results.append((mo.start(), mo.end()-mo.start()))
>
>
>samplesubs = 'one two three four five six seven eight nine ten'.split()
>samplestring = '''bone tensix ninetofive attworsix'''
>
>finder = OffsetsAndLengthFinder(samplesubs)
>for offset, length in finder.find(samplestring):
>    print offset, length, samplestring[offset:offset+length]
>
Thanks, I had skimmed right past the fact that the the replacement string argument to sub
can be a function. That should come in handy. Though if you're building a list
with the function, it's not that different from

    rx = re.compile('|'.join(map(re.escape, samplesubs)))
    offsets = [(m.start(), m.end()) for m in rx.finditer(samplestring)]

and probably has a bit more overhead. (The result here is a tiny bit different since
I assume the print will subtract for the length and use a typical slice instead of v.v.)

    offsets = map(lambda m: (m.start(), m.end()), rx.finditer(samplestring))

should work too.

Regards,
Bengt Richter



More information about the Python-list mailing list