Find offsets and lengths from re
Alex Martelli
aleax at aleax.it
Thu Oct 31 03:13:09 EST 2002
Huaiyu Zhu wrote:
...
>>>Given a list of substrings, how to efficiently return a list of offsets
>>>and lengths of all occurrences of these substrings in a piece of text?
...
>>http://www.python.org/doc/current/lib/match-objects.html
>
> Thanks. This helps somewhat. I assume you mean using finditer and looping
> over the match objects. This reduces the total running time of the
> program from 8m40s to 7m31s, the bulk of its time spent on finding these
> positions. If there is a way to return positions without using a Python
> for loop, I guess the running time can be halved. I'm willing to patch
> the re module if someone can point me in the right direction.
Avoiding a Python loop is quite easy thanks to the sub method of RE's --
not sure if it will help you enough with performance, but, try:
import re

class OffsetsAndLengthFinder(object):
    def __init__(self, substrings):
        # One alternation of all the (escaped) literal substrings
        self.re = re.compile('|'.join(map(re.escape, substrings)))

    def find(self, astring):
        self._results = []
        # sub calls self._sub once per match; the substituted string is discarded
        self.re.sub(self._sub, astring)
        return self._results

    def _sub(self, mo):
        self._results.append((mo.start(), mo.end()-mo.start()))
        return ''

samplesubs = 'one two three four five six seven eight nine ten'.split()
samplestring = '''bone tensix ninetofive attworsix'''
finder = OffsetsAndLengthFinder(samplesubs)
for offset, length in finder.find(samplestring):
    print(offset, length, samplestring[offset:offset+length])
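
For comparison, here is a minimal sketch of the finditer approach mentioned in the quoted text, written as a list comprehension. The function name find_offsets_and_lengths and its longest_first parameter are hypothetical, introduced only for illustration; they are not part of the re module. Whether this is any faster than the sub-based class would have to be measured on the real data. The sketch also shows one property of the '|'.join pattern worth knowing: alternation tries its branches left to right, so when one substring is a prefix of another, the one listed first is the one reported.

```python
import re

def find_offsets_and_lengths(substrings, text, longest_first=False):
    """Return (offset, length) pairs for non-overlapping matches in text.

    Hypothetical helper for illustration, not part of the re module.
    """
    if longest_first:
        # Alternation tries branches left to right, so list longer
        # substrings first to prefer 'sixteen' over 'six'.
        substrings = sorted(substrings, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    return [(mo.start(), mo.end() - mo.start())
            for mo in pattern.finditer(text)]

subs = 'one two three four five six seven eight nine ten'.split()
text = 'bone tensix ninetofive attworsix'
results = find_offsets_and_lengths(subs, text)
for offset, length in results:
    print(offset, length, text[offset:offset + length])

# Prefix case: the branch listed first is matched unless longest_first is set.
first_listed = find_offsets_and_lengths(['six', 'sixteen'], 'sixteen')
longest = find_offsets_and_lengths(['six', 'sixteen'], 'sixteen',
                                   longest_first=True)
```

On the sample string this should report the same matches as the class above. Both versions return only non-overlapping matches, so occurrences that overlap an earlier match are not reported.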
Alex