Find offsets and lengths from re
Bengt Richter
bokr at oz.net
Thu Oct 31 06:10:23 EST 2002
On Thu, 31 Oct 2002 08:13:09 GMT, Alex Martelli <aleax at aleax.it> wrote:
>Huaiyu Zhu wrote:
> ...
>>>>Given a list of substrings, how to efficiently return a list of offsets
>>>>and lengths of all occurances of these substrings in a piece of text?
> ...
>>>http://www.python.org/doc/current/lib/match-objects.html
>>
>> Thanks. This helps somewhat. I assume you mean using finditer and loop
>> over the the match objects. This reduces the total running time of the
>> program from 8m40s to 7m31s, the bulk of its time spent on finding these
>> positions. If there is a way to return positions without using a Python
>> for
>> loop, I guess the running time can be halved. I'm willing to patch the
>> re module if someone can point me to the right direction.
>
>Avoiding a Python loop is quite easy thanks to the sub method of RE's --
>not sure if it will help you enough with performance, but, try:
>
>import re
>
>class OffsetsAndLengthFinder(object):
>
> def __init__(self, substrings):
> self.re = re.compile('|'.join(map(re.escape, substrings)))
>
> def find(self, astring):
> self._results = []
> self.re.sub(self._sub, astring)
> return self._results
>
> def _sub(self, mo):
> self._results.append((mo.start(), mo.end()-mo.start()))
>
>
>samplesubs = 'one two three four five six seven eight nine ten'.split()
>samplestring = '''bone tensix ninetofive attworsix'''
>
>finder = OffsetsAndLengthFinder(samplesubs)
>for offset, length in finder.find(samplestring):
> print offset, length, samplestring[offset:offset+length]
>
Thanks, I had skimmed right past the fact that the the replacement string argument to sub
can be a function. That should come in handy. Though if you're building a list
with the function, it's not that different from
rx = re.compile('|'.join(map(re.escape, samplesubs)))
offsets = [(m.start(), m.end()) for m in rx.finditer(samplestring)]
and probably has a bit more overhead. (The result here is a tiny bit different since
I assume the print will subtract for the length and use a typical slice instead of v.v.)
offsets = map(lambda m: (m.start(), m.end()), rx.finditer(samplestring))
should work too.
Regards,
Bengt Richter
More information about the Python-list
mailing list