[Tutor] re.findall(), but with overlaps?

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sat Sep 3 23:38:46 CEST 2005



> I may put in an enhancement request to change the name of re.findall to
> re.findsome.

Hi Terry,

A typical use of regular expressions is to break text into a sequence of
non-overlapping tokens.  There's nothing that technically stops us from
applying the theory of regular expressions to get overlapping matches, but
that use case is rare enough that it probably won't get into the Standard
Library anytime soon.  Writing customized code that allows overlaps, or
using a third-party module, will probably work better.
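
For what it's worth, here is a minimal sketch of that kind of customized
code, using only the standard 're' module.  It wraps the pattern in a
zero-width lookahead so the scanner never consumes any text and so tries
every starting position.  The helper name 'find_overlapping' is made up
for illustration, and the sketch assumes the pattern has no numbered
backreferences (the extra group would shift their numbers).  It reports at
most one match per starting position, so it is not a complete answer:

```python
import re

def find_overlapping(pattern, text):
    """Return (start, end) spans of matches, allowing them to overlap.

    Hypothetical helper, not part of the standard library.
    """
    # The lookahead (?=...) matches without consuming characters, so
    # finditer() advances one position at a time.  The inner group
    # captures what the original pattern would have matched there.
    lookahead = re.compile('(?=(%s))' % pattern)
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(text)]

# 'BEB' at (3, 6) and 'BIB' at (5, 8) share the 'B' at index 5.
print(find_overlapping('B[A-Z]B', 'BABBEBIB'))
```

Plain re.findall('B[A-Z]B', 'BABBEBIB') would skip 'BIB', because the
'B' it starts with was already consumed by the 'BEB' match.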

You may want to ask on comp.lang.python and see if someone else has had
the need for overlapping matches --- there might be other people who've
run into that problem too.

I've helped to adapt a specialized pattern matcher for Python; not sure if
this might interest you, but:

    http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

and the Aho-Corasick search automaton that I've adapted does do
overlapping matches of keywords:

######
>>> import ahocorasick
>>> tree = ahocorasick.KeywordTree()
>>> for i in range(ord('A'), ord('Z') + 1):
...     tree.add('B' + chr(i) + 'B')
...
>>> tree.make()
>>> tree.findall('BABBEBIB', allow_overlaps = True)
<generator object at 0x403a9fec>
>>> list(tree.findall('BABBEBIB', allow_overlaps = True))
[(0, 3), (3, 6), (5, 8)]
######

The ahocorasick module doesn't provide full regexp power (and the example
shows that I have to simulate wildcards... *grin*), but it might still be
useful, depending on what you're really trying to do.  The link above also
refers to Nicolas Lehuen's 'pytst' module, which might also be useful for
you.
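
If all you need is a handful of fixed keywords and you'd rather not
install an extension module, a naive pure-Python scan gives the same kind
of overlapping spans.  To be clear, this is not the Aho-Corasick algorithm
and not how the ahocorasick module works: it rescans the text once per
keyword, so it will be much slower with many keywords.  The function name
'find_keywords' is just something I made up for this sketch:

```python
def find_keywords(keywords, text):
    """Return sorted (start, end) spans of every keyword occurrence.

    Naive approach: one str.find() loop per keyword.  Overlaps are
    allowed because each search resumes one character past the start
    of the previous hit, not past its end.
    """
    spans = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            spans.append((start, start + len(kw)))
            start = text.find(kw, start + 1)
    return sorted(spans)

# Same simulated wildcard as in the session above: 'BAB', 'BBB', ..., 'BZB'.
keywords = ['B' + chr(i) + 'B' for i in range(ord('A'), ord('Z') + 1)]
print(find_keywords(keywords, 'BABBEBIB'))
```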

Best of wishes to you!


