Bit substring search

Scott David Daniels Scott.Daniels at Acm.Org
Wed Jun 25 09:11:05 EDT 2008


Kris Kennaway wrote:
> Thanks for the pointers, I think a C extension will end up being the way 
> to go, unless someone has beaten me to it and I just haven't found it yet.

Depending on the pattern length you are targeting, it may be fastest to
increase the out-of-loop work.  For a 40-bit string, build an 8-target
Aho-Corasick machine, and at each match check the endpoints.  This will
only work well if 40 bits is at the low end of what you are hunting for.

Roughly:

     targets[0] = 5-byte string as byte-aligned
     targets[N in 1..7] = 4-byte strings representing the lead
                    four bytes after discarding the high-order N bits
     m = AhoCorasick(targets)

     def hunt(m, source):
         m.state = 0
         old_block = None
         for n, block in enumerate(source):
             for recognized, where in m.search(block):
                 if recognized:
                     if <check lead and trail bits>:
                         yield block, where, recognized
                 else:
                     yield block, where, 0

--Scott David Daniels
Scott.Daniels at Acm.Org



More information about the Python-list mailing list