Searching a binary file -- More concrete instance of generalized searching problem

Samuel Scarano srs25 at cornell.edu
Tue Jul 25 11:50:37 EDT 2000


Me again. I just thought of a good example that illustrates what I'm
thinking of.

Suppose the string you're matching against comes in incrementally -- as
lines of text to stdin, or perhaps a TCP stream. The total length can be 
arbitrarily large, so you don't want to just concatenate incoming data to a
perpetually growing string. And the match may contain the first character,
so you can't just discard old data (e.g., matching 'a.*b' against a string
that starts with 'a' and whose first 'b' is at the billionth position).

I know its *possible* -- at least with "pure" regular expressions -- because
they define regular languages, and membership in a regular language can be
determined by scanning a string from left to right without storing a single
character -- all that must be stored is the state of the DFA corresponding
to the regex.

Mmmm... perhaps I should forget all this just impose a bound on the match
length. But even then, every time you get a new piece of the string in, you
have to re-search that overlap -- that's a *lot* of redundant searching. Too
ugly. A regex engine that saves its state would be very useful I think....

Samuel Scarano wrote:
> 
> Interesting problem. I'd like to propose a generalized one which I may be
> facing soon:
> 
> Suppose you want to search with an *arbitrary* regular expression. In this
> case, you don't know how long the matching string may be, so you don't know
> how much overlap you need between buffers.
> 
> Is it necessary, perhaps, to use a constant buffer-overlap size, thus
> placing a bound on the length of a matched string?
> 
> Just to make it harder, suppose the buffers are numerous and arbitrarily
> small -- sometimes smaller than the matched-string length bound.
> 
> Is there a way to get the regular expression engine to save its state at the
> end of one buffer and pick up where it left off on the next? I'm almost
> certain that this is possible (at least with the sort of naive DFA-based
> regex implementation that I have in mind), but has such functionality been
> made available in python?
> 
> luthi at vaw.baug.ethz.NOSPAM.ch wrote:
> >
> > What is the fastest way to search a binary file for a certain byte pattern
> > (**** in my case)?
> >
> > I came up with this solution, but I guess there is a better way to do
> > it. Thanks for all suggestions.
> >
> > Martin Luethi
> >
> > =====
> >
> > # constants
> > MaxFileSizeInMemory = 10*1000*1000
> >
> > file = open('my_binary_file', 'b')
> >
> > file.seek(0,2)                   # set the pointer to the end of the file
> > filesize = file.tell()           # get the size of the file
> > file.seek(0)                     # reset the file pointer
> > rex = re.compile(r'[\*]{4}')     # the bytes to search for
> > starpointer = []                 # position of the '****' in the file
> > oldpos = 0
> > for i in range(filesize/MaxFileSizeInMemory + 1):
> >     data = self.file.read(MaxFileSizeInMemory) # read a chunk of bytes
> >     m = rex.search(data)
> >     while m:                     # find all '****' in this chunk
> >         pos = m.start() - 4      # corrected for the length of '****'
> >         incpointer.append(pos + i*MaxFileSizeInMemory)
> >         m = rex.search(data, pos + 10)
> >         oldpos = pos
> > file.close()                     # close the file
> >
> > --
> > ============================================================
> > Martin Luethi                   Tel. +41 1 632 40 92
> > VAW ETH Zuerich
> > CH-8092 Zuerich                 mail luthi at vaw.baum.ethz.ch
> > Switzerland
> > ============================================================
> 
> --
> Samuel R. Scarano                        "Due to circumstances beyond my
> undergraduate, Cornell University        control, I am master of my fate
> <srs25 at cornell.edu>                      and captain of my soul."
> http://people.cornell.edu/pages/srs25/             -- Ashleigh Brilliant

-- 
Samuel R. Scarano                        "Due to circumstances beyond my
undergraduate, Cornell University        control, I am master of my fate
<srs25 at cornell.edu>                      and captain of my soul."
http://people.cornell.edu/pages/srs25/             -- Ashleigh Brilliant



More information about the Python-list mailing list