Searching a binary file

Samuel Scarano srs25 at cornell.edu
Tue Jul 25 11:06:11 EDT 2000


Interesting problem. I'd like to propose a generalized one which I may be
facing soon:

Suppose you want to search with an *arbitrary* regular expression. In this
case, you don't know how long the matching string may be, so you don't know
how much overlap you need between buffers.

Is it necessary, perhaps, to use a constant buffer-overlap size, thus
placing a bound on the length of a matched string?

Just to make it harder, suppose the buffers are numerous and arbitrarily
small -- sometimes smaller than the matched-string length bound.

Is there a way to get the regular expression engine to save its state at the
end of one buffer and pick up where it left off on the next? I'm almost
certain that this is possible (at least with the sort of naive DFA-based
regex implementation that I have in mind), but has such functionality been
made available in python?

luthi at vaw.baug.ethz.NOSPAM.ch wrote:
> 
> What is the fastest way to search a binary file for a certain byte pattern
> (**** in my case)?
> 
> I came up with this solution, but I guess there is a better way to do
> it. Thanks for all suggestions.
> 
> Martin Luethi
> 
> =====
> 
> # constants
> MaxFileSizeInMemory = 10*1000*1000
> 
> file = open('my_binary_file', 'b')
> 
> file.seek(0,2)                   # set the pointer to the end of the file
> filesize = file.tell()           # get the size of the file
> file.seek(0)                     # reset the file pointer
> rex = re.compile(r'[\*]{4}')     # the bytes to search for
> starpointer = []                 # position of the '****' in the file
> oldpos = 0
> for i in range(filesize/MaxFileSizeInMemory + 1):
>     data = self.file.read(MaxFileSizeInMemory) # read a chunk of bytes
>     m = rex.search(data)
>     while m:                     # find all '****' in this chunk
>         pos = m.start() - 4      # corrected for the length of '****'
>         incpointer.append(pos + i*MaxFileSizeInMemory)
>         m = rex.search(data, pos + 10)
>         oldpos = pos
> file.close()                     # close the file
> 
> --
> ============================================================
> Martin Luethi                   Tel. +41 1 632 40 92
> VAW ETH Zuerich
> CH-8092 Zuerich                 mail luthi at vaw.baum.ethz.ch
> Switzerland
> ============================================================

-- 
Samuel R. Scarano                        "Due to circumstances beyond my
undergraduate, Cornell University        control, I am master of my fate
<srs25 at cornell.edu>                      and captain of my soul."
http://people.cornell.edu/pages/srs25/             -- Ashleigh Brilliant



More information about the Python-list mailing list