Identifying File type by reading files

Fri Dec 26 18:07:55 EST 2003

hokiegal99 <hokiegal99 at hotmail.com> wrote:
> Are there certain sets of binary data that are unique to each file type
> and would be a better way of identifying them? For example, on line N
> of an MS doc file, beginning at position X, a binary string that is L
> digits in length and that begins with B and ends with E will *ALWAYS* be
> present... Someone tell me that I'm not dreaming and that something
> like the above example exists???
>
> A few of my string searches today:
>
> import os
>
> # Read the file once, then search the raw bytes for each marker.
> # find() returns the offset of the marker, or -1 if it is absent.
> with open(os.path.join(root, fname), 'rb') as f:
>     data = f.read()
> doc = data.find(b'Word.Document.')
> xls = data.find(b'Excel.Sheet.')
> pdf = data.find(b'PDF-1.')
> jpg = data.find(b'JFIF')
>
> Any suggestions or information that better describes how to positively
> ID files w/o the possibility of mistake would be very helpful to me. As
> of now, some of my files, though not many (~2%), will be given the
> wrong extension. However, the functions append every extension that
> probably applies to the file, so at that point it is a simple process
> of elimination to determine which extension is actually the correct
> one. Normally, I never have more than 2 unique extensions attached to
> the same file.

Glutton for punishment, aren't you? :-)

Seriously, that is a non-trivial problem. If that's what you're trying
to do, though, the file format documentation at http://www.wotsit.org/
may be useful to you. Good luck!
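
To make the idea concrete: many formats do begin with a fixed byte
sequence (a "magic number"), so checking the first few bytes is usually
more reliable than searching the whole file for a string. Below is a
minimal sketch, not a complete solution. The names SIGNATURES,
guess_type and guess_file_type are hypothetical, and the table covers
only three signatures: "%PDF-" for PDF, FF D8 FF for JPEG, and the
8-byte OLE2 header shared by legacy Word and Excel files.

```python
# Leading-byte signatures ("magic numbers") for a few common formats.
# This table is illustrative only; real-world detection needs many more.
# OLE2 is the container used by both legacy Word and Excel files, so the
# header alone cannot tell a .doc from an .xls.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpg"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]


def guess_type(header):
    """Return the label of the first signature that header starts with, or None."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def guess_file_type(path):
    """Read only the first 8 bytes of the file at path and classify them."""
    with open(path, "rb") as f:
        return guess_type(f.read(8))
```

Because .doc and .xls share the OLE2 header, a second step is still
needed to tell them apart. One option would be to run marker searches
like 'Word.Document.' and 'Excel.Sheet.' only on files that come back
as "ole2", which should at least keep those strings from matching
inside unrelated files.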

-- 
Robin Munn
rmunn at pobox.com