Identifying File type by reading files

Fri Dec 26 20:57:38 EST 2003

WOW. That's a great site. Thanks for the info!!!

Robin Munn <rmunn at pobox.com> wrote in message news:<fb3Hb.15500$aw2.8657450 at newssrv26.news.prodigy.com>...
> hokiegal99 <hokiegal99 at hotmail.com> wrote:
> > Are there certain sets of binary data that are unique to a file type
> > and would be a better way of identifying it? For example, on line N
> > of an MS doc file, beginning at position X, a binary string that is L
> > digits in length and that begins with B and ends with E will *ALWAYS* be
> > present... someone tell me that I'm not dreaming and that something
> > like the above example exists???
> >
> > A few of my string searches today:
> >
> > # root and fname come from the surrounding code (not shown).
> > # Read the file once; each find() gives the match offset, or -1.
> > with open(os.path.join(root, fname), 'rb') as f:
> >     data = f.read()
> > doc = data.find(b'Word.Document.')
> > xls = data.find(b'Excel.Sheet.')
> > pdf = data.find(b'PDF-1.')
> > jpg = data.find(b'JFIF')
> >
> > Any suggestions or information that better describes how to positively
> > ID files without the possibility of mistake would be very helpful to me.
> > As of now, some of my files, though not many (~2%), will be given the
> > wrong extension. However, the functions append every extension that
> > probably applies to the file, so at that point it is a simple process
> > of elimination to determine which extension is actually the correct
> > one. Normally, I never have more than 2 unique extensions attached to
> > the same file.
> 
> Glutton for punishment, aren't you? :-)
> 
> Seriously, that is a non-trivial problem. If that's what you're trying
> to do, though, the file format documentation at http://www.wotsit.org/
> may be useful to you. Good luck!
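
For illustration, fixed-position signatures of the kind asked about in the thread do exist for several formats: a PDF starts with the bytes %PDF-, a JPEG starts with FF D8 FF, and the OLE2 container used by older Word and Excel files starts with D0 CF 11 E0 A1 B1 1A E1. Here is a minimal sketch of checking them. The names SIGNATURES, sniff_header and sniff_file are made up for this example, and the table is far from complete. Note that the OLE2 signature alone cannot tell a .doc from an .xls, so something like the 'Word.Document.' / 'Excel.Sheet.' search would still be needed after it matches.

```python
# Leading-byte signatures ("magic numbers"). The byte values are well
# known; the table layout and labels are assumptions for this sketch.
SIGNATURES = [
    (b'%PDF-', 'pdf'),
    (b'\xff\xd8\xff', 'jpg'),
    # OLE2 compound file: container shared by legacy .doc and .xls
    (b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1', 'ole2'),
]

def sniff_header(data):
    """Return the label of the first signature that data starts with, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None

def sniff_file(path):
    """Read only the first few bytes of path and classify them."""
    with open(path, 'rb') as f:
        return sniff_header(f.read(16))
```

Checking only the start of the file avoids reading every file in full, and it avoids false hits from a string like 'JFIF' that happens to appear somewhere inside an unrelated file. A result of None means only that none of the listed signatures matched, not that the file type is unknown in general.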