Identifying File type by reading files

Gabriel Genellina gagenellina at softlab.com.ar
Fri Dec 26 21:17:02 EST 2003


At 26/12/2003 10:55, you wrote:

>I have some functions that search for files that contain certain
>strings and if the files found to have these strings do not already
>have a filename extension (such as '.doc' or '.xls') the function will
>append that to the files and rename them. So, if a file named 'report'
>was found to have the string 'Microsoft' and the string
>'Word.Document.' (notice the '.' at the end of both words) and it does
>not already have an extension, then a rename would take place that
>would name the file 'report.doc'
>
>These functions work very well on most files (98% guessed correctly).
>However, I would like the functions to be more precise (100%). So,
>what should I look for in a file to determine whether or not it is a
>MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
>list of some of the strings I use to ID files, but I can't help but
>wonder that there must be a more precise way of doing this. I know of
>the Unix 'file' command. It is not very useful for me as it doesn't
>distinguish between MS Office documents... all .xls, .docs, .ppts are
>MS documents to it.

The various Office applications *used* to store data in OLE2 Compound 
Document format. If your program is running in Windows you could try 
IStorage & Co. to read and detect document types (more reliable than 
detecting strings and magic numbers).
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stg/stg/istorage.asp
The binary format is documented somewhere inside MSDN, so you could use a
magic-number approach too.
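
As a rough sketch of the magic-number approach (not a full parser): every
OLE2 compound file starts with the same 8-byte signature, and the names of
the streams inside it are stored as UTF-16LE text, so you can scan for the
main stream name each application uses. The function names and the
name-to-extension table below are my own for illustration, and the scan is
only a heuristic: a workbook with an embedded Word object, for instance,
would be reported as '.doc'.

```python
# Signature found in the first 8 bytes of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Main stream name used by each application, checked in this order.
# This table is an illustrative heuristic, not a complete list.
STREAM_HINTS = [
    ("WordDocument", ".doc"),
    ("PowerPoint Document", ".ppt"),
    ("Workbook", ".xls"),
    ("Book", ".xls"),  # older Excel 5/95 files
]


def guess_office_extension(data):
    """Return '.doc', '.ppt', '.xls', or None for the given file contents."""
    if not data.startswith(OLE2_MAGIC):
        return None
    for stream_name, extension in STREAM_HINTS:
        # Stream names are stored as UTF-16LE in the compound file directory.
        if stream_name.encode("utf-16-le") in data:
            return extension
    return None


def guess_office_extension_from_file(path):
    """Convenience wrapper (hypothetical helper) that reads the whole file."""
    with open(path, "rb") as f:
        return guess_office_extension(f.read())
```

Walking the real directory structure (IStorage on Windows, or a compound
file parser elsewhere) would be more reliable than this byte scan.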

But now I've been told (I don't use them) that Office 2003 (maybe Office
2000 too?) stores documents in XML format, and Word sometimes uses RTF too
(even with a .doc extension), so you should check for those as well.
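
A minimal sketch of those two extra checks, assuming the file contents are
already in memory: RTF files begin with the characters {\rtf, and an XML
file normally begins with an <?xml declaration, possibly after a UTF-8 byte
order mark. The function name is hypothetical, and this does not handle
UTF-16 encoded XML or tell you which application wrote the XML.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_text_format(data):
    """Return 'rtf', 'xml', or None based on the first bytes of the file."""
    # RTF documents start with the literal characters {\rtf
    if data.startswith(b"{\\rtf"):
        return "rtf"
    head = data[:64]
    # Skip an optional UTF-8 byte order mark before looking for the XML
    # declaration.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return None
```

A file named 'report.doc' whose contents start with {\rtf would come back
as 'rtf' here, which is the case mentioned above.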


Gabriel Genellina
Softlab SRL





