Identifying File type by reading files

Andrew Dalke adalke at mindspring.com
Fri Dec 26 14:46:08 EST 2003


hokiegal99:
> what should I look for in a file to determine whether or not it is a
> MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
> list of some of the strings I use to ID files, but I can't help but
> wonder that there must be a more precise way of doing this. I know of
> the Unix 'file' command. It is not very useful for me as it doesn't
> distinguish between MS Office documents... all .xls, .docs, .ppts are
> MS documents to it.

That likely means you have an incomplete 'magic' file.  This is the
file used by the 'file' command to figure out the file type.  Take a
look at http://www.unixhideout.com/freebsd/share/misc/magic for
a more complete (I think) version.

That's dated 1995 and is close to the one on my Mac.  It doesn't support
the newer MS Word and Excel formats.  I'm having trouble
finding the most recent, definitive version.  One link pointed me
to ftp://ftp.astron.com/pub/file/ but I haven't investigated it further.

There's also pymagic, http://thomas.mangin.me.uk/software/python.html
which may help if you want a pure Python implementation of 'file'.
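
To give a rough idea of what a 'magic' lookup does, here is a minimal
sketch in Python.  It is not taken from any magic file or from pymagic;
the function names and the small signature table are my own, for
illustration only.  It assumes two well-known signatures: PDF files start
with "%PDF-", and the older MS Office formats share the OLE2 compound
document header, so telling Word from Excel means looking further in for
the name of the main stream, which is stored as UTF-16LE.  Scanning the
whole file for that name is a crude heuristic (a Word file with an
embedded spreadsheet contains both names), not a real parse of the OLE2
directory.

```python
# Leading-byte signatures ("magic numbers").  Illustrative, not exhaustive.
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Main stream names found in the directory of an OLE2 file, checked in
# this order.  The order is an arbitrary choice made for this sketch.
OLE2_STREAMS = [
    ("WordDocument", "MS Word document"),
    ("PowerPoint Document", "MS PowerPoint presentation"),
    ("Workbook", "MS Excel workbook"),   # Excel 97 and later
    ("Book", "MS Excel workbook"),       # Excel 5/95
]


def guess_type_from_bytes(data):
    """Guess a file type from raw file contents (heuristic)."""
    if data.startswith(PDF_MAGIC):
        return "PDF document"
    if data.startswith(OLE2_MAGIC):
        # Stream names are stored as UTF-16LE, so search for that encoding.
        for name, label in OLE2_STREAMS:
            if name.encode("utf-16-le") in data:
                return label
        return "OLE2 compound document (unknown application)"
    return "unknown"


def guess_file_type(path):
    """Read the file at 'path' in binary mode and guess its type."""
    with open(path, "rb") as f:
        return guess_type_from_bytes(f.read())
```

Calling guess_file_type() on a path returns one of the labels above, or
"unknown" when neither signature matches.  A real magic file covers far
more formats and checks specific offsets rather than scanning everything.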

                    Andrew
                    dalke at dalkescientific.com





