Determining when a file is an Open Office Document

Steven D'Aprano steve at REMOVE.THIS.cybersource.com.au
Fri Jan 19 07:00:45 EST 2007


On Fri, 19 Jan 2007 12:22:04 +1100, Ben Finney wrote:

> tubby <tubby at bandaheart.com> writes:
> 
>> Silly question, but here goes... what's a good way to determine when
>> a file is an Open Office document? I could look at the file
>> extension, but it seems there would be a better way.
> 
> Yes, the name of a file may be useful for communicating with humans
> about that file's intended use, but is a lousy, unreliable way to make
> a definite statement about the actual contents of the file.
> 
> The Unix 'file' command determines the type of a file by its contents,
> not its name. This functionality is essentially a database of "magic"
> byte patterns mapping to file types, 

Ah, another lousy, unreliable way to make a definite statement about the
actual contents of a file. Looking at magic bytes inside a file is hardly
bullet-proof (although file seems to be moderately reliable in practice,
at least under Linux).

Simple example: is the file consisting of two bytes "\x09\x0A" meant to be a
text file with a tab and a newline, or a binary file consisting of a
single two-byte int? There's no way to tell just from the contents.
It's a circular problem: to be sure what the file is ("it's a two-byte
int") one has to understand the contents ("the integer 2314", if you read
it as big-endian) -- but you can only understand the contents if you know
what the file is.
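
To make the ambiguity concrete, here is a minimal Python sketch that reads
the same two bytes both ways. The variable names are mine, and the choice
of an unsigned 16-bit integer is an assumption; a signed int, or a
different byte order, would be equally defensible guesses.

```python
import struct

data = b"\x09\x0A"  # the two bytes from the example above

# Interpretation 1: ASCII text, a tab followed by a newline.
as_text = data.decode("ascii")

# Interpretation 2: a single unsigned 16-bit integer. Even here the
# byte order is a guess, so unpack it both ways.
(as_big_endian,) = struct.unpack(">H", data)
(as_little_endian,) = struct.unpack("<H", data)

print(repr(as_text), as_big_endian, as_little_endian)
```

Nothing in the bytes themselves says which of the three readings the
file's creator intended.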

There are only two ways out of this vicious circle: 

(1) Have the creator of the file unambiguously label it. Some file systems
associate file-type metadata with files (e.g. the classic Apple Macintosh
did that), but sadly the main file systems in use today do not.

(2) Make an educated guess from various heuristics and conventions. The
old DOS 8.3 naming system is one such convention, and modern operating
systems tend to follow it. The Unix "file" utility's database of magic
bytes is such a heuristic.
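
For the original question, a heuristic of the second kind might look like
the sketch below. It relies on the OpenDocument packaging convention: the
file is a ZIP archive holding an entry named "mimetype" whose content
starts with "application/vnd.oasis.opendocument.". The function name is
hypothetical, and the older OpenOffice.org 1.x formats are not covered.

```python
import zipfile

ODF_PREFIX = b"application/vnd.oasis.opendocument."

def looks_like_opendocument(path):
    """Guess whether path is an OpenDocument file; a heuristic, not a proof."""
    try:
        with zipfile.ZipFile(path) as archive:
            mimetype = archive.read("mimetype")
    except (OSError, KeyError, zipfile.BadZipFile):
        # Unreadable, not a ZIP archive, or no "mimetype" entry.
        return False
    return mimetype.strip().startswith(ODF_PREFIX)
```

It is still only an educated guess: anyone can build a ZIP archive with
such an entry, so a True result means "labelled like an OpenDocument
file", not "definitely one".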


-- 
Steven.



