Determining when a file is an Open Office Document

Steven D'Aprano steve at REMOVE.THIS.cybersource.com.au
Fri Jan 19 23:10:36 EST 2007


On Fri, 19 Jan 2007 12:48:14 -0800, Ross Ridge wrote:

> tubby wrote:
>> Now, if only I could do something like that on PDF files :)
> 
> PDF files should begin with "%PDF-" followed by a version number, e.g.
> "%PDF-1.4".  The PDF Reference notes that Adobe Acrobat Reader is a bit
> more flexible about what it will accept:
> 
>     13. Acrobat viewers require only that the header appear
>           somewhere within the first 1024 bytes of the file.
>     14. Acrobat viewers also accept a header of the form
>           %!PS-Adobe-N.n PDF-M.m
> 
> So identifying PDF files is pretty easy.

Sure. MIS-identifying PDF files is pretty easy. Identifying them is not.
Consider this example:

$ cat not_a_pdf
%PDF-1.4
This is not a pdf file.
$ file not_a_pdf
not_a_pdf: PDF document, version 1.4
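
The same thing happens if you write the check yourself. Here is a minimal
sketch in Python of the header rule quoted above (header somewhere in the
first 1024 bytes, either form). The function name looks_like_pdf and the
exact regular expression are my own assumptions, not anything taken from
the PDF Reference or from file(1):

```python
import re

# Either "%PDF-M.m" or "%!PS-Adobe-N.n PDF-M.m", as in the notes quoted above.
# The digit patterns are an assumption about what a version number looks like.
PDF_HEADER = re.compile(rb"%PDF-\d+\.\d+|%!PS-Adobe-\d+\.\d+ PDF-\d+\.\d+")

def looks_like_pdf(data: bytes) -> bool:
    """Return True if a PDF-style header appears in the first 1024 bytes."""
    return PDF_HEADER.search(data[:1024]) is not None

# The contents of not_a_pdf from the example above.
fake = b"%PDF-1.4\nThis is not a pdf file.\n"
print(looks_like_pdf(fake))  # True, even though this is not a PDF
```

The check only answers "does this start like a PDF?", which is a much
weaker question than "is this a PDF?".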

Is there a security vulnerability buried in the detection of file types by
magic bytes? I don't know, but I wouldn't be surprised if there were.

Here's another example:

$ cat not_a_gif.txt
GIF89a is the header used to define a GIF file.
$ file not_a_gif.txt
not_a_gif.txt: GIF image data, version 89a, 26912 x 8307
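
Those dimensions are not random. In a GIF file the six-byte signature is
followed by the width and height as two little-endian 16-bit integers, so
the four characters " is " get read as numbers. A rough sketch in Python
(the function name gif_header is hypothetical, and this reads only the
first ten bytes, nothing more):

```python
import struct

def gif_header(data: bytes):
    """Return (version, width, height) if data starts with a GIF signature,
    otherwise None. Nothing beyond the first ten bytes is examined."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Width and height: two unsigned 16-bit little-endian integers.
    width, height = struct.unpack("<HH", data[6:10])
    return data[3:6].decode("ascii"), width, height

# The contents of not_a_gif.txt from the example above.
text = b"GIF89a is the header used to define a GIF file.\n"
print(gif_header(text))  # ('89a', 26912, 8307)
```

The bytes for " i" are 0x20 0x69, which is 0x6920 = 26912 when read
little-endian, and "s " is 0x73 0x20, which is 0x2073 = 8307.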

Any file system that doesn't store file type metadata is reduced to
guessing the type of the file, and guesses can be wrong. As heuristics go,
"look at the characters after the dot in the file name" is not that much
worse than "look at the bytes at offset X through Y inside the file", and
it has the significant advantage that it is visible to the end user and
easy for them to change.
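
For comparison, here is the file name heuristic as a short Python sketch
using the standard library's mimetypes module. The wrapper guess_by_name
is a hypothetical helper of mine. Note that it never opens the file, so
renaming the file is all it takes to change the answer:

```python
import mimetypes

def guess_by_name(filename: str):
    """Guess a MIME type from the file name alone; the file is never opened."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime

print(guess_by_name("not_a_gif.txt"))  # guess based on ".txt"
print(guess_by_name("not_a_gif.gif"))  # same contents, different guess
```

That is just as much a guess as the magic bytes are, but at least the user
can see what the guess is based on.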



-- 
Steven.