reading PDF using Python [Q]

Nick Moon ncmoon at cix.compulink.co.uk
Tue May 11 05:38:31 EDT 1999


> > I have been playing with parsing pdf files in python. The format
> > of .pdf is documented on Adobe's web site.
> 
> Any useful URL?

Try the Adobe site, www.adobe.com, but you knew that. The document you want 
is called 'Portable Document Format Reference Manual - Version 1.2', 
though I think Acrobat v4 means there is now a version 1.3. It's in 
(surprisingly) .pdf format, and it's big: about 400 pages when printed.

It is pretty unreadable, but it does describe the file format in 
mind-numbingly boring detail. The PDF format itself looks like the work of 
several different people over several different years: different bits of 
the format use rather different styles of data structures. 


> Do you know more about PDF encryption and compression?

PDF files have a general structure, something like: a header, a list of 
objects, a lookup table, and an end section. The lookup table is a list of 
byte offsets to each object. It allows a program to open the file from the 
end and then jump directly to each object as required. Updates can be 
appended to a file without changing any of its existing contents. An update 
consists of some objects plus a new lookup table and end section. 
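
As a rough sketch of that "open from the end, then jump" idea, something 
like the following might do. The keywords startxref, xref and trailer are 
what the Adobe manual calls the pieces of the lookup table and end section; 
the function names are my own invention. It assumes the whole file is 
already in memory as bytes, and it only reads the newest lookup table, so 
it does not chase back through earlier appended updates.

```python
def find_startxref(pdf: bytes) -> int:
    """Return the byte offset of the newest lookup (xref) table.

    The end section of the file holds the keyword 'startxref' followed
    by that offset, so only the tail of the file needs to be searched.
    """
    tail = pdf[-1024:]
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword near the end of the file")
    return int(tail[idx + len(b"startxref"):].split()[0])


def read_xref_table(pdf: bytes, offset: int) -> dict:
    """Parse the lookup table at `offset` into {object number: byte offset}.

    Assumes the plain-text table layout: the keyword 'xref', then one or
    more subsections, each a 'first count' line followed by `count`
    entries of the form 'offset generation flag'.
    """
    lines = pdf[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    offsets = {}
    i = 1
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        first, count = map(int, lines[i].split())
        i += 1
        for n in range(count):
            off, _generation, flag = lines[i + n].split()
            if flag == b"n":  # 'n' = object in use, 'f' = free entry
                offsets[first + n] = int(off)
        i += count
    return offsets
```

With the table in hand, pdf[offsets[2]:] would start at the text 
"2 0 obj", i.e. the beginning of object number 2.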

The actual page descriptions, which are probably what you want to look at, 
are stored in a stream, and the stream in turn sits inside an object. 
Streams may be written and read through various filters. A typical filter 
set would be:

ASCII85Decode / LZWDecode

which means the data has been compressed using LZW, and then the binary 
output of LZW has been turned into ASCII (base 85). To read it back you 
apply the filters in the order listed: ASCII85 first, then LZW.
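
A minimal sketch of undoing that filter pair in Python, under a few 
assumptions. The base 85 step leans on base64.a85decode from the standard 
library. The LZW step is hand-written and assumes what I understand to be 
the PDF defaults: codes start at 9 bits and grow to 12, code 256 clears the 
table, code 257 ends the data, and the code width goes up one code "early". 
The function names are made up, and this is not tested against real files.

```python
import base64

CLEAR_TABLE = 256  # LZW code that resets the string table
END_OF_DATA = 257  # LZW code that marks the end of the data


def ascii85_decode(data: bytes) -> bytes:
    """Undo ASCII85Decode: turn base 85 text back into binary."""
    data = data.strip()
    if data.startswith(b"<~"):
        data = data[2:]
    if data.endswith(b"~>"):  # end-of-data marker
        data = data[:-2]
    return base64.a85decode(data)


def lzw_decode(data: bytes) -> bytes:
    """Undo LZWDecode, assuming variable-width codes packed high bit first."""
    table = [bytes([i]) for i in range(256)] + [b"", b""]  # 256, 257 reserved
    out = bytearray()
    width, prev = 9, None
    bitbuf = nbits = 0
    for byte in data:
        bitbuf = (bitbuf << 8) | byte
        nbits += 8
        while nbits >= width:
            nbits -= width
            code = bitbuf >> nbits          # take the top `width` bits
            bitbuf &= (1 << nbits) - 1      # keep only the unread bits
            if code == CLEAR_TABLE:
                del table[258:]
                width, prev = 9, None
                continue
            if code == END_OF_DATA:
                return bytes(out)
            if code < len(table):
                entry = table[code]
                if prev is not None:
                    table.append(prev + entry[:1])
            elif code == len(table) and prev is not None:
                # Code not in the table yet: previous string + its first byte.
                entry = prev + prev[:1]
                table.append(entry)
            else:
                raise ValueError("invalid LZW code %d" % code)
            out += entry
            prev = entry
            # "Early change": widen one code before the table actually fills.
            size = len(table)
            width = 12 if size >= 2047 else 11 if size >= 1023 else 10 if size >= 511 else 9
    return bytes(out)


def decode_stream(data: bytes) -> bytes:
    """Apply the filters in the listed order: ASCII85Decode, then LZWDecode."""
    return lzw_decode(ascii85_decode(data))
```

For example, the codes 256, 45, 258, 258, 65, 259, 66, 257 packed at 9 bits 
each come to the nine bytes 80 0B 60 50 22 0C 0C 85 01 (hex), and 
lzw_decode turns those back into b"-----A---B".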


Cheers,



Nick.
 
 








