Analysis of PDF (or EPS?)

Johan Holst Nielsen johan at weknowthewayout.com
Fri Nov 21 07:10:41 EST 2003


David Boddie wrote:
>>>Are there any Python packages to analyse or get some information out of
>>>a PDF document...
>>>
>>>Like where the text is placed - what text is placed - fonts, embedded
>>>PDFs/fonts/images etc.
> 
> It depends on the type of images (bitmap vs. vector).

Yes, I know - but the vector-based images should be extracted just as they
are, and bitmaps as self-contained files :=)

> 
>>IIRC you can get the full specs of PDF and EPS at the Adobe site.
> 
> The full PDF specification is not exactly short, but it's fairly readable.

Yep... I tried it... but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too ;)

> 
>>Some stuff is easy to get at, some may be compressed and/or encrypted,
>>and not so easy.
> 
> Although the FlateDecode compression format is straightforward with existing
> libraries, some of the other compression techniques may be less accessible.

Well, compression/encryption is no problem. It is for an internal
application - so people simply must NOT encrypt or secure the documents.
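
For the FlateDecode case mentioned above, something like the following might
be enough as a starting point. It is only a rough sketch using the standard
zlib module: the function name inflate_streams is made up for illustration,
and it assumes unencrypted files where the stream data uses plain FlateDecode
with no predictor or chained filters. It also finds streams by searching for
the keywords instead of following the /Length entry, which a real parser
should do.

```python
import re
import zlib

# Grab everything between the "stream" keyword (plus its end-of-line)
# and the next "endstream". Assumption: the keywords do not occur
# inside the stream data itself.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of all streams zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # decompressobj() tolerates the end-of-line bytes that sit
        # between the compressed data and "endstream".
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            # Not zlib data (another filter, or uncompressed): skip it.
            continue
    return results
```

Calling inflate_streams(open("some.pdf", "rb").read()) would then give a list
of byte strings, some of which should be page content with the text operators
in them.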

>>Conforming docs are supposed to be structured so that it is relatively easy
>>to grab chunks of document and do the kinds of things printing business s/w does,
>>like rotating and scaling and reordering pages, etc.
> 
> I have a Python library which is able to identify a lot of the structure in simple
> documents, including basic text extraction, but I've become pretty disillusioned
> with it because so much work is required to extract more complex information.
> 
> Maybe it's time to stick a license on it and upload it somewhere.

Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)
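
To make clear what kind of structure information I am after, here is a very
rough sketch using only the re module. It is not how the library above works
(I have not seen it); the name summarize_objects is invented, and it assumes
an uncompressed, unencrypted file where the objects appear as plain
"N G obj ... endobj" text. It simply lists each object number with its /Type
and /Subtype, which is already enough to spot fonts and images.

```python
import re

# "12 0 obj ... endobj": object number, generation number, body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")
SUBTYPE_RE = re.compile(rb"/Subtype\s*/(\w+)")


def _name(match):
    """Decode a matched PDF name, or return None if nothing matched."""
    return match.group(1).decode("ascii") if match else None


def summarize_objects(pdf_bytes):
    """Return (object number, /Type, /Subtype) for each object found."""
    summary = []
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(3)
        summary.append((
            int(obj.group(1)),
            _name(TYPE_RE.search(body)),
            _name(SUBTYPE_RE.search(body)),
        ))
    return summary
```

Filtering the result for a /Type of Font, or a /Subtype of Image, gives the
embedded fonts and bitmap images, at least for simple documents.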

Regards,
Johan




