PDF parser

Andreas Lobinger andreas.lobinger at netsurf.de
Fri Jul 30 05:35:58 EDT 2004


Aloha,

Radovan Garabik wrote:
> Christian Tismer <tismer at stackless.com> wrote:
>>need the bytecodehacks. I am writing a sophisticated package
>>which involves parsing of PDF files, and I want to do it all in
>>Python. In order to get this PDF processor to almost C speed,

If you need hacks to get reasonable speed when fully parsing PDF,
your algorithms are not designed very efficiently...
BTDT (been there, done that).
Even Acrobat sometimes takes seconds to read a file.

> What license is your pdf parser going to have?
> Do you have a working version?
> My work plans include writing a pdf parser (and I prefer python for it),
> if your package is going to be open source

Can we make a list here of how many people have started writing
PDF processing software?

I have now been working for ~20 months (I do this for enjoyment, not
for a specific task) on a low-level PDF library that simply reads and
parses a .pdf into Python data types and also writes the in-memory
representation back to a file. And *yes*, there is an intent to go public.
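
To make "parses a .pdf into Python data types" a bit more concrete, here
is a minimal sketch. It is not code from the library mentioned above; the
names TOKEN_RE, tokenize and parse_value are made up for illustration. It
assumes a small subset of PDF syntax only: no streams, no comments, no hex
strings, no indirect references, and no nested or escaped parentheses in
literal strings. Mapping names to str and strings to bytes is one possible
choice, not the only one.

```python
import re

# One alternative per token kind. Assumption: only a subset of PDF syntax.
TOKEN_RE = re.compile(rb"""
    (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<arr_open>\[)  | (?P<arr_close>\])
  | (?P<string>\([^()]*\))
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<real>[+-]?(?:\d+\.\d*|\.\d+))
  | (?P<int>[+-]?\d+)
  | (?P<keyword>[A-Za-z]+)
""", re.VERBOSE)

KEYWORDS = {b"true": True, b"false": False, b"null": None}
_END = object()  # internal marker for a closing ']' or '>>'


def tokenize(data):
    """Yield (kind, raw_bytes) pairs from a bytes object."""
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(data, pos)
        if m is None:
            raise ValueError(f"unexpected byte at offset {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


def parse_value(tokens):
    """Build one Python value from a token iterator.

    Mismatched closing delimiters are not detected in this sketch.
    """
    kind, text = next(tokens)
    if kind in ("arr_close", "dict_close"):
        return _END
    if kind == "int":
        return int(text)
    if kind == "real":
        return float(text)
    if kind == "name":
        return text[1:].decode("ascii")   # PDF name -> str
    if kind == "string":
        return text[1:-1]                 # PDF string -> bytes
    if kind == "keyword":
        if text not in KEYWORDS:
            raise ValueError(f"unsupported keyword {text!r}")
        return KEYWORDS[text]
    if kind == "arr_open":
        items = []
        while (item := parse_value(tokens)) is not _END:
            items.append(item)
        return items
    # Remaining case: dict_open. Keys and values alternate until '>>'.
    result = {}
    while (key := parse_value(tokens)) is not _END:
        result[key] = parse_value(tokens)
    return result


page = parse_value(tokenize(b"<< /Type /Page /MediaBox [0 0 612 792] >>"))
```

With this, page is an ordinary Python dict whose "MediaBox" entry is a
list of ints, so the usual dict and list operations apply.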

A few observations from the work on the library show that the problem
is not getting e.g. a PDF tokenizer running (and running fast); the
real problem is understanding PDF as a structured document in depth.
I changed some code back and forth because I wasn't sure which
in-memory representation would fit. (E.g. a .pdf is simply a collection
of objects: is it a list, or a dict with the object number as key?)
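
The list-or-dict question can be sketched as follows. This is only an
illustration under assumptions, not the design of the library described
here: Ref, resolve and as_list are hypothetical names, the sample objects
are invented, and generation numbers are stored but ignored on lookup.
Since object numbers in a PDF need not be contiguous, the dict handles
gaps naturally, while the list needs placeholder entries.

```python
from collections import namedtuple

# Hypothetical stand-in for an indirect reference such as "2 0 R".
Ref = namedtuple("Ref", ["num", "gen"])

# Dict representation: object number -> parsed object (invented sample data).
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2, 0)},
    2: {"Type": "Pages", "Kids": [Ref(7, 0)], "Count": 1},
    7: {"Type": "Page", "Parent": Ref(2, 0)},
}


def resolve(objects, value):
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, Ref):
        value = objects[value.num]  # assumption: generation is ignored
    return value


def as_list(objects):
    """List representation: index is the object number, None marks a gap."""
    table = [None] * (max(objects) + 1)
    for num, obj in objects.items():
        table[num] = obj
    return table


pages = resolve(objects, objects[1]["Pages"])
table = as_list(objects)
```

Both forms give direct lookup by object number; the dict stays compact
when numbers are sparse, while the list keeps the objects in numeric
order without sorting.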

Wishing a happy day
		LOBI


