PDF parser

Fri Jul 30 13:04:51 EDT 2004

Andreas Lobinger <andreas.lobinger at netsurf.de> wrote in message news:<ced4pv$m5d$1 at news.mch.sbs.de>...

> Can we make a list here, how many people have started writing
> pdf processing SW?

Time to jump in here. I started one about three years ago:

http://www.boddie.org.uk/david/Projects/Python/pdftools/

It still isn't finished...

> I've been working for ~20 months now (I do this for enjoyment, not
> for a specific task) on a low-level PDF library that simply reads and
> parses a .pdf into Python data types and also writes
> memory back to a file. And *yes*, there is an intent to go public.

I didn't get to the part where I could write the data out to a file.
My original aim was to be able to view files rather than modify them.

> A few observations from the work on the lib show that the problem is
> not getting, e.g., a PDF tokenizer running (and running fast);
> the real problem is understanding PDF as a structured document
> in depth. I changed some code back and forth because I wasn't sure
> what in-memory representation would fit (e.g. a .pdf is simply a
> collection of objects: is it a list, or a dict with the object number
> as key?).

You should think of it as a dictionary, I believe. I don't think that
the order of objects in the file really matters; what matters is the
cross-reference table that the file's trailer points to, which records
where each numbered object is located.
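
To make the dictionary idea concrete, here is a minimal sketch. It is not
code from pdftools or from Andreas's library; the names Ref, objects,
resolve and page_count are hypothetical, and the objects are written by
hand as plain Python values rather than parsed from a real file. It assumes
objects are keyed by object number plus generation number, since an
indirect reference such as "3 0 R" carries both.

```python
from typing import Any, Dict, NamedTuple


class Ref(NamedTuple):
    """An indirect reference such as `3 0 R` (hypothetical helper type)."""
    num: int
    gen: int = 0


# Hand-written stand-ins for parsed objects, keyed by (number, generation).
# They are deliberately listed out of numeric order: lookups go by key,
# so the order in which objects were read does not matter.
objects: Dict[Ref, Any] = {
    Ref(3): {"Type": "Page", "Parent": Ref(2)},
    Ref(1): {"Type": "Catalog", "Pages": Ref(2)},
    Ref(2): {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
}

# Stand-in for the trailer dictionary, which names the root object.
trailer = {"Root": Ref(1), "Size": 4}


def resolve(value: Any) -> Any:
    """Follow indirect references until a direct value is reached."""
    while isinstance(value, Ref):
        value = objects[value]
    return value


def page_count(trailer_dict: Dict[str, Any]) -> int:
    """Walk trailer -> catalog -> page tree root and read its Count."""
    catalog = resolve(trailer_dict["Root"])
    pages = resolve(catalog["Pages"])
    return resolve(pages["Count"])
```

With a list you would have to search for an object or rely on its position;
with a dict, resolve(Ref(2)) finds the page tree directly, whatever order
the objects appeared in.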

Good luck,

David