PDF parser
David Boddie
davidb at mcs.st-and.ac.uk
Fri Jul 30 13:04:51 EDT 2004
Andreas Lobinger <andreas.lobinger at netsurf.de> wrote in message news:<ced4pv$m5d$1 at news.mch.sbs.de>...
> Can we make a list here, how many people have started writing
> pdf processing SW?
Time to jump in here. I started one about three years ago:
http://www.boddie.org.uk/david/Projects/Python/pdftools/
It still isn't finished...
> I've been working for ~20 months now (I do this for enjoyment, not for a
> specific task) on a low-level PDF library that simply reads and
> parses a .pdf into Python data types and also writes the in-memory
> data back to a file. And *yes*, there is an intent to go public.
I didn't get to the part where I could write the data out to a file.
My original aim was to be able to view files rather than modify them.
> A few observations from working on the library show that the
> problem is not getting, e.g., a PDF tokenizer running (and running fast);
> the real problem is understanding PDF as a structured document
> in depth. I changed some code back and forth because I wasn't sure
> which in-memory representation would fit (e.g. a .pdf is simply a collection
> of objects: is it a list, or a dict with the object number as key?).
You should think of it as a dictionary, I believe. I don't think that
the order of objects in the file really matters; just the order in the
list of used objects in the file's trailer.
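To make the dictionary idea concrete, here is a rough sketch of what I mean.
It is not code from pdftools: the names collect_objects and OBJ_RE are made up
for illustration, and I am assuming the simplest possible input, with no
streams (whose data could itself contain the text "endobj") and no compressed
objects. A real reader would normally find objects through the cross-reference
table rather than by scanning the file like this.

```python
import re

# Matches "<number> <generation> obj ... endobj".
# Hypothetical, simplified pattern: it ignores streams and comments.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def collect_objects(data):
    """Return a dict mapping (object number, generation) to the raw body."""
    objects = {}
    for match in OBJ_RE.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = match.group(3).strip()
    return objects


# Object 2 deliberately appears before object 1 in the "file".
sample = (
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
)

objects = collect_objects(sample)
catalog = objects[(1, 0)]  # lookup by number, regardless of file position
```

Keying on the object number together with the generation number means a
reference such as "2 0 R" can be resolved with a single lookup, wherever the
object happens to sit in the file.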
Good luck,
David