PDF parser
Andreas Lobinger
andreas.lobinger at netsurf.de
Fri Jul 30 05:35:58 EDT 2004
Aloha,
Radovan Garabik wrote:
> Christian Tismer <tismer at stackless.com> wrote:
>>need the bytecodehacks. I am writing a sophisticated package
>>which involves parsing of PDF files, and I want to do it all in
>>Python. In order to get this PDF processor to almost C speed,
If you need hacks to get reasonable speed for fully parsing PDF,
your algorithms are probably not designed very efficiently...
BTDT
Even Acrobat sometimes takes seconds to read a file.
> What license is your pdf parser going to have?
> Do you have a working version?
> My work plans include writing a pdf parser (and I prefer python for it),
> if your package is going to be open source
Can we make a list here of how many people have started writing
PDF processing software?
I have been working for ~20 months now (I do this for enjoyment, not
for a specific task) on a low-level PDF library that simply reads and
parses a .pdf into Python data types and also writes the in-memory
representation back to a file. And *yes*, there is an intent to go public.
A few observations from the work on the library show that the
problem is not getting e.g. a PDF tokenizer running (and running fast);
the real problem is understanding PDF as a structured document
in depth. I changed some code back and forth because I wasn't sure
which in-memory representation would fit. (E.g. a .pdf is simply a
collection of objects: is it a list, or a dict with the object number
as key?)
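To make the list-or-dict question concrete, here is a minimal sketch
in Python. It is not code from the library; the names ObjRef and
load_objects are made up for illustration, and the regex scan is a toy
that ignores streams, comments and the xref table. It only shows that
a dict keyed by (object number, generation) copes with sparse object
numbers without padding.

```python
import re
from typing import Dict, NamedTuple


class ObjRef(NamedTuple):
    """Key for an indirect object, e.g. '7 0 obj' (hypothetical helper)."""
    num: int
    gen: int


# Toy pattern: "<num> <gen> obj ... endobj". A real parser has to handle
# stream data that may itself contain the bytes "endobj".
_OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def load_objects(data: bytes) -> Dict[ObjRef, bytes]:
    """Collect raw object bodies into a dict keyed by (number, generation).

    If the same key appears twice, the later definition overwrites the
    earlier one (an assumption of this sketch, not a full update model).
    """
    objects: Dict[ObjRef, bytes] = {}
    for m in _OBJ_RE.finditer(data):
        key = ObjRef(int(m.group(1)), int(m.group(2)))
        objects[key] = m.group(3).strip()
    return objects


# Object numbers need not be contiguous: here 1, 2 and 7.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
    b"7 0 obj\n(hello)\nendobj\n"
)

objects = load_objects(SAMPLE)
for ref in sorted(objects):
    print(ref.num, ref.gen, objects[ref])
```

With a list indexed by object number, the slots 3 to 6 would have to
be filled with placeholders, and the generation number would need a
home somewhere else; the dict avoids both, at the price of losing the
file order of the objects.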
Wishing a happy day
LOBI