PDF to text script

Nick Vatamaniuc vatamane at gmail.com
Fri Nov 10 22:28:10 EST 2006


Vyz wrote:
> I am looking for a PDF to text script. I am working with multibyte
> language PDFs on Windows XP. I need to batch convert them to text and
> feed them into an encoding converter program.
>
> Thanks for any help in this regard

Multibyte languages are not easy.  I do text extraction from PDFs, but
1) I do it on Linux and 2) I only need English text. The utility I use
is pdftotext, which comes as part of the Xpdf package for *nix.
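
For batch conversion, a small Python wrapper around pdftotext might look
like the sketch below. It assumes pdftotext is installed and on the PATH;
the function names and the directory layout are made up for illustration,
and whether "-enc UTF-8" gives usable output for a particular multibyte
PDF depends on the fonts and encodings inside that file.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, encoding="UTF-8"):
    """Build the pdftotext command line for one file (hypothetical helper)."""
    # -enc selects the encoding of the text that pdftotext writes out.
    return ["pdftotext", "-enc", encoding, str(pdf_path), str(txt_path)]


def convert_directory(src_dir, dst_dir, run=subprocess.run):
    """Convert every *.pdf in src_dir to a .txt file in dst_dir.

    `run` is injectable so the loop can be exercised without pdftotext.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        txt = dst / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        run(pdftotext_command(pdf, txt), check=True)
        converted.append(txt)
    return converted
```

Calling convert_directory("pdfs", "text") would then leave one .txt file
per PDF in the "text" directory, ready to feed to the next program.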

The other problem, however, is not with the extraction itself but with
the fact that the extracted text might not look very good.  In other
words, the extraction program will never complain but may nevertheless
produce garbage, and then you have to process the result yourself.  For
example, whitespace is not consistent: sometimes there will be extra
whitespace and sometimes there won't be enough, for example " S o m  e
  w ordsloo l i k e t his" and so on...
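
As a rough illustration of that post-processing, the sketch below
collapses extra whitespace and flags lines that look letter-spaced so
they can be checked by hand. Both function names and the 0.5 threshold
are assumptions chosen for the example. It does not try to re-join
broken words or split run-together ones, which is the hard part.

```python
import re


def normalize_whitespace(line):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()


def looks_letter_spaced(line, threshold=0.5):
    """Heuristic: True if at least `threshold` of the tokens are one character."""
    tokens = line.split()
    if not tokens:
        return False
    singles = sum(1 for token in tokens if len(token) == 1)
    return singles / len(tokens) >= threshold


def suspicious_lines(text):
    """Return (line_number, cleaned_line) pairs that need a manual look."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if looks_letter_spaced(line):
            flagged.append((number, normalize_whitespace(line)))
    return flagged
```

A line like the garbled example above is mostly one-character tokens, so
it gets flagged, while ordinary English sentences generally do not.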

The real answer is that PDF text extraction is pretty hard. It is
1000x better to get hold of the original source...

Nick V.




More information about the Python-list mailing list