reading text in pdf, some working sample code

dieter dieter at handshake.de
Wed Nov 22 02:37:20 EST 2017


Daniel Gross <grossd18 at gmail.com> writes:
> I am new to python and jumped right into trying to read out (english) text
> from PDF files.
>
> I tried various libraries (including slate)

You could give "pdfminer" a try.
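
As a rough starting point, something like the following might work. It is only a sketch and assumes the maintained "pdfminer.six" fork is installed, which provides `extract_text` in `pdfminer.high_level`; the helper name `extract_pdf_text` and the file name in the usage note are made up for illustration.

```python
def extract_pdf_text(path):
    """Return whatever text pdfminer can recover from the PDF at `path`."""
    # Assumption: the pdfminer.six fork is installed.
    # Imported lazily so this module still loads where it is absent.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

You would call it as `print(extract_pdf_text("document.pdf"))`, with "document.pdf" replaced by your own file.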

Note, however, that it may not be possible to extract the text:
PDF is a generic format which renders text by mapping character codes
to glyphs (i.e. visual symbols) of a font. If your PDF uses a special
map for this (especially with non-standard glyph collections, aka "fonts"),
then the text extraction, which in fact extracts sequences of
character codes, can give unusable results.
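
To illustrate the problem, here is a toy model in plain Python. It is not how PDF or pdfminer actually work internally; the function names and the numbering scheme (codes 1, 2, 3, ... in order of first appearance) are assumptions chosen only to show why bare character codes can be useless without a table that maps them back to characters.

```python
# Toy model only: real PDF fonts and content streams are far more involved.

def encode_with_subset_font(text):
    """Number characters 1, 2, 3, ... in order of first appearance,
    roughly as an embedded subset font might number its glyphs."""
    code_for_char = {}
    codes = []
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = len(code_for_char) + 1
        codes.append(code_for_char[ch])
    # The reverse table stands in for an optional code-to-character map.
    to_unicode = {code: ch for ch, code in code_for_char.items()}
    return codes, to_unicode


def extract_naively(codes):
    """Treat each character code as if it were a Unicode code point."""
    return "".join(chr(c) for c in codes)


def extract_with_map(codes, to_unicode):
    """Translate each code through the code-to-character table."""
    return "".join(to_unicode[c] for c in codes)


codes, to_unicode = encode_with_subset_font("Hello")
print(codes)                                # the codes stored in the "document"
print(repr(extract_naively(codes)))         # control characters, not text
print(extract_with_map(codes, to_unicode))  # readable only with the table
```

If the document carries no such table, an extractor is left with only the first two outputs, and that is the case where you get unusable results.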



