reading text in pdf, some working sample code

Paul Moore p.f.moore at gmail.com
Tue Nov 21 14:35:29 EST 2017


I haven't tried it, but a quick Google search found PyPDF2 -
https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python
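
Going by the PyPDF2 documentation, a minimal sketch of that approach
might look like the following. It is untested against your files and
assumes a PyPDF2 1.x release (installed with "pip install PyPDF2");
later releases renamed these calls. The names join_pages and
extract_text are made up for illustration, they are not part of the
library.

```python
# Sketch only: assumes PyPDF2 1.x, where PdfFileReader, getNumPages(),
# getPage() and extractText() are the documented names.


def join_pages(page_texts):
    """Join per-page strings into one text, skipping empty pages."""
    cleaned = (text.strip() for text in page_texts if text)
    return "\n\n".join(text for text in cleaned if text)


def extract_text(path):
    """Return the text of every page in the PDF at `path` as one string."""
    # Imported here so join_pages() still works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    # PDFs are binary files; keep the file open while pages are read.
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        pages = [reader.getPage(i).extractText()
                 for i in range(reader.getNumPages())]
    return join_pages(pages)
```

You would call it as extract_text("example.pdf"), where the file name
is only a placeholder. The PyPDF2 docs note that text extraction works
well for some PDFs and poorly for others, depending on the program
that generated them, so the result may need cleaning up before any
language processing.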

You don't give much detail about what you tried and how it failed, so
if the above doesn't work for you, I'd suggest providing more detail
as to what your problem is.

Paul

On 21 November 2017 at 15:18, Daniel Gross <grossd18 at gmail.com> wrote:
> Hi,
>
> I am new to Python and jumped right into trying to extract (English) text
> from PDF files.
>
> I tried various libraries (including slate) but am running into diverse
> problems, such as encoding errors or "buffer too small" errors deep inside
> some decompression code.
>
> Essentially, I want to extract all the text and then do some natural
> language processing on it. Is there some working sample code available,
> together with a clear description of the Python installation environment
> it needs?
>
> In slate, by the way, I got the buffer error. It seems I must "guess" the
> right encoding of the text included in the PDF when opening the file. I am
> still trying to figure out how to get the encoding info out of the PDF (if
> it is available there).
>
> thank you,
>
> Daniel
> --
> https://mail.python.org/mailman/listinfo/python-list


