access to some text string in PDFs

Robert Pazur pazurrobert at gmail.com
Fri May 6 09:12:35 EDT 2011


Hi Chris,

thanks for the fast reply and all the recommendations, they helped me a lot!
As you recommended, I used the PDFMiner module to extract the text from the
PDF files and then, with file.xreadlines(), I located the lines where my
keyword ("factors" in this case) appears.
So far I only extract the matching lines, but I'm wondering whether it's
possible to extract just the whole sentences (and only those) in which my
keyword ("factors" in this case) is located (one possible approach is
sketched after the script below).

I used the following script:

import os, subprocess

path = "C:\\PDF"  # insert the path to the directory of interest here
for fname in os.listdir(path):
    if not fname.endswith(".pdf"):
        continue
    # rstrip(".pdf") strips a *set of characters*, not a suffix,
    # so build the output name with os.path.splitext instead
    output = os.path.splitext(fname)[0] + ".txt"
    subprocess.call(["C:\\Python26\\python.exe", "pdf2txt.py",
                     "-o", output, os.path.join(path, fname)])
    print fname
    txt = open(output)
    for line in txt:  # xreadlines() is deprecated; iterate the file directly
        if "driving" in line:
            print line
    txt.close()
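
A minimal sketch of one way to get whole sentences instead of lines,
assuming sentences end with ".", "!" or "?" followed by whitespace (the
splitting rule is naive and "output.txt" is just a placeholder name):

import re

text = open("output.txt").read()  # the text produced by pdf2txt.py
# split on sentence-ending punctuation followed by whitespace;
# abbreviations like "e.g." will be mis-split, but it is a start
for sentence in re.split(r'(?<=[.!?])\s+', text):
    if "factors" in sentence:
        print sentence.strip()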

-------------------------------------------------------
Robert Pazur
Mobile : +421 948 001 705
Skype  : ruegdeg


2011/5/6 Chris Rebert <clp2 at rebertia.com>

> On Thu, May 5, 2011 at 2:26 PM, Robert Pazur <pazurrobert at gmail.com>
> wrote:
> > Dear all,
> > I would like to access some text and count occurrences, as follows:
> > I have lots of PDFs of scientific articles and I want to preview
> > which words are usually related to, for example, "determinants".
> > As an example, an article contains the sentence "...elevation is the
> > most important determinant...".
> > How can I acquire the "elevation" string?
> > Of course, I don't know where in the article the sentence is located
> > or which particular word it could be.
> > Any suggestions?
>
> Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
> use something similar to n-grams on the extracted text, filtering out
> those that don't contain "determinant(s)". Then just keep a word
> frequency table for the remaining n-grams.
>
> Not-quite-pseudo-code (words_from_pdf stands in for an iterable of the
> words in the extracted text):
> from collections import defaultdict, deque
> N = 7  # length of n-grams to consider; tune as needed
> buf = deque(maxlen=N)  # sliding window over the last N words
> targets = frozenset(("determinant", "determinants"))
> steps_until_gone = 0  # windows left that still contain a target word
> word2freq = defaultdict(int)
> for word in words_from_pdf:
>     if word in targets:
>         steps_until_gone = N
>     buf.append(word)
>     if steps_until_gone:
>         # tally every non-target word in the current window
>         for related_word in buf:
>             if related_word not in targets:
>                 word2freq[related_word] += 1
>         steps_until_gone -= 1
> for count, word in sorted((v, k) for k, v in word2freq.iteritems()):
>     print word, ':', count
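>
> One crude way to produce words_from_pdf from the extracted text
> ("output.txt" is just a placeholder for the pdf2txt.py output file):
>
> import re
> text = open("output.txt").read()
> # lowercase and keep alphabetic runs only; punctuation is dropped
> words_from_pdf = re.findall(r"[a-z]+", text.lower())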
>
> Making this more efficient and less naive is left as an exercise to the
> reader.
> There may very well already be something similar but more
> sophisticated in NLTK[4]; I've never used it, so I dunno.
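>
> From the NLTK docs, something like the following looks like a starting
> point (untested; nltk.Text.concordance prints keyword-in-context
> matches, and the tokenizer data may need a one-time download):
>
> import nltk
> # nltk.download('punkt')  # tokenizer data, if not already installed
> tokens = nltk.word_tokenize(text)  # text = the extracted PDF text
> nltk.Text(tokens).concordance("determinant")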
>
> [1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2]: http://pybrary.net/pyPdf/
> [3]: http://www.reportlab.com/software/#pagecatcher
> [4]: http://www.nltk.org/
>
> Cheers,
> Chris
> --
> http://rebertia.com
>