Fast way of extracting files from various folders
subhabrata.banerji at gmail.com
Sat May 2 06:44:25 EDT 2015
On Saturday, May 2, 2015 at 2:52:32 PM UTC+5:30, Peter Otten wrote:
> wrote:
>
> > I have several million documents in several folders and subfolders on
> > my machine. I tried to write the following script to extract all the .doc
> > files and convert them to text, but it seems it is taking too much
> > time.
> >
> > import os
> > from fnmatch import fnmatch
> > import win32com.client
> >
> > def listallfiles2():
> >     root = r'C:\Cand_Res'   # raw string so the backslash stays literal
> >     pattern = "*.doc"       # fnmatch anchors at the end, so .docx does not match
> >     list1 = []
> >     for path, subdirs, files in os.walk(root):
> >         for name in files:
> >             if fnmatch(name, pattern):
> >                 #EXTRACTING ONLY .DOC FILES
> >                 file_name1 = os.path.join(path, name)
> >                 try:
> >                     doc = win32com.client.GetObject(file_name1)
> >                     text = doc.Range().Text
> >                     text1 = text.encode('ascii', 'ignore')
> >                     text_word = text1.split()
> >                     #print "Text for Document File Is:", text1
> >                     list1.append(text_word)
> >                     print "It is a Doc file"
> >                 except Exception:
> >                     # a bare except would also swallow KeyboardInterrupt
> >                     print "DOC ISSUE"
> >     return list1
> >
> > But it seems it is taking too much time to convert the documents to text
> > and append them to the list. Is there any way I can do it faster? I am
> > using Python 2.7 on Windows 7 Professional Edition. Apologies for any
> > indentation errors.
> >
> > If anyone could kindly suggest a solution.
>
> It will not help the first time through your documents, but if you write the
> words for each Word document to one .txt file per .doc, and the original
> files rarely change, you can read from the .txt files when you run your
> script a second time. Just make sure that the .txt is newer than the
> corresponding .doc by checking the file times.
>
> In short: use a caching strategy.
Thanks Peter. I'll surely check on that. Regards, Subhabrata Banerjee.
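For reference, the caching strategy Peter describes might be sketched roughly like this (a minimal sketch: `extract_words` is a hypothetical stand-in for the win32com extraction in the script above, and keeping each cache file next to its .doc is an assumption, not the only possible layout):

```python
import os

def cache_is_fresh(doc_path, txt_path):
    # The cached .txt is usable only if it exists and is at least as
    # new as the .doc it was extracted from.
    return (os.path.exists(txt_path) and
            os.path.getmtime(txt_path) >= os.path.getmtime(doc_path))

def words_for(doc_path, extract_words):
    # extract_words(doc_path) -> list of words; in the original script
    # this would be the win32com.client.GetObject(...) extraction.
    txt_path = doc_path + '.txt'   # cache file sits next to the .doc
    if cache_is_fresh(doc_path, txt_path):
        with open(txt_path) as f:
            return f.read().split()
    words = extract_words(doc_path)
    with open(txt_path, 'w') as f:  # refresh the cache for the next run
        f.write(' '.join(words))
    return words
```

On the first run every document still goes through Word, but on later runs only documents whose .doc is newer than its cached .txt are re-extracted; everything else is served from the plain-text cache.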
More information about the Python-list mailing list