Fast way of extracting files from various folders

Irmen de Jong irmen.NOSPAM at xs4all.nl
Fri May 1 12:36:39 EDT 2015


On 1-5-2015 14:28, subhabrata.banerji at gmail.com wrote:
> Dear Group,
> 
> I have several million documents in several folders and subfolders on my machine.
> I tried to write a script as follows, to extract all the .doc files and to convert them to text, but it seems to be taking too much time.
> 

[snip]

> But it seems to be taking too much time to convert to text and to append to the list. Is there any way I can do it faster? I am using Python 2.7 on Windows 7 Professional Edition. Apologies for any indentation errors.
> 
> Could anyone kindly suggest a solution?

Have you profiled and identified the part of your script that is slow?
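
If not, the standard library's cProfile is a reasonable place to start. A minimal
sketch, written for Python 3 (on 2.7 the io.StringIO line would need StringIO.StringIO
instead). Note that convert_all and profile_call are made-up names; convert_all is only
a stand-in for the main loop of the snipped script:

```python
import cProfile
import io
import pstats


def convert_all(paths):
    # Hypothetical stand-in for the real work (open each file, extract text).
    return [p.upper() for p in paths]


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first,
    # and limit the report to the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(convert_all, ["a.doc", "b.doc"])
    print(report)
```

Running this on a few hundred files rather than the full set should be enough to show
where the time goes.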

At first sight, though, your Python code, while not optimal, contains no obvious
performance problems. The slow part is most likely the COM interop call to Winword and
fetching the text through that interface. Imagine opening Word for "several million
documents"; no wonder it doesn't perform.

Investigate tools like antiword, wv, or docx2txt. I suspect they're quite a bit faster
than relying on Word itself.
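
As a rough sketch of that approach, assuming antiword is installed and on the PATH
(it takes a file name and writes the text to standard output): walk the tree with
os.walk and hand each .doc file to the converter via subprocess. The function names
here are made up for illustration, and I have not measured this against the COM
route:

```python
import os
import subprocess


def find_doc_files(root):
    """Yield the path of every .doc file under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".doc"):
                yield os.path.join(dirpath, name)


def doc_to_text(path, command=("antiword",)):
    """Return the converter's standard output for one file, as bytes.

    Assumption: the converter takes the file name as its last argument
    and writes plain text to standard output, as antiword does.
    """
    return subprocess.check_output(list(command) + [path])


def convert_tree(root, command=("antiword",)):
    """Yield (path, text) pairs, skipping files the converter rejects."""
    for path in find_doc_files(root):
        try:
            yield path, doc_to_text(path, command)
        except subprocess.CalledProcessError:
            # Non-zero exit status: treat the file as unreadable and move on.
            continue
```

Both functions are generators, so with millions of files you can write each result
out as it arrives instead of building one huge list in memory.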


Irmen




