[Tutor] extracting text from word files (.doc, .docx) and pdf

Tue Jan 25 23:21:13 CET 2011

On 01/25/2011 04:52 PM, Juan Jose Del Toro wrote:
> Dear List;
> 
> I am looking for a way to extract parts of the text from Word (.doc, .docx)
> files as well as PDF; the idea is to walk through the whole directory tree
> and populate a CSV file with an excerpt from each file.
> For PDF I found PyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
> to read .doc or .docx files.

A .docx file is a ZIP archive containing several XML files. I don't know
if there is a Python module for it, but you could probably whip up your
own. I know 7z on Windows will extract a .docx (probably any unzip tool
can if you point it at the file, but I'm not sure). From there you'll
need to explore the structure and how Microsoft decided to use XML.
ElementTree would probably be useful here. I'm not sure about .doc
files: a raw dump of one (e.g. with dd) shows some garbage (probably
useful for formatting ;-) as well as the text. I found this recipe:
http://code.activestate.com/recipes/279003-converting-word-documents-to-text/
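
To make the unzip-plus-ElementTree idea concrete, here is a rough sketch
in Python 3 using only the standard library. It assumes the main text
lives in word/document.xml inside the archive, with paragraphs in w:p
elements and text runs in w:t elements, which is the usual layout for
WordprocessingML. The names docx_text and write_excerpts and the
200-character excerpt length are my own, purely for illustration. It
only handles .docx; .doc and PDF would need something else.

```python
import csv
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}tag" notation.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    Assumption: the body text is stored in word/document.xml.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of each run.
        runs = [t.text or "" for t in p.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def write_excerpts(top, csv_path, length=200):
    """Walk `top` and write one (path, excerpt) row per .docx file found.

    No error handling: a corrupt or non-ZIP .docx will raise.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "excerpt"])
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                if name.lower().endswith(".docx"):
                    full = os.path.join(dirpath, name)
                    writer.writerow([full, docx_text(full)[:length]])
```

Something like write_excerpts("/path/to/docs", "excerpts.csv") would then
fill the CSV file. Headers, footers and footnotes are stored in other XML
files inside the archive, so this sketch skips them.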