[Tutor] Parsing Word Docs

Tim Golden mail at timgolden.me.uk
Thu Mar 8 17:06:17 CET 2007


Stephen Nelson-Smith wrote:
> Hello all,
> 
> I have a directory containing a load of word documents, say 100 or so,
> which is updated every hour.
> 
> I want a cgi script that effectively does a grep on the word docs, and
> returns each doc that matches the search term.
> 
> I've had a look at doing this by looking at each binary file and
> reimplementing strings(1) to capture useful info.  I've also read that
> one can treat a word doc as a COM object.  Am I right in thinking that
> I can't do this on python under unix?
> 
> What other ways are there?  Or is the binary parsing the way to go?

Simplest thing's probably antiword (http://www.winfield.demon.nl/),
which runs on Unix and dumps a Word doc as plain text, and then
whatever text-scanning approach you want on its output.
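
A rough sketch of that idea in Python, for illustration only. It assumes
the antiword binary is installed and on the PATH, that the files end in
.doc, and that a case-insensitive substring match is good enough; the
function names (antiword_text, matching_docs) are made up for this
example, and the output encoding will depend on how antiword is set up.

```python
import os
import subprocess


def antiword_text(path):
    """Return the plain text of a Word file by running the antiword CLI."""
    # Assumption: "antiword <file>" writes the document text to stdout.
    result = subprocess.run(
        ["antiword", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # The real encoding depends on antiword's character mapping;
    # undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")


def matching_docs(directory, term, extract=antiword_text):
    """Return the sorted names of .doc files whose text contains term."""
    term = term.lower()
    matches = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".doc"):
            continue
        text = extract(os.path.join(directory, name))
        if term in text.lower():
            matches.append(name)
    return matches
```

A CGI script could call matching_docs() with the directory and the
search term from the form and print the returned names. The extract
argument is only there so a different converter can be swapped in, or
so the search can be tried out without antiword installed.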

TJG

