Python File Handling: .xls .doc .pdf ???

Ian Bicking ianb at colorstudy.com
Fri Feb 7 04:22:07 EST 2003


On Fri, 2003-02-07 at 02:20, Ken Favrow wrote:
>     I'm trying to make a somewhat simple search engine, but would need to be
> able to read .xls .doc and possibly .pdf for it to be entirely useful. I
> just need to be able to see enough content to find keywords. I've already
> done it with txt and html. How might I accomplish this with the other
> formats??

wvWare (wvware.sf.net) reads Word documents decently -- it's a
command-line tool.  If you just want the properties then wvSummary (part
of wvWare) will give you that; otherwise wvText will dump the text.
wvSummary should work on .xls documents as well, since an .xls file is
just another OLE document, and the metadata (author, summary, etc.) is
stored the same way for .doc and .xls documents (and .ppt, etc.).  I've
used wvWare successfully from Python.
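
A minimal sketch of driving wvText from Python for keyword indexing.
It assumes wvText is on the PATH and is invoked as
"wvText input.doc output.txt"; check the usage of your installed
version.  The names word_text and keywords are hypothetical helpers,
not part of wvWare.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def word_text(doc_path, tool="wvText"):
    """Return the text of a Word document, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "out.txt"
        # Assumed invocation: wvText <input document> <output text file>
        subprocess.run([tool, str(doc_path), str(out_path)], check=True)
        return out_path.read_text(errors="replace")


def keywords(text):
    """Return the set of lowercased alphanumeric words found in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))
```

Typical use would be keywords(word_text("report.doc") or ""), so that
a missing converter yields an empty keyword set rather than an error.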

For PDF, Ghostscript includes some utilities to convert PDFs.  Xpdf also
includes a pdftotext converter.  There are probably many more utilities
as well.
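
For the PDF case, a rough sketch along the same lines, assuming
pdftotext is installed and that passing "-" as the output file sends
the text to standard output.  The function names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, tool="pdftotext"):
    """Build the command line; "-" asks pdftotext to write to stdout."""
    return [tool, str(pdf_path), "-"]


def pdf_text(pdf_path, tool="pdftotext"):
    """Return the text of a PDF, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path, tool),
        capture_output=True,
        check=True,
    )
    # The output encoding depends on the tool's settings; decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can be fed to the same keyword extraction already
used for the txt and html files.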



-- 
Ian Bicking  ianb at colorstudy.com  http://colorstudy.com
4869 N. Talman Ave., Chicago, IL 60625  /  773-275-7241
"There is no flag large enough to cover the shame of 
 killing innocent people" -- Howard Zinn




