Python File Handling: .xls .doc .pdf ???

Gerhard Häring gerhard.haering at opus-gmbh.net
Fri Feb 7 03:35:43 EST 2003


Ken Favrow <KenFavrow at attbi.com> wrote:
> I'm trying to make a somewhat simple search engine, but would need to be
> able to read .xls .doc and possibly .pdf for it to be entirely useful. I
> just need to be able to see enough content to find keywords. I've already
> done it with txt and html. How might I accomplish this with the other
> formats??

There are various utilities to convert these formats into plain text:
antiword, catdoc, xlHtml, ...
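
A minimal sketch of calling such a converter from Python, assuming antiword
or catdoc is installed and prints the extracted text to standard output when
given a file name. The CONVERTERS table and both function names are
hypothetical, invented for this illustration:

```python
import os
import shutil
import subprocess

# Hypothetical mapping: file extension -> converter commands, in preference order.
# Assumes each tool writes plain text to stdout when called as `tool <file>`.
CONVERTERS = {
    ".doc": ["antiword", "catdoc"],
}


def pick_converter(path, which=shutil.which):
    """Return the first installed converter for this file type, or None."""
    ext = os.path.splitext(path)[1].lower()
    for tool in CONVERTERS.get(ext, []):
        if which(tool):
            return tool
    return None


def extract_text(path):
    """Run the chosen converter and return its output as a string."""
    tool = pick_converter(path)
    if tool is None:
        raise RuntimeError("no converter available for %r" % path)
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # Output encoding depends on the tool and locale; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

For keyword search it is probably enough to lowercase and split the returned
string; layout is lost in the conversion anyway.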

Some of these converters produce HTML, but HTML is easily converted to
plain text with: $commandline_browser -dump <html-file> where
commandline_browser in ('lynx', 'w3m', 'links').
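
The -dump step could be wrapped roughly like this. It is only a sketch: it
assumes at least one of the three browsers is on PATH, and the function names
are made up for the example:

```python
import shutil
import subprocess

# The three text-mode browsers named above; each accepts `-dump <html-file>`.
BROWSERS = ("lynx", "w3m", "links")


def dump_command(html_path, which=shutil.which):
    """Build the argument list for the first browser found on PATH."""
    for browser in BROWSERS:
        if which(browser):
            return [browser, "-dump", html_path]
    raise RuntimeError("none of lynx, w3m, links found on PATH")


def html_to_text(html_path):
    """Render an HTML file to plain text via a command-line browser."""
    result = subprocess.run(dump_command(html_path), capture_output=True, check=True)
    # Output encoding depends on the browser and locale; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

A converter that emits HTML would first have its output saved to a temporary
file, which is then passed to html_to_text().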

http://www.spocom.com/users/gjohnson/mutt/#office might be of interest to
you, as it includes links to all of these utilities.

-- Gerhard
