converting file formats to txt
Steven D'Aprano
steve at REMOVETHIScyber.com.au
Tue Jul 4 10:49:43 EDT 2006
On Tue, 04 Jul 2006 06:32:13 -0700, Gaurav Agarwal wrote:
> Hi,
>
> I wanted a script that can convert any file format (RTF/DOC/HTML/PDF/PS
> etc) to text format.
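RTF, HTML and PS are already text formats: the files themselves are plain
text, although the readable content is mixed in with markup (or, for PS,
drawing commands).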
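For HTML, getting at the readable text is mostly a matter of dropping the
tags. Here is a minimal sketch using the standard library's html.parser
module. The names TextExtractor and html_to_text are made up for this
example, and the whitespace handling is deliberately crude (everything is
collapsed to single spaces), so treat it as a starting point rather than a
finished converter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())
```

Calling html_to_text() on the contents of a saved web page returns one long
line of text. RTF and PS would each need their own, different handling.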
DOC is a secret, closed proprietary format. It will be a lot of work
reverse-engineering it. Perhaps you should consider using existing tools
that already do it -- see, for example, the word processors Abiword and
OpenOffice. They are open-source, so you can read and learn from their
code. Alternatively, you could try some of the suggestions here:
http://www.linux.com/article.pl?sid=06/02/22/201247
Or you could just run through the .doc file, filtering out binary
characters, and display just the text characters. That's a quick-and-dirty
strategy that might help.
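A rough sketch of that quick-and-dirty filter, in the spirit of the Unix
strings utility: read the file as bytes and keep only runs of printable
ASCII. The function names and the min_len threshold of 4 are arbitrary
choices for illustration. It knows nothing about the DOC format itself, so
expect stray junk in the output, and any text stored as 16-bit characters
will be missed by an ASCII-only filter like this one.

```python
import re


def extract_text_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # Printable ASCII is space (0x20) through tilde (0x7e); tabs are kept too.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def binary_file_to_text(path, min_len=4):
    """Read a file as bytes and join its printable runs, one per line."""
    with open(path, "rb") as f:
        return "\n".join(extract_text_runs(f.read(), min_len))
```

Raising min_len cuts down on noise at the cost of losing short words that
happen to sit between binary bytes.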
PDF is (I believe) a binary format descended from PS, usually with
compressed contents. Perhaps you should look at the program pdf2ps (part of
Ghostscript), which converts PDF to PS -- maybe it will help.
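If you want to drive pdf2ps from a Python script, something like the
following might do. It assumes pdf2ps is installed and on your PATH and is
called as "pdf2ps input.pdf output.ps"; the helper names are invented for
this example. Note that the result is still PostScript, not plain text, so
this is only one step of the job.

```python
import shutil
import subprocess


def pdf2ps_command(pdf_path, ps_path):
    """Build the argument list for pdf2ps (input file, then output file)."""
    return ["pdf2ps", pdf_path, ps_path]


def convert_pdf_to_ps(pdf_path, ps_path):
    """Run pdf2ps, raising an error if it is missing or the conversion fails."""
    if shutil.which("pdf2ps") is None:
        raise RuntimeError("pdf2ps not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(pdf2ps_command(pdf_path, ps_path), check=True)
    return ps_path
```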
If you explain your needs in a little more detail, perhaps people can give
you answers which are a little more helpful.
--
Steven.