converting file formats to txt

Steven D'Aprano steve at REMOVETHIScyber.com.au
Tue Jul 4 10:49:43 EDT 2006


On Tue, 04 Jul 2006 06:32:13 -0700, Gaurav Agarwal wrote:

> Hi,
> 
> I wanted a script that can convert any file format (RTF/DOC/HTML/PDF/PS
> etc) to text format.

RTF, HTML and PS are already text formats, although the text you want is
mixed in with markup.
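
For HTML, being a text format still leaves the tags interleaved with the
words. A minimal sketch of stripping them with the standard library's
html.parser follows. The names TextExtractor and html_to_text are made up
for illustration, and the choice to drop script and style content is an
assumption about what "text" should mean here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # visible text fragments, in document order
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text fragments of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This makes no attempt to preserve layout such as paragraphs or tables; it
only shows that the conversion is a matter of discarding markup.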

DOC is a secret, closed proprietary format. Reverse-engineering it will be
a lot of work. Perhaps you should consider using existing tools that
already read it -- see, for example, the word processors AbiWord and
OpenOffice. They are open-source, so you can read and learn from their
code. Alternatively, you could try some of the suggestions here:

http://www.linux.com/article.pl?sid=06/02/22/201247

Or you could just run through the .doc file, filtering out binary
characters, and display just the text characters. That's a quick-and-dirty
strategy that might help.
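
A rough sketch of that quick-and-dirty filter, in the spirit of the Unix
"strings" tool: keep only runs of printable ASCII bytes above some minimum
length. The function name extract_text_runs and the default minimum of 4
are arbitrary choices for illustration. Note the assumption that the text
is stored one byte per character; anything stored in a 16-bit encoding
would be missed by this approach.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII (and tabs) at least min_len long."""
    # \x20-\x7e covers space through tilde; everything else is dropped.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def file_to_text(path: str, min_len: int = 4) -> str:
    """Read a file as raw bytes and join its text runs, one per line."""
    with open(path, "rb") as f:
        return "\n".join(extract_text_runs(f.read(), min_len))
```

Expect some junk in the output (font names, field codes and the like),
since nothing here understands the file's structure.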

PDF is (I believe) a binary format derived from PS, usually with its
contents compressed. Perhaps you should look at the program pdf2ps --
maybe it will help.
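
If you want to drive pdf2ps from Python, something like the sketch below
might do. It assumes pdf2ps (part of Ghostscript) is installed and on the
PATH, and that it is called as "pdf2ps input.pdf output.ps". The function
names are invented for this example. The result is still PostScript, not
plain text, so a further step would be needed.

```python
import shutil
import subprocess


def pdf2ps_command(pdf_path: str, ps_path: str) -> list[str]:
    """Build the argument list for a pdf2ps invocation."""
    return ["pdf2ps", pdf_path, ps_path]


def convert_pdf_to_ps(pdf_path: str, ps_path: str) -> bool:
    """Run pdf2ps if it is available; return False if it is not installed."""
    if shutil.which("pdf2ps") is None:
        return False
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(pdf2ps_command(pdf_path, ps_path), check=True)
    return True
```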

If you explain your needs in a little more detail, perhaps people can give
you answers which are a little more helpful.



-- 
Steven.



