converting file formats to txt

Gaurav Agarwal gaurav.agarwal1904 at gmail.com
Tue Jul 4 11:38:47 EDT 2006


Thanks Steven. Actually I wanted to do text processing for my office,
where I can view all the files in the system and use the first three to
give a summary of each document, instead of having somebody actually
enter the summary. It seems there is no single piece of code that can act
as a converter across formats, so I'll have to check out converters for
individual formats.
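
A rough sketch of that per-format approach might look like the code
below. It is only an illustration under stated assumptions: "the first
three" is taken to mean the first three non-empty lines, the names
CONVERTERS, html_to_text, plain_to_text and summarize are made up for
this example, and only plain text and HTML are handled, using the
standard library. DOC, PDF, RTF and PS would each need their own
converter registered in the same table.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(raw):
    """Strip tags from HTML bytes (assumes UTF-8; bad bytes are replaced)."""
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return "".join(parser.parts)


def plain_to_text(raw):
    """Plain text needs decoding only (assumes UTF-8)."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical dispatch table: one converter per file extension.
CONVERTERS = {
    ".txt": plain_to_text,
    ".html": html_to_text,
    ".htm": html_to_text,
}


def summarize(path, n_lines=3):
    """Return the first n_lines non-empty lines, or None if unsupported."""
    path = Path(path)
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        return None
    text = converter(path.read_bytes())
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines[:n_lines])
```

Calling summarize("report.html") on a hypothetical file would return its
first three non-empty lines of text, and None for any extension that has
no converter registered yet.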

Thanks and Regards,
Gaurav Agarwal

Steven D'Aprano wrote:
> On Tue, 04 Jul 2006 06:32:13 -0700, Gaurav Agarwal wrote:
>
> > Hi,
> >
> > I wanted a script that can convert any file format (RTF/DOC/HTML/PDF/PS
> > etc) to text format.
>
> RTF, HTML and PS are already text format.
>
> DOC is a secret, closed proprietary format. It will be a lot of work
> reverse-engineering it. Perhaps you should consider using existing tools
> that already do it -- see, for example, the word processors Abiword and
> OpenOffice. They are open-source, so you can read and learn from their
> code. Alternatively, you could try some of the suggestions here:
>
> http://www.linux.com/article.pl?sid=06/02/22/201247
>
> Or you could just run through the .doc file, filtering out binary
> characters, and display just the text characters. That's a quick-and-dirty
> strategy that might help.
>
> PDF is (I believe) a compressed, binary format of PS. Perhaps you should
> look at the program pdf2ps -- maybe it will help.
>
> If you explain your needs in a little more detail, perhaps people can give
> you answers which are a little more helpful.
> 
> 
> 
> -- 
> Steven.
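
The quick-and-dirty strategy quoted above (filter out the binary
characters and keep the text) could be sketched roughly as follows,
much like the Unix "strings" tool. The function name extract_text_runs
and the minimum run length of 4 are assumptions made for this example.
It only finds runs of 8-bit printable ASCII, so it would likely miss
text that a .doc file stores as 16-bit characters.

```python
import re


def extract_text_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes as strings.

    'Printable' here means tab plus the range space..tilde; everything
    else is treated as binary noise that separates the runs.
    """
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00\xffab\x00Second run\x02"
    for run in extract_text_runs(sample):
        print(run)
```

On real files the output will usually include font names and other
metadata alongside the document text, so it needs a manual look before
being used as a summary.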



