Html or Pdf to Rtf (Linux) with Python

Chas Emerick cemerick at snowtide.com
Wed Dec 15 17:50:21 EST 2004


I haven't seen any solid responses come across the wire, and I suspect 
there isn't a product or package that will do exactly what you want.

<blatent_self_promotion>
However, our company's product, PDFTextStream, does a phenomenal job 
of extracting text and metadata out of PDF documents.  It's crazy-fast, 
has a clean API, and in general gets the job done very nicely.  It 
presents two points of compromise from your ideal situation:

1. It only produces text, so you would have to take the text it 
provides and write it out as an RTF yourself (there are tons of 
packages and tools that do this).  Since the RTF format has pretty weak 
formatting capabilities compared to PDF (and even compared to 
HTML+CSS), you'd likely never reproduce the original layout/content of 
the source document anyway.

2. It is a Java library.  You indicated in a later message that you 
were aiming to use a Python package if possible, just out of personal 
preference.  Assuming no such package exists, and you are able to 
introduce a Java component into your project, this would become a 
non-issue.
</blatent_self_promotion>
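
As a rough illustration of point 1 (writing extracted text out as RTF 
yourself), here is a minimal Python sketch.  It assumes the text is 
already in a Python string, however it was extracted; the function names 
rtf_escape and text_to_rtf are made up for this example and are not part 
of any library.  It emits only a bare-bones RTF document with one font 
and no layout.

```python
def rtf_escape(text):
    """Escape a plain-text string for use inside an RTF document body."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax, so prefix them with a backslash.
            out.append("\\" + ch)
        elif ch == "\n":
            # A newline in the input becomes an RTF paragraph break.
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit \uN? where N is a signed 16-bit UTF-16 code unit
            # and "?" is the fallback character for readers without Unicode.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def text_to_rtf(text):
    """Wrap escaped text in a minimal RTF header (single font, 12 pt)."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + rtf_escape(text) + "}"


if __name__ == "__main__":
    # Hypothetical input; in practice this would be the extracted text.
    sample = "First line of extracted text.\nBraces {like these} are escaped."
    with open("output.rtf", "w", encoding="ascii") as f:
        f.write(text_to_rtf(sample))
```

Since everything outside ASCII is escaped, the resulting file is plain 
ASCII and should open in most word processors.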

Let me know what your questions are.

Chas Emerick
cemerick at snowtide.com
Snowtide Informatics Systems

PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/


Alexander Straschil wrote:
> Hello!
>
> I have to convert an HTML document to rtf with python, was just googling
> for an hour and did find nothing ;-(
> Has anybody an Idea how to convert (under Linux)  an HTML or Pdf Document
> to Rtf?
>
> Thanks, AXEL



