Using python to convert PDF document to MSWord documents

Ksenia Marasanova ksenia at ksenia.nl
Tue Sep 28 17:30:23 EDT 2004


>> From: JEET <hjeet_in at yahoo.com>
>> Can anyone please tell me whether there are any Python modules
>> available to convert PDF documents to MS Word documents? If not, can
>> you please suggest how I can achieve this?
>
> No python modules, but:
> - feeding the subject line to google brings some sponsored links that 
> claim to solve your problem
> - http://www.quiss.org/swftools/ has a tool to convert PDF to Flash, 
> so there must be some code to detect Text, Fonts etc.
>

Pdf2swf is based on xpdf (http://www.foolabs.com/xpdf).
Another tool that is also based on xpdf is pdftohtml 
(http://pdftohtml.sourceforge.net/). It can convert PDF to HTML (using 
absolute CSS positioning) or to XML. I don't know whether there are any 
RTF or Word writers in Python, but in my previous VB life I wrote a 
simple Word macro that would open an HTML page and save it as a .doc 
document. It was the easiest way to get all images embedded and the 
formatting done correctly. I don't know, however, how Word will handle 
absolute positioning.
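
For what it's worth, the pdftohtml-then-Word route could be scripted from Python roughly like this. It is only a sketch under several assumptions: pdftohtml is on the PATH, the Word step runs on Windows with Word and the pywin32 package installed, and the function names are made up for illustration. The name of the HTML file that pdftohtml actually writes depends on its options, so check the output directory before passing a path to the Word step.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_base, as_xml=False):
    """Build the pdftohtml command line without running it (hypothetical helper)."""
    # -c asks for "complex" output (absolutely positioned CSS);
    # -xml asks for XML output instead.
    mode = "-xml" if as_xml else "-c"
    return ["pdftohtml", mode, str(pdf_path), str(out_base)]


def pdf_to_html(pdf_path, out_base, as_xml=False):
    """Run pdftohtml; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftohtml_command(pdf_path, out_base, as_xml), check=True)


def html_to_doc(html_path, doc_path):
    """Open an HTML file in Word via COM and save it as .doc (Windows only)."""
    # Assumption: pywin32 is installed. Imported lazily so the functions
    # above still work on machines without it.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    try:
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths.
        doc = word.Documents.Open(str(Path(html_path).resolve()))
        doc.SaveAs(str(Path(doc_path).resolve()), FileFormat=0)  # 0 = wdFormatDocument
        doc.Close()
    finally:
        word.Quit()
```

This does the same thing as the macro, only driven from Python, so the open question about absolute positioning applies here too.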

Another possible option is to convert the PDF to PostScript and then use 
pstoedit (http://www.pstoedit.net/pstoedit) with the shareware RTF plugin 
mentioned on that page. I don't have any experience with this option.
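
An untested sketch of that two-step route, in case it helps. It assumes pdftops (shipped with xpdf) and pstoedit are both on the PATH, and that the RTF plugin registers itself under the format name "rtf". That format name is a guess on my part, so check the plugin's documentation and pass a different `fmt` if needed. The function names are just for illustration.

```python
import subprocess


def ps_route_commands(pdf_path, ps_path, out_path, fmt="rtf"):
    """Return the two command lines for PDF -> PS -> RTF (hypothetical helper)."""
    return [
        # Step 1: xpdf's pdftops writes a PostScript file.
        ["pdftops", str(pdf_path), str(ps_path)],
        # Step 2: pstoedit converts it; -f selects the output format.
        # "rtf" is an ASSUMED format name for the shareware plugin.
        ["pstoedit", "-f", fmt, str(ps_path), str(out_path)],
    ]


def pdf_to_rtf_via_ps(pdf_path, ps_path, out_path, fmt="rtf"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in ps_route_commands(pdf_path, ps_path, out_path, fmt):
        subprocess.run(cmd, check=True)
```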

Ksenia.



