MS Word parser

Ben C spamspam at spam.eggs
Thu Jun 14 17:30:43 EDT 2007


On 2007-06-13, kenicheema at gmail.com <kenicheema at gmail.com> wrote:
> On Jun 13, 1:28 am, Tim Golden <m... at timgolden.me.uk> wrote:
>> keniche... at gmail.com wrote:
>> > Hi all,
>> > I'm currently using antiword to extract content from MS Word files.
>> > Is there another way to do this without relying on any command prompt
>> > application?
>>
>> Well you haven't given your environment, but is there
>> anything to stop you from controlling Word itself via
>> COM? I'm no Word expert, but looking around, this
>> seems to work:
>>
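>> <code>
>> import win32com.client
>>
>> # Drive Word itself via COM (Windows only; needs pywin32 and Word)
>> word = win32com.client.Dispatch ("Word.Application")
>> doc = word.Documents.Open ("c:/temp/temp.doc")
>> text = doc.Range ().Text
>> doc.Close (False)  # close without saving changes
>> word.Quit ()       # don't leave a hidden Word process running
>>
>> # Binary mode, since the text is encoded to UTF-8 bytes explicitly
>> open ("c:/temp/temp.txt", "wb").write (text.encode ("UTF-8"))
>> </code>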
>>
>> TJG
>
> Tim,
> I'm on Linux (RedHat) so using Word is not an option for me.  Any
> other suggestions?

There is OpenOffice, which can read Word files and has a Python API
(PyUNO, the Python binding to its UNO component model). But piping
through antiword is probably easier.
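
For what it's worth, "piping through antiword" from Python could look
roughly like the sketch below. It assumes a current Python 3, that
antiword is installed and on PATH, and that its UTF-8.txt mapping file
is available so the output comes back as UTF-8. The function name
doc_to_text and its command parameter are made up for illustration;
they are not part of antiword or of any library.

```python
import subprocess


def doc_to_text(path, command=("antiword", "-m", "UTF-8.txt")):
    """Return the text of a Word file by running a converter on it.

    `command` is the converter's argv without the file name. The default
    is an assumption: antiword on PATH with its UTF-8 mapping file.
    """
    result = subprocess.run(
        [*command, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Decode explicitly instead of relying on the locale's default encoding.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling doc_to_text("report.doc") (a hypothetical file name) would then
return the document text as a string. No shell is involved, so file
names with spaces or quotes need no escaping, and a failed conversion
raises subprocess.CalledProcessError rather than returning an empty
string.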


