[XML-SIG] Advice needed: RTF->XML conversions

Tony McDonald tony.mcdonald@ncl.ac.uk
Thu, 17 May 2001 12:05:39 +0100


On 17/5/01 9:24 am, "Mike Olson" <Mike.Olson@fourthought.com> wrote:

> Tony McDonald wrote:
>> 
>> On 17/5/01 8:49 am, "Alexandre Fayolle" <Alexandre.Fayolle@logilab.fr>
>> wrote:
>> 
>>> On Thu, 17 May 2001, Tony McDonald wrote:
>>> 
>>>> Can anyone suggest some (preferably python based) tools I can use to get
>>>> from Word RTF (or even, gasp, the 'XML' Word 2001 expels as its HTML
>>>> pages)
>>>> to an XML form?
> 
> Can you send me a sample of the Word XML output, and the format you're
> looking for?  You can probably do it with a stylesheet as long as what
> Word spits out really is XML.
> 
> Mike
> 

Thanks for the offer Mike - I *was* under the impression that what Word spat
out was not real XML, but I found this (sorry, Java-based) program:
http://www-uk.hpl.hp.com/people/fabgia/wh2fo/wh2fo.html
which generates XML and XSL from the HTML files that Word 2000 generates.

Frankly, I'm amazed, as I thought that constructs such as

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta name=Title content="TRAUMA &amp; BURNS">
<meta name=Keywords content="">
<meta http-equiv=Content-Type content="text/html; charset=macintosh">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<link rel=File-List href="hepatology%20resource%20day_f_files/filelist.xml">
<link rel=Edit-Time-Data
href="hepatology%20resource%20day_f_files/editdata.mso">
<link rel=OLE-Object-Data
href="hepatology%20resource%20day_f_files/oledata.mso">

where attributes aren't quoted or, if they are, they're quoted with " or '
inconsistently, were very bad XML. I guess I was wrong.
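
A quick way to test whether a fragment like the one above is well-formed is
to hand it to expat. This is only a minimal sketch, assuming a Python that
ships the xml.parsers.expat module; is_well_formed is a made-up helper name,
and the two sample strings are hypothetical cut-down versions of the header
above, not Word's exact output.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if expat accepts text as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Second argument True marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical samples: same element, attribute unquoted vs. quoted.
unquoted = '<head><meta name=Generator content="Microsoft Word 9"/></head>'
quoted = '<head><meta name="Generator" content="Microsoft Word 9"/></head>'

print(is_well_formed(unquoted), is_well_formed(quoted))
```

The unquoted sample should be rejected and the quoted one accepted, since XML
requires every attribute value to be quoted.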

I still need to do some work with the XML that the above program produces and
would like to use Python for that as I'm *far* more comfortable with it than
Java.

If I've read you right, are you saying I could apply a stylesheet to this
XML to get to my output XML, which is then OK for my (finally!) Python-based
sgmlop processor that makes SQL? If so, I'll be very happy indeed!

Essentially I need to 'stack' the headings in the original document so that
this;

Heading 1 "Title"
    heading 2 "Overview"
        heading 3 "Core Content"
    heading 2 "Theme 1"

goes to:
<topic type="heading1" content="Title">
 <topic type="heading2" content="Overview">
  <topic type="heading3" content="Core Content">
  </topic>
 </topic>
 <topic type="heading2" content="Theme 1">
 </topic>
</topic>

If you're saying that I can use XSL stylesheets to get this to work, then I
need to do some reading!
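
Independently of XSL, the stacking step itself can be sketched in plain
Python. This is a rough sketch under assumptions, not a finished converter:
it assumes the headings have already been pulled out of the document as
(level, title) pairs, it uses xml.etree.ElementTree to build the tree, and
the stack_headings function and the <topics> wrapper element are made-up
names for illustration.

```python
import xml.etree.ElementTree as ET


def stack_headings(headings):
    """Nest (level, title) pairs into <topic> elements by heading level."""
    root = ET.Element("topics")  # hypothetical wrapper element
    stack = [(0, root)]          # (level, element) for each open topic
    for level, title in headings:
        # Close every open topic at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        topic = ET.SubElement(
            stack[-1][1],
            "topic",
            {"type": "heading%d" % level, "content": title},
        )
        stack.append((level, topic))
    return root


# The four headings from the example outline above.
headings = [
    (1, "Title"),
    (2, "Overview"),
    (3, "Core Content"),
    (2, "Theme 1"),
]
print(ET.tostring(stack_headings(headings), encoding="unicode"))
```

The stack holds the chain of currently open headings, so a new heading is
always attached to the nearest open heading with a smaller level number.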

Thanks for the comments,
Tone.
-- 
Dr Tony McDonald,  Assistant Director, FMCC, http://www.fmcc.org.uk/
The Medical School, Newcastle University Tel: +44 191 243 6140
A Zope list for UK HE/FE  http://www.fmcc.org.uk/mailman/listinfo/zope