[XML-SIG] Using xpath/xslt on proprietary object structures.

Xhaus Main Account pyxml@xhaus.com
Fri, 21 Sep 2001 09:01:19 +0100


Thanks for the tips, Alexandre.

However, from what I can see, pDomlette and cDomlette are not suitable for my
needs due to incomplete DOM support (I'll briefly explain my app below).

There is no support for "getElementsByTagName" and
"getElementsByTagNameNS", which I use. That's not a major problem, since
I can use xpath expressions to replace them.
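
A minimal sketch of that substitution. It uses the standard library's
ElementTree rather than pDomlette (an assumption made purely for illustration),
and the helper name elements_by_tag_name is hypothetical. The sample is cut
down from the publications file shown later in this message.

```python
import xml.etree.ElementTree as ET

# Sample data, reduced from the publications file shown later in this message.
SAMPLE = """<PUBS-IAP>
<EDITORPUB>
<AUTHORS>
<IAPAUTHOR id="chiodini">Chiodini RJ</IAPAUTHOR>
<IAPAUTHOR id="collins">Collins MT</IAPAUTHOR>
<AUTHOR>Bassey EOE</AUTHOR>
</AUTHORS>
</EDITORPUB>
</PUBS-IAP>"""


def elements_by_tag_name(root, tag):
    """Return all descendants of root named tag, in document order."""
    # ".//tag" selects matching elements at any depth below root,
    # which is what getElementsByTagName does on an element.
    return root.findall(".//" + tag)


root = ET.fromstring(SAMPLE)
iap_authors = elements_by_tag_name(root, "IAPAUTHOR")
author_ids = [el.get("id") for el in iap_authors]
```

ElementTree only supports a subset of XPath, so anything beyond simple
descendant and attribute tests would need a full XPath engine.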

However, I do need support for "importNode". I can see that pDomlette does not
have it, and I think cDomlette doesn't have it either (I checked by doing
"strings cDomlettec.pyd | grep importNode"). Am I wrong here?
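
For reference, this is the behaviour I depend on, sketched with the standard
library's xml.dom.minidom, which does implement importNode. It is only an
illustration of the DOM call, not of either Domlette; the element names follow
the examples further down.

```python
from xml.dom.minidom import parseString

# Two independent documents, as when a wrapper pulls in a member data file.
wrapper = parseString('<MEMWRAP id="collins"/>')
data = parseString('<MEMBER id="collins"><LAST>Collins</LAST></MEMBER>')

# importNode(node, deep) returns a copy owned by the target document;
# the source document is left unchanged.
copy = wrapper.importNode(data.documentElement, True)
wrapper.documentElement.appendChild(copy)

result = wrapper.documentElement.toxml()
```

Without importNode, a node created by one document cannot legitimately be
attached to another, which is why the missing method is a blocker here.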

Brief summary of the app:

I have ~250 members of a scientific association, all of whom have contact
details, and lists of publications. I have to generate a home page for each
member, "directory lists" for all members by name, by country (of which there
are 24) and by "Research speciality", of which there are 9.

Each member has at least two files.

1. A "wrapper" file

<MEMWRAP id="collins">

<xlink:xlink
 xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:href="members.dat/collins_dat.xml"
 xlink:title="Member data"
 xlink:role="memData"
 xlink:show="transclude"/>

<xlink:xlink
 xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:href="members.ref/collins_iap.xml"
 xlink:title="Member IAP references"
 xlink:role="pubData"
 xlink:show="transclude"/>

<!-- Further xlinks not included    -->

</MEMWRAP>

2. A data file

<?xml version="1.0" encoding="iso-8859-1"?>
<MEMBER id="collins">

<TITLE-ABBREV>Dr.</TITLE-ABBREV>
<FIRST>Michael</FIRST>
<MIDDLE>T.</MIDDLE>
<LAST>Collins</LAST>
<OFFICE>President</OFFICE>
<ORGANIZATION>University of Wisconsin</ORGANIZATION>
<ADDRESS>
<LINE1>School of Veterinary Medicine</LINE1>
<STATE id="Wisconsin">Wisconsin 53706-1102</STATE>
<COUNTRY id="us">USA</COUNTRY>
</ADDRESS>
<PHONE>(608) 262-8457</PHONE><FAX>(608) 265-6463</FAX>
<EMAIL>mcollin5@facstaff.wisc.edu</EMAIL>
<WWW>http://www.vetmed.wisc.edu/pbs/johnes/mtc.html</WWW>
</MEMBER>

3. And possibly some files for publications lists

<?xml version="1.0" encoding="iso-8859-1"?>
<PUBS-IAP>

<EDITORPUB>
<AUTHORS>
<IAPAUTHOR id="chiodini">Chiodini RJ</IAPAUTHOR>
<IAPAUTHOR id="collins">Collins MT</IAPAUTHOR>
<AUTHOR>Bassey EOE</AUTHOR>
</AUTHORS>
<YEAR>1995</YEAR>
<IAPJOURNAL vol="proc4">Proceedings of the Fourth International Colloquium on
Paratuberculosis.</IAPJOURNAL><STATUS>397 pp.</STATUS>
</EDITORPUB>

<xlink:xlink
 xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:href="../../proc6.dat/abst4_30.xml"
 xlink:title="Member reference"
 xlink:role="pubData" xlink:show="transclude"/>

<xlink:xlink
 xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:href="../../proc6.dat/abst4_29.xml"
 xlink:title="Member reference"
 xlink:role="pubData" xlink:show="transclude"/>

</PUBS-IAP>

<!-- Lots more xlinks excluded    -->

When the member wrapper file is loaded, the xlinks are processed, and any
"transclude" xlinks (non-standard, I know) are replaced with the DOM tree of
the file they refer to. The resulting DOM is then processed with XSLT to
generate a "member home page", like this

http://www.paratuberculosis.org/members/collins.htm
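
The transclusion step can be sketched roughly as follows, again with
xml.dom.minidom standing in for the real DOM implementation. The FILES
dictionary and the load callback are hypothetical stand-ins for reading files
from disk, and the sketch does not resolve relative hrefs against the
including file.

```python
from xml.dom.minidom import parseString

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical in-memory stand-in for the files on disk, keyed by href.
FILES = {
    "members.dat/collins_dat.xml":
        '<MEMBER id="collins"><LAST>Collins</LAST></MEMBER>',
}

WRAPPER = (
    '<MEMWRAP id="collins">'
    '<xlink:xlink xmlns:xlink="http://www.w3.org/1999/xlink"'
    ' xlink:href="members.dat/collins_dat.xml"'
    ' xlink:role="memData" xlink:show="transclude"/>'
    '</MEMWRAP>'
)


def transclude(doc, load):
    """Replace each transclude xlink in doc with the tree it points to."""
    # Copy the node list first, because the tree is modified in the loop.
    links = list(doc.getElementsByTagNameNS(XLINK_NS, "xlink"))
    for link in links:
        if link.getAttributeNS(XLINK_NS, "show") != "transclude":
            continue
        target = parseString(load(link.getAttributeNS(XLINK_NS, "href")))
        transclude(target, load)  # resolve nested xlinks first
        imported = doc.importNode(target.documentElement, True)
        link.parentNode.replaceChild(imported, link)
        target.unlink()  # release the source tree once it has been copied
    return doc


doc = transclude(parseString(WRAPPER), FILES.__getitem__)
merged = doc.documentElement.toxml()
```

The merged document is what then gets handed to the XSLT processor.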

The publications xlinks point to files like this

<?xml version="1.0" encoding="iso-8859-1"?>
<REFERENCE vol="proc6" volname="6th International Colloquium on
Paratuberculosis" section="4" sectname="Diagnostic Applications And
Approaches"  id="abst4_30">
<IAPJOURNAL vol="proc6">Proc. 6th Intl. Coll. Paratuberculosis: Manning EJB,
Collins MT (eds)</IAPJOURNAL>
<TITLE>Modification of a bovine ELISA to detect camelid antibodies to
Mycobacterium paratuberculosis.</TITLE>
<YEAR>1999</YEAR>
<AUTHORS>
<AUTHOR>Kramsky JA<NOTE>1</NOTE></AUTHOR>
<IAPAUTHOR id="miller">Miller DS<NOTE>2</NOTE></IAPAUTHOR>
<IAPAUTHOR id="hope">Hope A<NOTE>3</NOTE></IAPAUTHOR>
<IAPAUTHOR id="collins">Collins MT<NOTE>1</NOTE></IAPAUTHOR>
</AUTHORS>
<COUNTRY>Australia</COUNTRY>
<ABSTRACT>
Mycobacterium paratuberculosis infection, or Johne's disease, has a reportedly
low prevalence in South American camelid populations. ..............
<!-- Remainder of abstract not included    -->
</ABSTRACT>
</REFERENCE>

When it appears on the member home page, only the summary details of the
abstract are extracted. When the entire abstract is to be used to generate an
"abstract page", like this one

http://www.paratuberculosis.org/proc6/abst4_30.htm

I use another wrapper file such as

<?xml version="1.0" encoding="UTF-8"?>
<REFWRAP>
<xlink:xlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:role="pubData"
xlink:show="transclude" xlink:href="../proc6.dat/abst4_30.xml"/>
</REFWRAP>

And so on and so on. I also generate a lot of files from scratch, by creating
DOMs and then populating them with the data from the members data files. The
resulting DOMs are then processed with XSLT to generate "directory listings",
such as

http://www.paratuberculosis.org/member/byname.htm
http://www.paratuberculosis.org/country/us.htm
http://www.paratuberculosis.org/research/crohns.htm
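
Building one of those listing DOMs from scratch looks roughly like this. It is
a sketch with xml.dom.minidom; DIRECTORY and ENTRY are made-up element names,
and MEMBER_FILES is a hypothetical in-memory stand-in for the member data
files.

```python
from xml.dom.minidom import getDOMImplementation, parseString

# Hypothetical member data, reduced to the fields a listing needs.
MEMBER_FILES = [
    '<MEMBER id="collins"><FIRST>Michael</FIRST><LAST>Collins</LAST>'
    '<ADDRESS><COUNTRY id="us">USA</COUNTRY></ADDRESS></MEMBER>',
]


def text_of(parent, tag):
    """Return the text of the first element named tag under parent."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def build_country_listing(country_id, member_xml):
    """Create a new DOM listing the members located in one country."""
    listing = getDOMImplementation().createDocument(None, "DIRECTORY", None)
    root = listing.documentElement
    root.setAttribute("country", country_id)
    for xml_text in member_xml:
        member = parseString(xml_text)
        country = member.getElementsByTagName("COUNTRY")[0]
        if country.getAttribute("id") == country_id:
            entry = listing.createElement("ENTRY")
            entry.setAttribute("id", member.documentElement.getAttribute("id"))
            name = text_of(member, "LAST") + ", " + text_of(member, "FIRST")
            entry.appendChild(listing.createTextNode(name))
            root.appendChild(entry)
        member.unlink()  # drop each source DOM before loading the next
    return listing


listing = build_country_listing("us", MEMBER_FILES)
listing_xml = listing.documentElement.toxml()
```

Only the values needed for the listing are copied out, so each member DOM can
be discarded as soon as it has been read.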

When I started this years ago, before I discovered Python, I used to use a
command line XSLT processor (XT) and character entities to manage the whole
network of files. However, there was too much manual maintenance of data
files, so when I started using Python, I automated the maintenance using DOM.

It worked fine until the membership of the association grew, and now I'm
finding that I have to break the operations up and do them separately in
separate runs, because otherwise I run out of memory.

To summarise: I'm doing a lot of DOM manipulations. I cut branches off DOMs,
tack new branches onto DOMs, and generate a lot of DOMs from scratch, and
populate them with the data from other DOMs.

I'll have a closer look at the code for pDomlette, and see if I can hack some
support for importNode into it, which would solve my problem.

Cheers,

Alan.

Alexandre Fayolle wrote:

> Hello,
>
> From the description of your situation, I think that using 4SuiteServer
> could be a great help. I believe 4SS supports XLink. Depending on the use
> you make of XPath, this can be taken care of by the RDF-based indexing
> feature of 4XSLT.
>
> You don't give many details on what your application is made of. However,
> if you're using 4DOM, try switching to pDomlette (or even to cDomlette),
> which are both more lightweight and faster implementations of DOM (though
> not as compliant as 4DOM).