[XML-SIG] PyXML XPath woes

Matt Patterson list-matt at reprocessed.org
Wed Feb 11 09:35:28 EST 2004


On 8 Feb 2004, at 02:49, Thomas B. Passin wrote:

> Matt Patterson wrote:
>> I've got an XML file in which I want to locate all elements with the 
>> attribute boundary set to 'true'. I use the following XPath with 
>> 4DOM:
>> //*[@boundary='true']
>> like so:
>> boundaryFinder = Compile("//*[@boundary='true']")
>> context = Context(self.document)
>> # evaluate the expression and get a nodeList
>> boundaryNodes = boundaryFinder.evaluate(context)
>> But the results of the XPath do not return all the nodes which match!
>
> How many nodes did you get and how many are actually there?

Okay, this is weird: the XPath _is_ returning 47 nodes, as it should. I 
should have checked more closely. The final paginated output contained 
pages with several boundary="true" elements on them, so I panicked and 
assumed that the XPath was to blame.
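
A quick way to rule the XPath in or out is to compare its result count 
against a plain walk over every element. The sketch below is only an 
illustration: it uses the standard library's ElementTree as a stand-in 
rather than 4DOM, and the SAMPLE document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real Frame XML file is not reproduced here.
SAMPLE = """<doc>
  <page boundary="true"><para boundary="true"/><para/></page>
  <page boundary="false"><para boundary="true"/></page>
</doc>"""

root = ET.fromstring(SAMPLE)

# ElementTree's limited XPath: all descendants of root with boundary='true'.
xpath_hits = root.findall(".//*[@boundary='true']")

# Independent check: walk every element and test the attribute by hand.
# root.iter() also visits the root itself, which has no boundary attribute here.
manual_hits = [el for el in root.iter() if el.get("boundary") == "true"]

print(len(xpath_hits), len(manual_hits))
```

If the two counts agree, the expression is matching what it should and 
the problem lies further down the pipeline.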

This seems to have been caused by heinous problems with the 4DOM DOM 
Range implementation: either the range is storing its boundary points 
very strangely, or the range.cloneContents() method is simply bonkers. 
The range causing the problem has start and end points with the same 
parent, and the cloneContents() method is returning all of that 
parent node's children. I had to do some poking, but it seems that when 
the start point of a range is a child of the range's common ancestor 
node and the end point is a grandchild or deeper descendant, the 
cloneContents() method of the range also returns every preceding 
sibling of the topmost node in the end point's ancestor chain, 
including the siblings that come before the start point:

If the start and end-points of a range were set to <start/> and <end/> 
respectively below:

<root>
	<one>
		<A>
			<la/><la/><la/><la/><la/>
			<start/>
			<alpha>
				<un/><un/><un/><un/>
				<end/>
				<ti/><ti/><ti/><ti/>
			</alpha>
		</A>
	</one>
</root>

Then range.cloneContents() would return:

			<la/><la/><la/><la/><la/>
			<start/>
			<alpha>
				<un/><un/><un/><un/>
				<end/>
			</alpha>

Which is clearly bonkers.
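
For comparison, here is a minimal sketch of what I would expect 
instead. It assumes the range was set with setStartBefore(<start/>) and 
setEndAfter(<end/>), which is an assumption about the setup, and it uses 
xml.dom.minidom (which has no Range support) with a hand-rolled helper. 
clone_span() and is_inside() are hypothetical names, not part of 4DOM 
or minidom.

```python
from xml.dom import minidom

# The example document from above, without whitespace so there are no text nodes.
DOC = ("<root><one><A>"
       "<la/><la/><la/><la/><la/><start/>"
       "<alpha><un/><un/><un/><un/><end/><ti/><ti/><ti/><ti/></alpha>"
       "</A></one></root>")

def is_inside(node, ancestor):
    """True if node is ancestor itself or one of its descendants."""
    while node is not None:
        if node is ancestor:
            return True
        node = node.parentNode
    return False

def clone_span(container, start, end, state):
    """Clone what lies between 'before start' and 'after end' under container.

    Fully selected nodes are deep-cloned; nodes that only partly overlap
    the span are shallow-cloned and filled with their selected children.
    """
    clones = []
    for child in container.childNodes:
        if state["done"]:
            break
        if not state["open"]:
            if child is start:
                state["open"] = True          # span begins at this node
            elif is_inside(start, child):
                shell = child.cloneNode(False)  # partially selected ancestor
                for c in clone_span(child, start, end, state):
                    shell.appendChild(c)
                clones.append(shell)
                continue
            else:
                continue                      # sibling before the start: skip
        if child is end:
            clones.append(child.cloneNode(True))
            state["done"] = True              # span ends after this node
        elif is_inside(end, child):
            shell = child.cloneNode(False)    # partially selected ancestor
            for c in clone_span(child, start, end, state):
                shell.appendChild(c)
            clones.append(shell)
        else:
            clones.append(child.cloneNode(True))
    return clones

doc = minidom.parseString(DOC)
start = doc.getElementsByTagName("start")[0]
end = doc.getElementsByTagName("end")[0]
common = start.parentNode  # <A> is the common ancestor in this example

fragment = clone_span(common, start, end, {"open": False, "done": False})
result = "".join(node.toxml() for node in fragment)
print(result)
```

Under those assumptions the five <la/> siblings are left out, because 
they sit before the start point, and <alpha> is only partly cloned.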

> You have an encoding problem with the file you linked to.  It is 
> encoded in iso-8859-1 but with no encoding declaration it is treated 
> as utf-8. Unfortunately there are some non-utf-8 characters in it, so 
> it is not well-formed.  Thus any results you get would be suspect.  In 
> fact, it should not parse successfully at all.

Hmmm. The Frame XML file (and thus its entities) claims to be utf-8, 
and I've had no problems with it. It could be a file-transfer issue, 
I suppose. I'll have to investigate more closely. Thanks for the heads up!
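
One way to check whether a file really is valid utf-8 is to decode the 
raw bytes and see where, if anywhere, decoding fails. This is a rough 
sketch; first_invalid_utf8() is a hypothetical helper and the byte 
strings are made-up samples, not the file discussed here.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that is not valid utf-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# Made-up samples: the same text encoded two different ways.
latin1_bytes = "<p>caf\u00e9</p>".encode("iso-8859-1")
utf8_bytes = "<p>caf\u00e9</p>".encode("utf-8")

print(first_invalid_utf8(latin1_bytes))  # offset of the iso-8859-1 e-acute
print(first_invalid_utf8(utf8_bytes))    # None: decodes cleanly
```

In practice you would read the file in binary mode and pass its 
contents to the helper.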

Thanks for all your help,

Matt


-- 
    Matt Patterson | Typographer
    <matt at emdash.co.uk> | http://www.emdash.co.uk/
    <matt at reprocessed.org> | http://reprocessed.org/



