[XML-SIG] Learning to use elementtree

Tue Apr 1 14:58:18 CEST 2008

On Tue, 2008-04-01 at 08:35 -0400, Doran, Harold wrote:
> David et al:
> 
> Attached is a sample xml file. Below is my python code. I am using
> python 2.5.2 on a Windows XP machine.
> 
> Test.py
> from xml.etree.ElementTree import ElementTree as ET
> 
> # create a new file defined by the user
> f = open('output.txt', 'w')
> 
> et = ET(file='g:\python\ml\out_g3r_b2.xml')
> 
> for statentityref in \
> et.findall('admin/responseanalyses/analysis/analysisdata/statentityref'):
>    for statval in \
> et.findall('admin/responseanalyses/analysis/analysisdata/statentityref/statval'):
>       print >> f, statentityref.attrib['id'], '\t', \
>             statval.attrib['type'], '\t', statval.attrib['value']
>       
> f.close()
> 
> If you run this you will see the output organized almost exactly as I
> need it. But, there is a bug in my program, which I suspect is in the
> order in which I am looping. For example, here is a snippet of output
> from the file output.txt. I've added in some comments so you can see
> where I am struggling.
> 
> 9568 	OmitCount 	0.000000 # This is correct
> 9568 	NotReachedCount 	0.000000 # This is correct
> 9568 	PolyserialCorrelation 	0.602525 # This is correct
> 9568 	AdjustedPolyserial 	0.553564 # This is correct
> 9568 	AverageScore 	0.817348 # This is correct
> 9568 	StdevItemScore 	0.386381 # This is correct
> 9568 	OmitCount 	0.000000 # This is NOT correct
> 9568 	NotReachedCount 	0.000000 # This is NOT correct
> 9568 	PolyserialCorrelation 	0.672088 # This is NOT correct
> 9568 	AdjustedPolyserial 	0.590175 # This is NOT correct
> 9568 	AverageScore 	1.034195 # This is NOT correct
> 9568 	StdevItemScore 	0.926668 # This is NOT correct
> 
> Now, here is what *should* be returned. Note that I have manually
> changed the item id (the number preceding the text) to 9569. The data
> are pulled in correctly, but for some reason I am not looping properly
> to get the correct item ID to line up with its corresponding data.
> 
> 9568 	OmitCount 	0.000000
> 9568 	NotReachedCount 	0.000000
> 9568 	PolyserialCorrelation 	0.602525
> 9568 	AdjustedPolyserial 	0.553564
> 9568 	AverageScore 	0.817348
> 9568 	StdevItemScore 	0.386381
> 9569 	OmitCount 	0.000000    # Note the item ID has been modified
> here and below.
> 9569 	NotReachedCount 	0.000000
> 9569 	PolyserialCorrelation 	0.672088
> 9569 	AdjustedPolyserial 	0.590175
> 9569 	AverageScore 	1.034195
> 9569 	StdevItemScore 	0.926668
> 
> Last, notice the portion of code
> 
> admin/responseanalyses/analysis/analysisdata/statentityref')
> 
> I know this is what to use only because I manually went through the xml
> file to examine its hierarchical structure. I assume this is bad
> practice. Is there a way to examine the parent-child structure of an XML
> file in python so I can see the hierarchical structure?
> 
> Thanks,
> Harold

If you keep looking in your probably massive output file, you'll also
find the same results under 9569, 9567, 9571, and all your other
statentityrefs.  In the following code:

for statentityref in \
et.findall('admin/responseanalyses/analysis/analysisdata/statentityref'):
   for statval in \
et.findall('admin/responseanalyses/analysis/analysisdata/statentityref/statval'):
       print >> f, statentityref.attrib['id'], '\t', statval.attrib['type'], \
            '\t', statval.attrib['value']     

there is nothing limiting statval to within statentityref, so for each
statentityref, you get all the statvals from *every* statentityref.  Try
something like this:

for statentityref in \
et.findall('admin/responseanalyses/analysis/analysisdata/statentityref'):
    for statval in statentityref.findall('statval'):
        print >> f, statentityref.attrib['id'], '\t', \
            statval.attrib['type'], '\t', statval.attrib['value']

Note that now the xpath from which you get statval is limited to
searching within the current statentityref, and takes that statentityref
as its context node.
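
To see the difference on something small, here is a minimal sketch. The
inline XML is a made-up miniature of your file: only the element path and
the id/type/value attributes come from your message, while the <root> name
and the choice of two items with two statvals each are assumptions. It
collects rows in lists instead of writing to a file so the two approaches
can be compared directly.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical miniature of the real file (see the assumptions above).
SAMPLE = """<root><admin><responseanalyses><analysis><analysisdata>
  <statentityref id="9568">
    <statval type="OmitCount" value="0.000000"/>
    <statval type="AverageScore" value="0.817348"/>
  </statentityref>
  <statentityref id="9569">
    <statval type="OmitCount" value="0.000000"/>
    <statval type="AverageScore" value="1.034195"/>
  </statentityref>
</analysisdata></analysis></responseanalyses></admin></root>"""

PATH = 'admin/responseanalyses/analysis/analysisdata/statentityref'
root = fromstring(SAMPLE)


def rows_unscoped(root):
    # Original approach: the inner search starts from the top of the
    # document again, so every id is paired with every statval in the file.
    rows = []
    for ref in root.findall(PATH):
        for statval in root.findall(PATH + '/statval'):
            rows.append((ref.attrib['id'],
                         statval.attrib['type'],
                         statval.attrib['value']))
    return rows


def rows_scoped(root):
    # Scoped approach: the inner search starts at the current
    # statentityref, so only its own statval children are returned.
    rows = []
    for ref in root.findall(PATH):
        for statval in ref.findall('statval'):
            rows.append((ref.attrib['id'],
                         statval.attrib['type'],
                         statval.attrib['value']))
    return rows


for row in rows_scoped(root):
    print('\t'.join(row))
```

With this sample the unscoped version yields eight rows (two ids times four
statvals) and the scoped version yields four, one per statval under its own
id.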

Or, if you want to shorten up your code lines a bit, break out part of
your xpath.

# findall() returns a list, so loop over it rather than calling findall() on it
for analysisdata in et.findall('admin/responseanalyses/analysis/analysisdata'):
    for statentityref in analysisdata.findall('statentityref'):
        for statval in statentityref.findall('statval'):
            do(stuff)
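
As for seeing the parent-child structure without reading the file by hand,
one option is to walk the tree and print an indented outline. This is only a
sketch: outline() is a helper name I made up, and the one-line document is a
hypothetical stand-in for your file. It relies on nothing beyond iterating
over an element to get its direct children.

```python
from xml.etree.ElementTree import fromstring


def outline(elem, depth=0, lines=None):
    # Collect one line per element: indentation shows the nesting depth,
    # followed by the tag and the element's attribute names (sorted).
    if lines is None:
        lines = []
    attrs = ' '.join(sorted(elem.attrib))
    suffix = ' [' + attrs + ']' if attrs else ''
    lines.append('  ' * depth + elem.tag + suffix)
    for child in elem:  # iterating an Element yields its direct children
        outline(child, depth + 1, lines)
    return lines


# Hypothetical stand-in document; the element names below <root> are
# borrowed from the path in this thread, the rest is made up.
doc = fromstring(
    '<root><admin><analysisdata>'
    '<statentityref id="9568">'
    '<statval type="OmitCount" value="0.000000"/>'
    '</statentityref>'
    '</analysisdata></admin></root>'
)

print('\n'.join(outline(doc)))
```

For a real file you would pass in the root element, e.g.
outline(et.getroot()). On a large file the outline will repeat the same
shape many times, so you may want to look at only the first screenful.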

Cheers,
Cliff