[XML-SIG] Learning to use elementtree

Wed Apr 2 20:39:50 CEST 2008

Cliff

This was very helpful, thank you. I have modified the code accordingly
and all is working as expected. I want to make one modification, but
seem to be having some problems with generalization of the code.

The current program operates as follows:

xmlReader.py
from xml.etree.ElementTree import ElementTree as ET

filename = raw_input("Please enter the AM XML file: ")
new_file = raw_input("Save this file as: ")

# create a new file defined by the user
f = open(new_file, 'w')

et = ET(file=filename)

for statentityref in et.findall(
        'admin/responseanalyses/analysis/analysisdata/statentityref'):
    for statval in statentityref.findall('statval'):
        print >> f, statentityref.attrib['id'], '\t', \
            statval.attrib['type'], '\t', statval.attrib['value']

f.close()

This is based on your recommendation and works smoothly. Now, in the xml
file (which I have again attached), there are other statistics nested
inside admin/responseanalyses/analysis/analysisdata/statentityref that I
want in addition to what is already being extracted.

For example (see the XML snippet below), the current program pulls out
the attributes for id = 13963, skips the nested elements below it where
id = 0 or id = 1, and then pulls out the information for id = 13962. My
goal is to extract the information where id = 0 or 1 in addition to the
attributes for id = 13963.

<statentityref id="13963" type="item">
  <statval type="OmitCount" value="0.000000" />
  <statval type="NotReachedCount" value="0.000000" />
  <statval type="PolyserialCorrelation" value="0.496309" />
  <statval type="AdjustedPolyserial" value="0.452588" />
  <statval type="AverageScore" value="0.981667" se="0.003874" />
  <statval type="StdevItemScore" value="0.134154" />
  <statentityref id="0.000000" type="itemscorept">
    <statval type="UncollapsedMeanScore" value="23.863636" se="2.039014" />
    <statval type="ScorePtPct" value="0.018333" se="0.003874" />
    <statval type="ScorePtBiserial" value="-0.496309" />
    <statval type="ScorePtAdjBiserial" value="-0.452588" />
  </statentityref>
  <statentityref id="1.000000" type="itemscorept">
    <statval type="UncollapsedMeanScore" value="34.941426" se="0.256340" />
    <statval type="ScorePtPct" value="0.981667" se="0.003874" />
    <statval type="ScorePtBiserial" value="0.496309" />
    <statval type="ScorePtAdjBiserial" value="0.452588" />
  </statentityref>
  <statentityref id="omit" type="itemscorept">
    <statval type="ScorePtPct" value="0.000000" />
    <statval type="ScorePtBiserial" value="-99999.990000" />
    <statval type="ScorePtAdjBiserial" value="-99999.990000" />
  </statentityref>
</statentityref>
<statentityref id="13962" type="item">
  <statval type="OmitCount" value="0.000000" />
  <statval type="NotReachedCount" value="0.000000" />
  <statval type="PolyserialCorrelation" value="0.484469" />
  <statval type="AdjustedPolyserial" value="0.425165" />
  <statval type="AverageScore" value="0.743333" se="0.012614" />
  <statval type="StdevItemScore" value="0.436794" />
  <statentityref id="0.000000" type="itemscorept">
    <statval type="UncollapsedMeanScore" value="29.305195" se="0.512618" />
    <statval type="ScorePtPct" value="0.256667" se="0.012614" />
    <statval type="ScorePtBiserial" value="-0.484469" />
    <statval type="ScorePtAdjBiserial" value="-0.425165" />
  </statentityref>
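
A likely reason the nested entries are skipped: in ElementTree, a bare tag
name passed to findall() matches only direct children of the element it
is called on, so statentityref.findall('statval') never descends into the
inner statentityref elements. The sketch below illustrates this on a
trimmed-down, hypothetical sample modelled on the snippet above; the
names SAMPLE, top_ids, direct and nested are made up for the
illustration, and the bare analysisdata root is an assumption, not the
layout of the real file.

```python
from xml.etree import ElementTree

# Hypothetical sample, cut down from the snippet in this message.
SAMPLE = """
<analysisdata>
  <statentityref id="13963" type="item">
    <statval type="OmitCount" value="0.000000" />
    <statentityref id="0.000000" type="itemscorept">
      <statval type="ScorePtPct" value="0.018333" se="0.003874" />
    </statentityref>
  </statentityref>
</analysisdata>
"""

root = ElementTree.fromstring(SAMPLE)

# Only the outer statentityref is a direct child of analysisdata.
top_ids = [e.attrib['id'] for e in root.findall('statentityref')]

item = root.find('statentityref')

# A bare tag name matches direct children of the context element only.
direct = [sv.attrib['type'] for sv in item.findall('statval')]

# One extra path step reaches the statvals of the nested statentityrefs.
nested = [sv.attrib['type'] for sv in item.findall('statentityref/statval')]
```

Here direct holds only the outer statistics, while nested holds the ones
that belong to the inner statentityref elements.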

The current output from xmlReader.py using the attached xml file looks
like this (for these two IDs)

13963 	OmitCount 	0.000000
13963 	NotReachedCount 	0.000000
13963 	PolyserialCorrelation 	0.496309
13963 	AdjustedPolyserial 	0.452588
13963 	AverageScore 	0.981667
13963 	StdevItemScore 	0.134154
13962 	OmitCount 	0.000000
13962 	NotReachedCount 	0.000000
13962 	PolyserialCorrelation 	0.484469
13962 	AdjustedPolyserial 	0.425165
13962 	AverageScore 	0.743333
13962 	StdevItemScore 	0.436794

What I would like, in addition to what is already extracted, would be
something like:

# This is already provided
13963 	OmitCount 	0.000000
13963 	NotReachedCount 	0.000000
13963 	PolyserialCorrelation 	0.496309
13963 	AdjustedPolyserial 	0.452588
13963 	AverageScore 	0.981667
13963 	StdevItemScore 	0.134154

# This is the info nested in id=13963 and would be new
# Note the dash 0 or 1 depending on which attribute provides the info

13963-0 UncollapsedMeanScore 23.863636 2.039014
13963-0 ScorePtPct 0.018333 0.003874

...

13963-1 UncollapsedMeanScore 34.941426 0.256340
13963-1 ScorePtPct 0.981667 0.003874

and so on for all items. My modifications to the code result in no
output at all, so after quite a few failed attempts I would appreciate
any advice on this.

Thanks.
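
One way the desired rows might be produced is a small recursive helper
that labels each nested statentityref with its parent's id. This is only
a sketch under stated assumptions: short_id and stat_rows are
hypothetical names, the rule that turns "0.000000" into "0" is inferred
from the desired output above, non-numeric ids such as "omit" are kept
as they are, and the "se" column is added to nested rows only so that
the top-level rows stay as they are now. It avoids the print statement,
so it should behave the same on Python 2.5 and on newer versions.

```python
from xml.etree import ElementTree


def short_id(raw_id):
    """Turn '0.000000' into '0'; leave non-numeric ids such as 'omit' alone."""
    try:
        number = float(raw_id)
        if number == int(number):
            return str(int(number))
    except (ValueError, OverflowError):
        pass
    return raw_id


def stat_rows(entity, prefix=None):
    """Return tab-separated rows for entity and for statentityrefs nested in it."""
    label = short_id(entity.attrib['id'])
    if prefix is not None:
        label = prefix + '-' + label
    rows = []
    for statval in entity.findall('statval'):
        fields = [label, statval.attrib['type'], statval.attrib['value']]
        # Assumption: the standard error is wanted on nested rows only.
        if prefix is not None and 'se' in statval.attrib:
            fields.append(statval.attrib['se'])
        rows.append('\t'.join(fields))
    # Recurse into nested statentityrefs, passing the current label down.
    for child in entity.findall('statentityref'):
        rows.extend(stat_rows(child, label))
    return rows


# Hypothetical sample, cut down from the snippet in this message.
SAMPLE = """
<analysisdata>
  <statentityref id="13963" type="item">
    <statval type="AverageScore" value="0.981667" se="0.003874" />
    <statentityref id="0.000000" type="itemscorept">
      <statval type="UncollapsedMeanScore" value="23.863636" se="2.039014" />
    </statentityref>
    <statentityref id="omit" type="itemscorept">
      <statval type="ScorePtPct" value="0.000000" />
    </statentityref>
  </statentityref>
</analysisdata>
"""

root = ElementTree.fromstring(SAMPLE)
lines = []
for entity in root.findall('statentityref'):
    lines.extend(stat_rows(entity))
```

In xmlReader.py the outer loop would presumably stay as it is, with each
row from stat_rows(statentityref) written out via f.write(row + '\n').
Note that the rows here are joined with plain tabs, whereas the print
statement in the current program also puts a space on either side of
each tab.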

> -----Original Message-----
> From: J. Cliff Dyer [mailto:jcd at unc.edu] 
> Sent: Tuesday, April 01, 2008 8:58 AM
> To: Doran, Harold
> Cc: xml-sig at python.org
> Subject: Re: [XML-SIG] Learning to use elementtree
> 
> On Tue, 2008-04-01 at 08:35 -0400, Doran, Harold wrote:
> > David et al:
> > 
> > Attached is a sample xml file. Below is my python code. I am using 
> > python 2.5.2 on a Windows XP machine.
> > 
> > Test.py
> > from xml.etree.ElementTree import ElementTree as ET
> > 
> > # create a new file defined by the user
> > f = open('output.txt', 'w')
> > 
> > et = ET(file='g:\python\ml\out_g3r_b2.xml')
> > 
> > for statentityref in et.findall(
> >         'admin/responseanalyses/analysis/analysisdata/statentityref'):
> >    for statval in et.findall(
> >            'admin/responseanalyses/analysis/analysisdata/statentityref/statval'):
> >       print >> f, statentityref.attrib['id'], '\t', \
> >             statval.attrib['type'], '\t', statval.attrib['value']
> >       
> > f.close()
> > 
> > If you run this you will see the output organized almost exactly as I
> > need it. But there is a bug in my program, which I suspect is in the
> > order in which I am looping. For example, here is a snippet of output
> > from the file output.txt. I've added in some comments so you can see
> > where I am struggling.
> > 
> > 9568 	OmitCount 	0.000000 # This is correct
> > 9568 	NotReachedCount 	0.000000 # This is correct
> > 9568 	PolyserialCorrelation 	0.602525 # This is correct
> > 9568 	AdjustedPolyserial 	0.553564 # This is correct
> > 9568 	AverageScore 	0.817348 # This is correct
> > 9568 	StdevItemScore 	0.386381 # This is correct
> > 9568 	OmitCount 	0.000000 # This is NOT correct
> > 9568 	NotReachedCount 	0.000000 # This is NOT correct
> > 9568 	PolyserialCorrelation 	0.672088 # This is NOT correct
> > 9568 	AdjustedPolyserial 	0.590175 # This is NOT correct
> > 9568 	AverageScore 	1.034195 # This is NOT correct
> > 9568 	StdevItemScore 	0.926668 # This is NOT correct
> > 
> > Now, here is what *should* be returned. Note that I have manually
> > changed the item id (the number preceding the text) to 9569. The data
> > are pulled in correctly, but for some reason I am not looping properly
> > to get the correct item ID to line up with its corresponding data.
> > 
> > 9568 	OmitCount 	0.000000
> > 9568 	NotReachedCount 	0.000000
> > 9568 	PolyserialCorrelation 	0.602525
> > 9568 	AdjustedPolyserial 	0.553564
> > 9568 	AverageScore 	0.817348
> > 9568 	StdevItemScore 	0.386381
> > 9569 	OmitCount 	0.000000    # Note the item ID has been modified here and below.
> > 9569 	NotReachedCount 	0.000000
> > 9569 	PolyserialCorrelation 	0.672088
> > 9569 	AdjustedPolyserial 	0.590175
> > 9569 	AverageScore 	1.034195
> > 9569 	StdevItemScore 	0.926668
> > 
> > Last, notice the portion of code
> > 
> > admin/responseanalyses/analysis/analysisdata/statentityref')
> > 
> > I know this is what to use only because I manually went through the
> > xml file to examine its hierarchical structure. I assume this is bad
> > practice. Is there a way to examine the parent-child structure of an
> > XML file in python so I can see the hierarchical structure?
> > 
> > Thanks,
> > Harold
> 
> If you keep looking in your probably massive output file, 
> you'll also find the same results under 9569, 9567, 9571, and 
> all your other statentityrefs.  In the following code:
> 
> for statentityref in \
> et.findall('admin/responseanalyses/analysis/analysisdata/statentityref'):
>    for statval in \
> et.findall('admin/responseanalyses/analysis/analysisdata/statentityref/statval'):
>        print >> f, statentityref.attrib['id'], '\t', \
>             statval.attrib['type'], '\t', statval.attrib['value']
> 
> there is nothing limiting statval to within statentityref, so 
> for each statentityref, you get all the statvals from *every* 
> statentityref.  Try something like this:
> 
> for statentityref in \
> et.findall('admin/responseanalyses/analysis/analysisdata/statentityref'):
>     for statval in statentityref.findall('statval'):
>         do(stuff)
> 
> Note that now the xpath from which you get statval is limited 
> to searching within the current statentityref, and takes that 
> statentityref as its context node.
> 
> Or, if you want to shorten up your code lines a bit, break 
> out part of your xpath.
> 
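> # find() returns a single element; findall() would return a list,
> # which has no findall() method of its own.
> analysisdata = et.find('admin/responseanalyses/analysis/analysisdata')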
> for statentityref in analysisdata.findall('statentityref'):
>     for statval in statentityref.findall('statval'):
>         do(stuff)
> 
> 
> Cheers,
> Cliff
> 
> 
> 
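
On the question quoted above about examining the parent-child structure
of an XML file from Python: one possible approach, sketched here rather
than taken from the thread, is to walk the tree recursively and collect
an indented outline of tag names. The helper name outline, the SAMPLE
document and the choice to list attribute names in brackets are all
assumptions made for the illustration.

```python
from xml.etree import ElementTree


def outline(element, depth=0, lines=None):
    """Collect one indented line per element: its tag plus attribute names."""
    if lines is None:
        lines = []
    attrs = ' '.join(sorted(element.attrib))
    suffix = ' [' + attrs + ']' if attrs else ''
    lines.append('  ' * depth + element.tag + suffix)
    # Iterating over an element yields its direct children in document order.
    for child in element:
        outline(child, depth + 1, lines)
    return lines


# Hypothetical sample; a real file could be loaded with
# ElementTree.parse(filename).getroot() instead.
SAMPLE = """
<analysisdata>
  <statentityref id="9568" type="item">
    <statval type="OmitCount" value="0.000000" />
  </statentityref>
</analysisdata>
"""

structure = outline(ElementTree.fromstring(SAMPLE))
```

Joining structure with newlines gives a quick picture of the nesting,
which can then be used to work out a path to pass to findall(). On a
large file the outline has one line per element, so it may be worth
looking only at the first few dozen lines.
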
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out_g4r_b.xml
Type: text/xml
Size: 66713 bytes
Desc: out_g4r_b.xml
Url : http://mail.python.org/pipermail/xml-sig/attachments/20080402/ffd03348/attachment-0001.bin