lxml: traverse xml tree and retrieve element based on an attribute

Thu May 21 20:12:37 EDT 2009

On May 21, 6:57 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
> byron wrote:
> > I am using the lxml.etree library to validate an xml instance file
> > with a specified schema that contains the data types of each element.
> > This is some of the internals of a function that extracts the
> > elements:
>
> >         schema_doc = etree.parse(schema_fn)
> >         schema = etree.XMLSchema(schema_doc)
>
> >         context = etree.iterparse(xml_fn, events=('start', 'end'),
> >                                   schema=schema)
>
> >         # get root
> >         event, root = next(context)
>
> >         for event, elem in context:
> >             if event == 'end' and elem.tag == self.tag:
> >                 yield elem
> >             root.clear()
>
> > I retrieve a list of elements from this... and do further processing
> > to represent them in different ways. I need to be able to capture the
> > data type from the schema definition for each field in the element.
> > i.e.
>
> >     <xsd:element name="concept">
> >         <xsd:complexType>
> >             <xsd:sequence>
> >                 <xsd:element ref="foo"/>
> >                 <xsd:element name="concept_id" type="xsd:string"/>
> >                 <xsd:element name="line" type="xsd:integer"/>
> >                 <xsd:element name="concept_value" type="xsd:string"/>
> >                 <xsd:element ref="some_date"/>
> >             </xsd:sequence>
> >         </xsd:complexType>
> >     </xsd:element>
>
> > My thought is to recursively traverse the schema definition, match
> > on the `name` attribute (each name is unique to a `type`), and
> > return that element. But I can't seem to make it quite work. All the
> > xml is valid, validation works, etc. This is what I have:
>
> >     def find_node(tree, name):
> >         for c in tree:
> >             if c.attrib.get('name') == name:
> >                 return c
> >             if len(c) > 0:
> >                 return find_node(c, name)
> >         return 0
>
> You're searching the first child and then returning the result, but what
> you're looking for might not be in the first child; if it's not then you
> need to search the next child:
>
>      def find_node(tree, name):
>          for c in tree:
>              if c.attrib.get('name') == name:
>                  return c
>              if len(c) > 0:
>                  r = find_node(c, name)
>                  if r is not None:
>                      return r
>          return None
>
> > I may have been staring at this too long, but when something is
> > returned... it should be returned completely, no? This is what occurs
> > with `return find_node(c, name)` if it returns 0. `return c` works
> > (used pdb to verify that), but the recursion continues and ends up
> > returning 0.
>
> > Thoughts and/or a different approach are welcome. Thanks
>
>

Thanks. Yes, I tried something like this, but I think I overwrote `c`
when I wrote it, as in:

    if len(c) > 0:
        c = find_node(c, name)
        if c is not None:
            return c

Thanks for your help.
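
For anyone reading the archive, here is a minimal, self-contained sketch of the search discussed in this thread. It is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (the calls used here, iteration over children, `get()` and `iter()`, exist in both), the schema snippet is a cut-down version of the one quoted above with a made-up top-level `foo` element, and `find_node_iter` is a hypothetical name for a non-recursive variant. The key points are to keep looping when a subtree yields nothing, and to compare the result with None, because an element without children has traditionally tested as false.

```python
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers the same calls used here

# Cut-down schema for illustration; the top-level "foo" element is an assumption.
XSD = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="foo" type="xsd:string"/>
  <xsd:element name="concept">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="foo"/>
        <xsd:element name="concept_id" type="xsd:string"/>
        <xsd:element name="line" type="xsd:integer"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""


def find_node(tree, name):
    """Depth-first search for the first descendant whose `name` attribute matches."""
    for child in tree:
        if child.get('name') == name:
            return child
        found = find_node(child, name)
        # Only stop when this subtree produced a match; otherwise try the next child.
        # Compare with None rather than relying on the element's truth value.
        if found is not None:
            return found
    return None


def find_node_iter(tree, name):
    """Same lookup without explicit recursion, using Element.iter() (document order)."""
    for elem in tree.iter():
        # iter() also yields `tree` itself, which the recursive version never checks.
        if elem is not tree and elem.get('name') == name:
            return elem
    return None


schema_root = ET.fromstring(XSD)
line = find_node(schema_root, 'line')
print(line.get('type'))  # prints "xsd:integer"
```

Both functions return the element itself, so the data type is then just `elem.get('type')`. Elements declared with `ref=` carry no `name` attribute and would need a second lookup of the referenced declaration.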