Getting elements and text with lxml

John Machin sjmachin at lexicon.net
Sat May 17 05:12:40 EDT 2008


J. Pablo Fernández wrote:
> On May 17, 2:19 am, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
>> On Fri, 16 May 2008 18:53:03 -0300, J. Pablo Fernández <pup... at pupeno.com>
>> wrote:
>>
>>
>>
>>> Hello,
>>> I have an XML file that starts with:
>>> <vortaro>
>>> <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
>>> <kap>
>>>   <ofc>*</ofc>-<rad>a</rad>
>>> </kap>
>>> out of it, I'd like to extract something like (I'm just showing one
>>> structure, any structure as long as all data is there is fine):
>>> [("ofc", "*"), "-", ("rad", "a")]
>>> How can I do it? I managed to get the content of both tags and the
>>> text up to the first tag ("\n   "), but not the - (and in other XML
>>> files, there's more text outside the elements).
>> Look for the "tail" attribute.
> 
> That gives me the last part, but not the one in the middle:
> 
> In : etree.tounicode(e)
> Out: u'<kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap>\n'
> 
> In : e.text
> Out: '\n  '
> 
> In : e.tail
> Out: '\n'
>

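The "-" is not in e.tail: an element's tail is the text that follows its
own end tag, so the "-" is the tail of the <ofc> child. To get everything
you need the text and tail of each of your initial element's children,
which in turn needs those of their children, and so on.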

See http://effbot.org/zone/element-bits-and-pieces.htm
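
A minimal sketch of that recursive walk, in case it helps. The function
name flatten is just an illustration, not something from lxml or from the
page above. It uses the standard library's xml.etree.ElementTree so it runs
without extra installs; lxml.etree elements expose the same .tag, .text and
.tail attributes, so I'd expect the same function to work there for
ordinary elements (comments and processing instructions are not handled).

```python
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def flatten(elem):
    """Return elem's content as a list of strings and (tag, content) tuples.

    Hypothetical helper: text before the first child comes from elem.text,
    and text after each child comes from that child's .tail.
    """
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        # Recurse so nested children keep their own text and tails.
        parts.append((child.tag, flatten(child)))
        if child.tail:
            parts.append(child.tail)
    return parts


kap = ET.fromstring('<kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap>')
result = flatten(kap)
print(result)
# ['\n  ', ('ofc', ['*']), '-', ('rad', ['a']), '\n']
```

The whitespace-only strings are kept on purpose so no data is lost; strip
or filter them if you don't want them. If you only need the text and not
the tags, ''.join(kap.itertext()) is a shorter route.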

HTH,
John




