converting text and spans to an ElementTree

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Tue May 22 04:51:43 EDT 2007


On Tue, 22 May 2007 03:02:34 -0300, Steven Bethard
<steven.bethard at gmail.com> wrote:

> I have some text and a list of Element objects and their offsets, e.g.::
>
>      >>> text = 'aaa aaa aaabbb bbbaaa'
>      >>> spans = [
>      ...     (etree.Element('a'), 0, 21),
>      ...     (etree.Element('b'), 11, 18),
>      ...     (etree.Element('c'), 18, 18),
>      ... ]
>
> I'd like to produce the corresponding ElementTree. So I want to write a
> get_tree() function that works like::
>
>      >>> tree = get_tree(text, spans)
>      >>> etree.tostring(tree)
>      '<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
>
> Perhaps I just need some more sleep, but I can't see an obvious way to
> do this. Any suggestions?

I need *some* sleep, but the idea would be as follows:

- For each span generate two tuples: (start_offset, 1, -end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
a single tuple (start_offset, -1, start_offset, element).
- Collect all tuples in a list and sort them. The tuples are built so that
when one element ends and another starts at the same offset, the ending
element comes first (because it has a 0). Of the elements that end at a
given point, the shortest comes first; of those that start at a given
point, the longest comes first (hence the negated end_offset), so an
outer element is opened before the ones it contains.
- Initialize an empty list (it will be used as a stack of containers) and
create a root Element as your "current container" (CC). A variable
"last_used" keeps the last position of the text already consumed,
starting at 0.
- For each tuple in the sorted list:
   - if the second item is a 1, an element is starting. The text before it,
text[last_used:start_offset], belongs to the CC (as its text if the CC has
no children yet, otherwise as the tail of its last child); move last_used
to start_offset. Then insert the new element into the CC, push the CC onto
the stack, and set the new element as the new CC.
   - if the second item is a 0, an element is ending. The remaining text,
text[last_used:end_offset], belongs to the CC in the same way; move
last_used up to end_offset. Then discard the CC and pop an element from
the stack as the new CC.
   - if the second item is a -1, it's an element with no content. Handle the
text before it as in the first case, then insert it into the CC element.
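
The steps above could be sketched roughly as follows. This is only one
possible reading of the idea, not tested beyond the example in the question.
Assumptions of the sketch: start tuples carry -end_offset so that the longer
of two spans starting together opens first; an extra "order" field and a sort
key keep the sort away from the Element objects themselves (Python 3 cannot
compare them); the helper flush() and the wrapper tag 'root' are made-up
names; the function returns the wrapper, so the tree asked for is its first
child.

```python
import xml.etree.ElementTree as etree


def get_tree(text, spans):
    """Build a tree from text and (element, start, end) spans.

    Spans are assumed to nest properly. Returns a wrapper element.
    """
    events = []
    for order, (elem, start, end) in enumerate(spans):
        if start == end:
            # Empty element: -1 sorts before ends (0) and starts (1).
            events.append((start, -1, start, order, elem))
        else:
            # Start: the longest span opens first (hence -end).
            events.append((start, 1, -end, order, elem))
            # End: the shortest span closes first (hence -start).
            events.append((end, 0, -start, -order, elem))
    # Sort on the numeric fields only, never on the Elements.
    events.sort(key=lambda ev: ev[:4])

    root = etree.Element('root')  # hypothetical wrapper tag
    current = root                # the "current container" (CC)
    stack = []
    last_used = 0

    def flush(upto):
        # Give text[last_used:upto] to the CC: as .text if it has no
        # children yet, otherwise as the .tail of its last child.
        nonlocal last_used
        chunk = text[last_used:upto]
        if chunk:
            if len(current):
                current[-1].tail = (current[-1].tail or '') + chunk
            else:
                current.text = (current.text or '') + chunk
        last_used = upto

    for offset, flag, _, _, elem in events:
        flush(offset)
        if flag == 1:        # element starting
            current.append(elem)
            stack.append(current)
            current = elem
        elif flag == 0:      # element ending
            current = stack.pop()
        else:                # element with no content
            current.append(elem)
    flush(len(text))         # any text after the last span
    return root


text = 'aaa aaa aaabbb bbbaaa'
spans = [
    (etree.Element('a'), 0, 21),
    (etree.Element('b'), 11, 18),
    (etree.Element('c'), 18, 18),
]
tree = get_tree(text, spans)[0]
print(etree.tostring(tree, encoding='unicode'))
# -> <a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>
```

Returning the wrapper rather than its first child keeps the sketch usable
when some text falls outside every span or when there is more than one
top-level span.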

You can play with the way the tuples are generated and sorted, to get  
'<a>aaa aaa aaa<b>bbb bbb</b><c />aaa</a>' instead.
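
As a small illustration of that last point, the sketch below only shows the
order in which the events come out of the sort, using plain names instead of
Element objects. The function event_order() and its empty_flag parameter are
made up for the example, and the start tuples are assumed to carry
-end_offset. With -1 the empty element sorts before the end of 'b' (so it
lands inside 'b'); with a value between 0 and 1, such as 0.5, it sorts after
that end (so it lands after 'b').

```python
def event_order(spans, empty_flag):
    """Return event labels in sorted order; spans are (name, start, end)."""
    events = []
    for name, start, end in spans:
        if start == end:
            events.append((start, empty_flag, start, 'empty ' + name))
        else:
            events.append((start, 1, -end, 'open ' + name))
            events.append((end, 0, -start, 'close ' + name))
    return [label for _, _, _, label in sorted(events)]


spans = [('a', 0, 21), ('b', 11, 18), ('c', 18, 18)]

print(event_order(spans, -1))
# -> ['open a', 'open b', 'empty c', 'close b', 'close a']
print(event_order(spans, 0.5))
# -> ['open a', 'open b', 'close b', 'empty c', 'close a']
```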

-- 
Gabriel Genellina



