converting text and spans to an ElementTree
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Tue May 22 04:51:43 EDT 2007
On Tue, 22 May 2007 03:02:34 -0300, Steven Bethard
<steven.bethard at gmail.com> wrote:
> I have some text and a list of Element objects and their offsets, e.g.::
>
> >>> text = 'aaa aaa aaabbb bbbaaa'
> >>> spans = [
> ... (etree.Element('a'), 0, 21),
> ... (etree.Element('b'), 11, 18),
> ... (etree.Element('c'), 18, 18),
> ... ]
>
> I'd like to produce the corresponding ElementTree. So I want to write a
> get_tree() function that works like::
>
> >>> tree = get_tree(text, spans)
> >>> etree.tostring(tree)
> '<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
>
> Perhaps I just need some more sleep, but I can't see an obvious way to
> do this. Any suggestions?
I need *some* sleep, but the idea would be as follows:
- For each span generate two tuples: (start_offset, 1, -end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
(start_offset, -1, start_offset, element).
- Collect all tuples in a list, and sort them. The tuple is made so that
when at a given offset there is an element that ends and another that
starts, the ending element comes first (because it has a 0). For all the
elements that end at a given point, the shortest comes first. For all the
elements that start at a given point, the longest comes first (that is why
the end offset is negated), so the outer element is opened before the
inner one.
- Initialize an empty list (will be used as a stack of containers), and
create a root Element as your "current container" (CC). The variable
"last_used" will keep the last position used in the text; it starts at 0.
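- For each tuple in the sorted list:
  - if the second item is a 1, an element is starting. The pending text is
    text[last_used:start_offset] and it belongs to the CC: store it as the
    CC's .text if the CC has no children yet, otherwise as the .tail of its
    last child. Move last_used to start_offset. Then insert the new element
    into the CC, push the CC onto the stack, and set the new element as the
    new CC.
  - if the second item is a 0, an element is ending. The pending text is
    text[last_used:end_offset]; store it in the CC (the element that is
    ending) in the same way, and move last_used up to end_offset. Then
    discard the CC and pop an element from the stack as the new CC.
  - if the second item is a -1, it's an element with no content. Store the
    pending text text[last_used:start_offset] in the CC as above, move
    last_used to start_offset, and insert the element into the CC.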
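A minimal sketch of the steps above, in case it helps. It is untested beyond the example from the question, and a few things are my own assumptions rather than part of the description: the helper name _append_text, the dummy 'root' wrapper, sorting on the first three tuple items only (so Element objects are never compared), negating the end offset in the "start" tuples so the longer of two elements starting at the same offset opens first, and returning the first child of the wrapper (which assumes one span covers the whole text, as in the example).

```python
import xml.etree.ElementTree as etree

START, END, EMPTY = 1, 0, -1  # second item of each tuple


def _append_text(container, chunk):
    """Store chunk as container.text, or as the tail of its last child."""
    if not chunk:
        return
    if len(container):  # container already has children
        last = container[-1]
        last.tail = (last.tail or '') + chunk
    else:
        container.text = (container.text or '') + chunk


def get_tree(text, spans):
    events = []
    for elem, start, end in spans:
        if start == end:
            events.append((start, EMPTY, start, elem))
        else:
            events.append((start, START, -end, elem))
            events.append((end, END, -start, elem))
    # Sort on the numeric items only; Elements are not orderable.
    events.sort(key=lambda ev: ev[:3])

    root = etree.Element('root')  # dummy outer container (assumption)
    stack = []                    # stack of enclosing containers
    cc = root                     # current container
    last_used = 0
    for offset, kind, _, elem in events:
        # Flush the text seen since the previous event into the CC.
        _append_text(cc, text[last_used:offset])
        last_used = offset
        if kind == START:
            cc.append(elem)
            stack.append(cc)
            cc = elem
        elif kind == END:
            cc = stack.pop()
        else:  # EMPTY: element with no content
            cc.append(elem)
    _append_text(cc, text[last_used:])
    # Assumes a single span covers the whole text, as in the example.
    return root[0]


text = 'aaa aaa aaabbb bbbaaa'
spans = [
    (etree.Element('a'), 0, 21),
    (etree.Element('b'), 11, 18),
    (etree.Element('c'), 18, 18),
]
tree = get_tree(text, spans)
# encoding='unicode' makes tostring return str on Python 3.
print(etree.tostring(tree, encoding='unicode'))
# prints: <a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>
```

If the spans do not share a single outermost element, return root itself instead of root[0].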
You can play with the way the tuples are generated and sorted, to get
'<a>aaa aaa aaa<b>bbb bbb</b><c />aaa</a>' instead.
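One possible way to do that, shown only as a sketch: change the second item used for empty elements so they sort after the ending elements instead of before them. The function event_order and the value 0.5 are made up for this illustration, and element names stand in for the Element objects so the tuples can be sorted directly.

```python
def event_order(spans, empty_rank):
    """Return (offset, name) pairs in the order the tuples would be processed."""
    events = []
    for name, start, end in spans:
        if start == end:
            events.append((start, empty_rank, start, name))
        else:
            # Start tuples use -end so that the longer element opens first
            # (an assumption of this sketch).
            events.append((start, 1, -end, name))
            events.append((end, 0, -start, name))
    return [(offset, name) for offset, _, _, name in sorted(events)]


spans = [('a', 0, 21), ('b', 11, 18), ('c', 18, 18)]

# Empty elements rank -1: 'c' is handled before 'b' ends, so it lands inside <b>.
print(event_order(spans, -1))
# prints: [(0, 'a'), (11, 'b'), (18, 'c'), (18, 'b'), (21, 'a')]

# Empty elements rank 0.5 (between "end" and "start"): 'c' follows </b>.
print(event_order(spans, 0.5))
# prints: [(0, 'a'), (11, 'b'), (18, 'b'), (18, 'c'), (21, 'a')]
```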
--
Gabriel Genellina