[Tutor] mxTextTools help: find the deepest in nest lists

pan at uchicago.edu pan at uchicago.edu
Wed Dec 10 16:41:27 EST 2003


Hi Danny,

I did try the re approach earlier, but I got an "maximum recursion" 
error when the list gets complicated. I guess it is not good for 
large nested list. 

So I turned to mxTextTools, which I failed to finish learning it 
MANY MANY times before. But this time I decided to twist my brain 
a lot further to see if I can finally learn how it works. 

I am so interested because first of all it was said to be very fast,
which might be critical if I wanna make some programs to work in 
gene/genome analysis; and secondly, it uses the state machine concept, 
which is quite alian for a people like me with major in biology. I can 
sense it is powerful in the future application on our field, but i need 
to learn before I can decide how right (or wrong) I am.

The tagtable concept is quite different from re. Basically it is a
collection of 'states'. When parsing, the core engine reads through
the target text, check if the read text match a state. Defined in 
each state is a decision tree, which directs which state will become 
the next state when matched and which when failed. 

The mxTextTools itself is quite a briliant work, but the 
documentation --- IMHO ---- doesn't mean to let someone who doesn't 
already know about the concept understand. So it is almost impossible
to learn, especially for a beginner like me. 

And, so far, I can't find any tutorial detailed enough. Provided in
the chapter 4 of David Mertz's 'Text processing in Python' is a much
better documentation but a brief glance seems to get me an impression that 
the example given is too complicated for a beginner. But, this might change
after I have time to go over the entire mxTextTools part back and forth 
several times.

Anyway, thx for the help. If in the end I finally "twist my head" 
successfully and learn how it works, I might write a tutorial on
it.

pan


Quoting Danny Yoo <dyoo at hkn.eecs.berkeley.edu>:

> 
> 
> On Tue, 9 Dec 2003 pan at uchicago.edu wrote:
> 
> > I am learning how to use mxTextTools. I need help with
> > the following ...
> >
> > I wanna match the deepest lists in a nested list tree.
> > For example:
> >
> > The [2, 2a] and [3, 3b] in :
> >
> >    [[[1,[2, 2a]],[[3,3b],4]],6]
> >
> > The [a.1,b_2] and [d.3,e] in :
> >
> >    [[a.1,b_2],[[c,[d.3,e]],f]]
> >
> > The [a1,a2], [d.1, d.2] and [o3,o4] in :
> >
> >    [[[[[a1,a2],out],[d.1, d.2]],o2],[[o3,o4],o5]]
> >
> > How can I setup a tagtable for matching them ??
> 
> 
> Hi Pan,
> 
> I'm not familiar enough with mxTextTools's tag table stuff.  Just to make
> sure though: are you sure that a text-handling approach is best for this
> problem?  Is your input a originally a data structure, or a string?
> 
> 
> One way to approach this with regular expressions might be something like
> this:
> 
> ###
> >>> import re
> >>> def find_innermost_nodes(tree_string):
> ...     regex = re.compile(r"""\[        ## literal open brace
> ...                            [^\[]+    ## one or more non-brace
> ...                                      ## characters
> ...                            \]        ## literal close brace
> ...                       """, re.VERBOSE)
> ...     return regex.findall(tree_string)
> ...
> >>> find_innermost_nodes("[[[1,[2, 2a]],[[3,3b],4]],6]")
> ['[2, 2a]]', '[3,3b],4]],6]']
> ###
> 
> 
> 
> Ooops!  Forgot to make the search nongreedy:
> 
> ###
> >>> def find_innermost_nodes(tree_string):
> ...     regex = re.compile(r"""\[        ## literal open brace
> ...                            [^\[]+?   ## one or more non-brace
> ...                                      ## characters, nongreedily
> ...                            \]        ## literal close brace
> ...                       """, re.VERBOSE)
> ...     return regex.findall(tree_string)
> ...
> >>> find_innermost_nodes("[[[1,[2, 2a]],[[3,3b],4]],6]")
> ['[2, 2a]', '[3,3b]']
> ###
> 
> 
> That's better.  *grin* Aren't the tag tables similar to regular
> expressions?
> 
> 
> 
> 
> That being said, I think this approach is approaching the problem in a way
> that doesn't generalize well.  If we had some tree structure, I'd approach
> this with something like:
> 
> ###
> def is_innermost_node(node):
>     if is_leaf(node): return false
>     return is_leaf(left(node)) and is_leaf(right(node))
> 
> def find_innermost_nodes(node):
>     if is_leaf(node):
>         return []
>     if is_innermost_node(node):
>         return [node]
>     return (find_innermost_nodes(left(node)) +
>             find_innermost_nodes(right(node)))
> ###
> 
> assuming that we have some functions for seeing if a node is a leaf
> (is_leaf()), and well as ways of getting the left() and right() children
> of a tree node.
> 
> 
> Good luck to you!
> 





More information about the Tutor mailing list