grouping subsequences with BIO tags

Fri Apr 22 01:03:07 EDT 2005

On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <steven.bethard at gmail.com> wrote:

>I have a list of strings that looks something like:
>     ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>I need to group the strings into runs (lists) using the following rules 
>based on the string prefix:
>     'O' is discarded
>     'B_...' starts a new run
>     'I_...' continues a run started by a 'B_...'
>So, the example above should look like:
>     [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>
>At the same time that I'm extracting the runs, it's important that I 
>check for errors as well.  'I_...' must always follow 'B_...', so errors 
>look like:
>     ['O', 'I_...']
>     ['B_xxx', 'I_yyy']
>where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the 
>'B_...' is different from the suffix of the 'I_...'.

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):

 >>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
 >>> def get_runs(seq):
 ...     subseq =[]
 ...     curr = '<NO PRIOR ELEMENTS>'
 ...     for latest in seq:
 ...         curr, last = latest, curr
 ...         if curr.startswith('B_'):
 ...             if subseq: yield subseq
 ...             subseq = [curr]
 ...         elif curr.startswith('I_'):
 ...             if (last[:2] not in ('B_', 'I_') or
 ...                 last[2:] != curr[2:]
 ...             ): raise ValueError, '%r followed by %r'%(last, curr)
 ...             subseq.append(curr)
 ...         elif curr!='O':
 ...             raise ValueError, 'Unrecognized element: %r' % curr
 ...     if subseq: yield subseq
 ...
 >>> list(get_runs(L))
 [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

But note that I allowed multiple I_X, did you want to do that?
 >>> list(get_runs('B_X I_X I_X'.split()))
 [['B_X', 'I_X', 'I_X']]

Did you want all these "errors" caught?
 >>> list(get_runs('B_X I_X ?_X'.split()))
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 15, in get_runs
 ValueError: Unrecognized element: '?_X'
 >>> list(get_runs('I_X I_X ?_X'.split()))
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 12, in get_runs
 ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X'
 >>> list(get_runs('B_X I_Y ?_X'.split()))
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 12, in get_runs
 ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

>
>This is the best I've come up with so far:
>
>py> class K(object):
>...     def __init__(self):
>...         self.last_result = False
>...         self.last_label = 'O'
>...     def __call__(self, label):
>...         if label[:2] in ('O', 'B_'):
>...             self.last_result = not self.last_result
Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)

>...         elif self.last_label[2:] != label[2:]:
>...             raise ValueError('%s followed by %s' %
>...                              (self.last_label, label))
>...         self.last_label = label
>...         return self.last_result
>...
>py> def get_runs(lst):
>...     for _, item in itertools.groupby(lst, K()):
>...         result = list(item)
>...         if result != ['O']:
>...             yield result
>...
>py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
>[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>py> list(get_runs(['O', 'I_Y']))
>Traceback (most recent call last):
>   ...
>ValueError: O followed by I_Y
>py> list(get_runs(['B_X', 'I_Y']))
>Traceback (most recent call last):
>   ...
>ValueError: B_X followed by I_Y
>
>Can anyone see another way to do this?
>
Well, there's no end to "another way", but above
is my .02USD ;-)

Regards,
Bengt Richter