grouping subsequences with BIO tags

Thu Apr 21 17:37:03 EDT 2005

I have a list of strings that looks something like:
     ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules 
based on the string prefix:
     'O' is discarded
     'B_...' starts a new run
     'I_...' continues a run started by a 'B_...'
So, the example above should be grouped as:
     [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
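
As a minimal sketch of just these grouping rules, with no error checking
at all, a plain loop could look like the following.  The name
group_runs_unchecked is hypothetical and chosen only for illustration:

```python
def group_runs_unchecked(labels):
    """Group BIO labels into runs (hypothetical helper, no validation)."""
    runs = []
    for label in labels:
        if label.startswith('B_'):
            runs.append([label])       # 'B_...' starts a new run
        elif label.startswith('I_') and runs:
            runs[-1].append(label)     # 'I_...' continues the latest run
        # 'O' (and anything else) is discarded
    return runs

print(group_runs_unchecked(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
# prints [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
```

Because nothing is validated here, an input such as ['B_X', 'O', 'I_X']
would silently attach the 'I_X' to the earlier run.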

At the same time that I'm extracting the runs, it's important that I 
check for errors as well.  An 'I_...' must always continue a run started 
by a 'B_...' with the same suffix, so errors look like:
     ['O', 'I_...']
     ['B_xxx', 'I_yyy']
where the 'I_...' follows either an 'O' or a 'B_...' whose suffix differs 
from the suffix of the 'I_...'.
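
Taken on its own, the error rule is a check on each adjacent pair of
labels.  A minimal sketch, assuming a hypothetical helper named
check_transition (not part of any library):

```python
def check_transition(prev, label):
    """Raise ValueError if an 'I_...' label does not continue a matching run.

    Hypothetical helper: prev is the preceding label ('O' at the start).
    """
    if label.startswith('I_'):
        # An 'I_...' may only follow a 'B_...' or 'I_...' with the same suffix
        if prev == 'O' or prev[2:] != label[2:]:
            raise ValueError('%s followed by %s' % (prev, label))

# The two error cases described above, plus one valid pair
for prev, label in [('O', 'I_Y'), ('B_X', 'I_Y'), ('B_X', 'I_X')]:
    try:
        check_transition(prev, label)
        print('%s, %s: ok' % (prev, label))
    except ValueError as exc:
        print('error: %s' % exc)
```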

This is the best I've come up with so far:

py> import itertools
py> class K(object):
...     # groupby key function: the returned value flips each time a new
...     # group should start, i.e. at every 'O' and every 'B_...'
...     def __init__(self):
...         self.last_result = False
...         self.last_label = 'O'
...     def __call__(self, label):
...         if label[:2] in ('O', 'B_'):
...             self.last_result = not self.last_result
...         elif self.last_label[2:] != label[2:]:
...             # an 'I_...' whose suffix does not match the previous label
...             raise ValueError('%s followed by %s' %
...                              (self.last_label, label))
...         self.last_label = label
...         return self.last_result
...
py> def get_runs(lst):
...     for _, item in itertools.groupby(lst, K()):
...         result = list(item)
...         if result != ['O']:    # each 'O' forms its own group; drop it
...             yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
   ...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
   ...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

STeVe