grouping subsequences with BIO tags
Steven Bethard
steven.bethard at gmail.com
Thu Apr 21 17:37:03 EDT 2005
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
This is the best I've come up with so far:
py> class K(object):
... def __init__(self):
... self.last_result = False
... self.last_label = 'O'
... def __call__(self, label):
... if label[:2] in ('O', 'B_'):
... self.last_result = not self.last_result
... elif self.last_label[2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (self.last_label, label))
... self.last_label = label
... return self.last_result
...
py> def get_runs(lst):
... for _, item in itertools.groupby(lst, K()):
... result = list(item)
... if result != ['O']:
... yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y
Can anyone see another way to do this?
STeVe
More information about the Python-list
mailing list