grouping subsequences with BIO tags
Bengt Richter
bokr at oz.net
Fri Apr 22 01:03:07 EDT 2005
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <steven.bethard at gmail.com> wrote:
>I have a list of strings that looks something like:
> ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>I need to group the strings into runs (lists) using the following rules
>based on the string prefix:
> 'O' is discarded
> 'B_...' starts a new run
> 'I_...' continues a run started by a 'B_...'
>So, the example above should look like:
> [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>
>At the same time that I'm extracting the runs, it's important that I
>check for errors as well. 'I_...' must always follow 'B_...', so errors
>look like:
> ['O', 'I_...']
> ['B_xxx', 'I_yyy']
>where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
>'B_...' is different from the suffix of the 'I_...'.
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
>>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>> def get_runs(seq):
... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
...
>>> list(get_runs(L))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
But note that I allowed multiple I_X, did you want to do that?
>>> list(get_runs('B_X I_X I_X'.split()))
[['B_X', 'I_X', 'I_X']]
Did you want all these "errors" caught?
>>> list(get_runs('B_X I_X ?_X'.split()))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 15, in get_runs
ValueError: Unrecognized element: '?_X'
>>> list(get_runs('I_X I_X ?_X'.split()))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X'
>>> list(get_runs('B_X I_Y ?_X'.split()))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'
Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)
>
>This is the best I've come up with so far:
>
>py> class K(object):
>... def __init__(self):
>... self.last_result = False
>... self.last_label = 'O'
>... def __call__(self, label):
>... if label[:2] in ('O', 'B_'):
>... self.last_result = not self.last_result
Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)
>... elif self.last_label[2:] != label[2:]:
>... raise ValueError('%s followed by %s' %
>... (self.last_label, label))
>... self.last_label = label
>... return self.last_result
>...
>py> def get_runs(lst):
>... for _, item in itertools.groupby(lst, K()):
>... result = list(item)
>... if result != ['O']:
>... yield result
>...
>py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
>[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>py> list(get_runs(['O', 'I_Y']))
>Traceback (most recent call last):
> ...
>ValueError: O followed by I_Y
>py> list(get_runs(['B_X', 'I_Y']))
>Traceback (most recent call last):
> ...
>ValueError: B_X followed by I_Y
>
>Can anyone see another way to do this?
>
Well, there's no end to "another way", but above
is my .02USD ;-)
Regards,
Bengt Richter
More information about the Python-list
mailing list