grouping subsequences with BIO tags

Fri Apr 22 18:34:00 EDT 2005

Bengt Richter wrote:
> On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <steven.bethard at gmail.com> wrote:
> 
>>I have a list of strings that looks something like:
>>    ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>I need to group the strings into runs (lists) using the following rules 
>>based on the string prefix:
>>    'O' is discarded
>>    'B_...' starts a new run
>>    'I_...' continues a run started by a 'B_...'
>>So, the example above should look like:
>>    [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>>
>>At the same time that I'm extracting the runs, it's important that I 
>>check for errors as well.  'I_...' must always follow 'B_...', so errors 
>>look like:
>>    ['O', 'I_...']
>>    ['B_xxx', 'I_yyy']
>>where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the 
>>'B_...' is different from the suffix of the 'I_...'.
> 
> With error checks on predecessor relationship,
> I think I'd do the whole thing in a generator,
> doing my own groupby as I went.
> E.g., see if this does what you want
> (slightly different error checking):
> 
>  >>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>  >>> def get_runs(seq):
>  ...     subseq = []
>  ...     curr = '<NO PRIOR ELEMENTS>'
>  ...     for latest in seq:
>  ...         curr, last = latest, curr
>  ...         if curr.startswith('B_'):
>  ...             if subseq: yield subseq
>  ...             subseq = [curr]
>  ...         elif curr.startswith('I_'):
>  ...             if (last[:2] not in ('B_', 'I_') or
>  ...                     last[2:] != curr[2:]):
>  ...                 raise ValueError('%r followed by %r' % (last, curr))
>  ...             subseq.append(curr)
>  ...         elif curr != 'O':
>  ...             raise ValueError('Unrecognized element: %r' % curr)
>  ...     if subseq: yield subseq
>  ...
>  >>> list(get_runs(L))
>  [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

Yeah, I started down this route and got confused by it.  Of course it 
makes perfect sense when someone writes you a working version. ;)  Thanks!
> 
> But note that I allowed multiple I_X, did you want to do that?
>  >>> list(get_runs('B_X I_X I_X'.split()))
>  [['B_X', 'I_X', 'I_X']]

Yeah, that's right.  Multiple 'I_...'s should be grouped together.

> Did you want all these "errors" caught?
>  >>> list(get_runs('B_X I_X ?_X'.split()))
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
>    File "<stdin>", line 15, in get_runs
>  ValueError: Unrecognized element: '?_X'
>  >>> list(get_runs('I_X I_X ?_X'.split()))
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
>    File "<stdin>", line 12, in get_runs
>  ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X'
>  >>> list(get_runs('B_X I_Y ?_X'.split()))
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
>    File "<stdin>", line 12, in get_runs
>  ValueError: 'B_X' followed by 'I_Y'
> 
> Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

Yeah, those are the right errors.  I'll have to think about whether I 
should be trying to catch the [^BI]_ error.  It doesn't appear in my 
data now, but that doesn't mean it won't in the future.  Thanks!
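
One way to leave the [^BI]_ question open is to make that check optional. 
What follows is only a minimal sketch along the lines of the generator 
quoted above, written to run on Python 3.  The `strict` parameter is a 
hypothetical addition of mine, not something from the quoted code: when it 
is False, unrecognized tags are skipped the way 'O' is, though an 'I_...' 
right after one is still an error because its predecessor is neither 
'B_...' nor 'I_...'.

```python
def get_runs(seq, strict=True):
    """Yield runs of BIO tags: a 'B_x' followed by zero or more 'I_x'.

    `strict` is a hypothetical flag (an assumption, not in the quoted code):
    if False, tags other than 'O', 'B_...' and 'I_...' are skipped like 'O'.
    """
    run = []
    prev = None  # previous element, or None at the start of the sequence
    for tag in seq:
        if tag.startswith('B_'):
            # A 'B_...' always closes the current run and opens a new one.
            if run:
                yield run
            run = [tag]
        elif tag.startswith('I_'):
            # 'I_x' must directly follow 'B_x' or 'I_x' with the same suffix.
            if (prev is None or prev[:2] not in ('B_', 'I_')
                    or prev[2:] != tag[2:]):
                raise ValueError('%r followed by %r' % (prev, tag))
            run.append(tag)
        elif tag != 'O' and strict:
            raise ValueError('Unrecognized element: %r' % tag)
        prev = tag
    if run:
        yield run


if __name__ == '__main__':
    tags = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
    print(list(get_runs(tags)))
    # [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
    print(list(get_runs('B_X I_X ?_X'.split(), strict=False)))
    # [['B_X', 'I_X']]
```

With the default strict=True the behavior should match the session 
above, including the ValueError on '?_X'.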

STeVe