grouping subsequences with BIO tags
Steven Bethard
steven.bethard at gmail.com
Fri Apr 22 18:34:00 EDT 2005
Bengt Richter wrote:
> On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <steven.bethard at gmail.com> wrote:
>
>>I have a list of strings that looks something like:
>> ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>I need to group the strings into runs (lists) using the following rules
>>based on the string prefix:
>> 'O' is discarded
>> 'B_...' starts a new run
>> 'I_...' continues a run started by a 'B_...'
>>So, the example above should look like:
>> [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>>
>>At the same time that I'm extracting the runs, it's important that I
>>check for errors as well. 'I_...' must always follow 'B_...', so errors
>>look like:
>> ['O', 'I_...']
>> ['B_xxx', 'I_yyy']
>>where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
>>'B_...' is different from the suffix of the 'I_...'.
>
> With error checks on predecessor relationship,
> I think I'd do the whole thing in a generator,
> doing my own groupby as I went.
> E.g., see if this does what you want
> (slightly different error checking):
>
> >>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
> >>> def get_runs(seq):
> ... subseq =[]
> ... curr = '<NO PRIOR ELEMENTS>'
> ... for latest in seq:
> ... curr, last = latest, curr
> ... if curr.startswith('B_'):
> ... if subseq: yield subseq
> ... subseq = [curr]
> ... elif curr.startswith('I_'):
> ... if (last[:2] not in ('B_', 'I_') or
> ... last[2:] != curr[2:]
> ... ): raise ValueError, '%r followed by %r'%(last, curr)
> ... subseq.append(curr)
> ... elif curr!='O':
> ... raise ValueError, 'Unrecognized element: %r' % curr
> ... if subseq: yield subseq
> ...
> >>> list(get_runs(L))
> [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
Yeah, I started this route, and got confused by it. Of course it makes
perfect sense when someone writes you a working version. ;) Thanks!
>
> But note that I allowed multiple I_X, did you want to do that?
> >>> list(get_runs('B_X I_X I_X'.split()))
> [['B_X', 'I_X', 'I_X']]
Yeah, that's right. Multiple 'I_...'s should be grouped together.
> Did you want all these "errors" caught?
> >>> list(get_runs('B_X I_X ?_X'.split()))
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "<stdin>", line 15, in get_runs
> ValueError: Unrecognized element: '?_X'
> >>> list(get_runs('I_X I_X ?_X'.split()))
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "<stdin>", line 12, in get_runs
> ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X'
> >>> list(get_runs('B_X I_Y ?_X'.split()))
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "<stdin>", line 12, in get_runs
> ValueError: 'B_X' followed by 'I_Y'
>
> Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)
Yeah, those are the right errors. I'll have to think about whether I
should be trying to catch the [^BI]_ error. It doesn't appear in my
data now, but that doesn't mean it might not in the future. Thanks!
STeVe
More information about the Python-list
mailing list