grouping subsequences with BIO tags

Thu Apr 21 18:30:00 EDT 2005

Steven Bethard wrote:
> I have a list of strings that looks something like:
>     ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

I'd have done it the same way as you, but here's 'another' way:

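  >>> def grp(lst):
  ...     stack = []
  ...     group = ['O']  # start "outside", so a leading I tag is rejected below
  ...     for label in lst:
  ...         prefix = label[0]
  ...         if prefix == 'B':
  ...             # a B tag opens a new group and records it
  ...             group = [label]
  ...             stack.append(group)
  ...         elif prefix == 'I':
  ...             # an I tag must continue a group of the same type
  ...             if group[0][2:] != label[2:]:
  ...                 raise ValueError('%s followed by %s' %
  ...                                  (group[0], label))
  ...             group.append(label)
  ...         elif prefix == 'O':
  ...             # an O tag closes the group; it is never added to stack
  ...             group = [label]
  ...     return stack
  ...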
  >>>

  >>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X'])
  [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
  >>>
  >>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'O', 'I_X', 'B_X'])
  Traceback (most recent call last):
    File "<input>", line 1, in ?
    File "\\CC1040907-A\MichaelDocuments\PyDev\Junk\BIO.py", line 32, in grp
      raise ValueError('%s followed by %s' %
  ValueError: O followed by I_X
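
For what it's worth, here is a rough sketch of the same loop as a standalone function that also reports a stray I tag at the very start of the list and rejects labels that are not B, I or O. The name group_bio, the 'start of list' wording and the "unknown label" check are my own assumptions for illustration, not anything from the original question. Note that it names the immediately preceding label in the error message, not the first label of the open group as grp does.

```python
def group_bio(labels):
    """Group BIO labels into chunks; 'O' labels are dropped.

    Hypothetical helper: assumes labels look like 'O', 'B_X' or 'I_X'.
    """
    chunks = []
    current = None                # chunk being built, None when outside one
    previous = 'start of list'    # used only for error messages
    for label in labels:
        prefix, _, kind = label.partition('_')
        if prefix == 'B':
            # A B tag always opens a new chunk.
            current = [label]
            chunks.append(current)
        elif prefix == 'I':
            # An I tag must continue an open chunk of the same kind.
            if current is None or current[0].partition('_')[2] != kind:
                raise ValueError('%s followed by %s' % (previous, label))
            current.append(label)
        elif prefix == 'O':
            # An O tag closes whatever chunk was open.
            current = None
        else:
            raise ValueError('unknown label %r' % (label,))
        previous = label
    return chunks
```

Using None for "no open chunk" avoids comparing against a placeholder 'O' group, so the two failure cases (no chunk open, wrong chunk type) are spelled out in one condition.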

Michael