aligning SGML to text

Sun Jun 18 23:05:17 EDT 2006

Gerard Flanagan wrote:
> Steven Bethard wrote:
>> I have some plain text data and some SGML markup for that text that I
>> need to align.  (The SGML doesn't maintain the original whitespace, so I
>> have to do some alignment; I can't just calculate the indices directly.)
>> For example, some of my text looks like:
>>
>> TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
>> cytoplasmic translocation and concomitant formation of an intracellular
>> signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.
>>
>> And the corresponding SGML looks like:
>>
>> <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
>> </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
>> </PROTEIN> , resulting in cytoplasmic translocation and concomitant
>> formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
>> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
>> <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
>>
>> Note that the SGML inserts spaces not only within the SGML elements, but
>> also around punctuation.
>>
>>
>> I need to determine the indices in the original text that each SGML
>> element corresponds to.  Here's some working code to do this, based on a
>> suggestion for a related problem by Fredrik Lundh[1]::
>>
>>      def align(text, sgml):
>>          sgml = sgml.replace('&', '&amp;')
>>          tree = etree.fromstring('<xml>%s</xml>' % sgml)
>>          words = []
>>          if tree.text is not None:
>>              words.extend(tree.text.split())
>>          word_indices = []
>>          for elem in tree:
>>              elem_words = elem.text.split()
>>              start = len(words)
>>              end = start + len(elem_words)
>>              word_indices.append((start, end, elem.tag))
>>              words.extend(elem_words)
>>              if elem.tail is not None:
>>                  words.extend(elem.tail.split())
>>          expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
>>          match = re.match(expr, text)
>>          assert match is not None
>>          for word_start, word_end, label in word_indices:
>>              start = match.start(word_start + 1)
>>              end = match.end(word_end)
>>              yield label, start, end
>>
> [...]
>>      >>> list(align(text, sgml))
>>      [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
>>      ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
>>      ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
>>
>> The problem is, this doesn't work when my text is long (which it is)
>> because regular expressions are limited to 100 groups.  I get an error
>> like::
> [...]
> 
> Steve
> 
> This is probably an abuse of itertools...
> 
> ---8<---
> text = '''TNF binding induces release of AIP1 (DAB2IP) from
> TNFR1, resulting in cytoplasmic translocation and concomitant
> formation of an intracellular signaling complex comprised of TRADD,
> RIP1, TRAF2, and AIPl.'''
> 
> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
> <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
> <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
> and concomitant formation of an <PROTEIN> intracellular signaling
> complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
> <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
> '''
> 
> import itertools as it
> 
> def scan(line):
>     # Group key: the tag name, i.e. everything before the first '>'.
>     if not line: return
>     line = line.strip()
>     parts = line.split('>', 1)
>     return parts[0]
> 
> def align(txt,sml):
>     i = 0
>     for k,g in it.groupby(sml.split('<'),scan):
>         g = list(g)
>         if not g[0]: continue
>         text = g[0].split('>')[1]#.replace('\n','')
>         if k.startswith('/'):
>             i += len(text)
>         else:
>             offset = len(text.strip())
>             yield k, i, i+offset
>             i += offset
> 
> print(list(align(text,sgml)))
> 
> ------------
> 
> [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
> ('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
> ('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]
> 
> It's off, possibly because of the punctuation; I can't figure it out.

Thanks for taking a look.  Yeah, the alignment's a big part of the 
problem.  It'd be really nice if the thing that gives me SGML didn't add 
whitespace haphazardly. ;-)
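
One way around the group limit might be to drop the regex entirely and
walk the text token by token, since the SGML only ever adds whitespace.
A rough sketch, assuming the markup is flat (no nested elements) and
contains no pre-escaped entities; `advance` is just a helper name made
up for this example:

```python
import xml.etree.ElementTree as etree


def advance(text, pos, chunk):
    """Match the whitespace-separated tokens of chunk in text from pos.

    Returns (start, end): where the first token starts and where the
    last one ends.  start is None if chunk has no tokens.
    """
    start = None
    for token in (chunk or '').split():
        # The SGML only adds whitespace, so skipping whitespace in the
        # text must bring us to the next token.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if not text.startswith(token, pos):
            raise ValueError('cannot align %r at offset %d' % (token, pos))
        if start is None:
            start = pos
        pos += len(token)
    return start, pos


def align(text, sgml):
    """Yield (label, start, end) offsets into text for each element."""
    # Assumption: flat markup and no pre-escaped entities in the SGML.
    sgml = sgml.replace('&', '&amp;')
    tree = etree.fromstring('<xml>%s</xml>' % sgml)
    _, pos = advance(text, 0, tree.text)
    for elem in tree:
        start, end = advance(text, pos, elem.text)
        if start is not None:
            yield elem.tag, start, end
        _, pos = advance(text, end, elem.tail)


text = 'TNF binding induces release of AIP1 (DAB2IP) from TNFR1.'
sgml = ('<PROTEIN> TNF </PROTEIN> binding induces release of '
        '<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) '
        'from <PROTEIN> TNFR1 </PROTEIN> .')
spans = list(align(text, sgml))
```

On this shortened sample, `spans` holds the same first four tuples as
the regex version above, and since nothing is compiled into one big
pattern, the length of the text should not matter.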

STeVe