aligning SGML to text

Sun Jun 18 16:38:12 EDT 2006

Steven Bethard wrote:
> I have some plain text data and some SGML markup for that text that I
> need to align.  (The SGML doesn't maintain the original whitespace, so I
> have to do some alignment; I can't just calculate the indices directly.)
>   For example, some of my text looks like:
>
> TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
> cytoplasmic translocation and concomitant formation of an intracellular
> signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.
>
> And the corresponding SGML looks like:
>
> <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
> </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
> </PROTEIN> , resulting in cytoplasmic translocation and concomitant
> formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
> <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
>
> Note that the SGML inserts spaces not only within the SGML elements, but
> also around punctuation.
>
>
> I need to determine the indices in the original text that each SGML
> element corresponds to.  Here's some working code to do this, based on a
> suggestion for a related problem by Fredrik Lundh[1]::
>
>      def align(text, sgml):
>          sgml = sgml.replace('&', '&amp;')
>          tree = etree.fromstring('<xml>%s</xml>' % sgml)
>          words = []
>          if tree.text is not None:
>              words.extend(tree.text.split())
>          word_indices = []
>          for elem in tree:
>              elem_words = elem.text.split()
>              start = len(words)
>              end = start + len(elem_words)
>              word_indices.append((start, end, elem.tag))
>              words.extend(elem_words)
>              if elem.tail is not None:
>                  words.extend(elem.tail.split())
>          expr = '\s*'.join('(%s)' % re.escape(word) for word in words)
>          match = re.match(expr, text)
>          assert match is not None
>          for word_start, word_end, label in word_indices:
>              start = match.start(word_start + 1)
>              end = match.end(word_end)
>              yield label, start, end
>
[...]
>      >>> list(align(text, sgml))
>      [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
>      ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
>      ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
>
> The problem is, this doesn't work when my text is long (which it is)
> because regular expressions are limited to 100 groups.  I get an error
> like::
[...]

Steve

This is probably an abuse of itertools...

---8<---
text = '''TNF binding induces release of AIP1 (DAB2IP) from
TNFR1, resulting in cytoplasmic translocation and concomitant
formation of an intracellular signaling complex comprised of TRADD,
RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
<PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
and concomitant formation of an <PROTEIN> intracellular signaling
complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
<PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

import itertools as it

def scan(line):
    # groupby key: the tag name, i.e. everything before the first '>'
    if not line: return
    line = line.strip()
    parts = line.split('>', 1)
    return parts[0]

def align(txt,sml):
    # txt is not consulted; offsets are counted over the SGML text itself
    i = 0
    for k,g in it.groupby(sml.split('<'),scan):
        g = list(g)
        if not g[0]: continue
        text = g[0].split('>')[1]#.replace('\n','')
        if k.startswith('/'):
            # text following a closing tag: count it in full
            i += len(text)
        else:
            # text inside an element: ignore the padding spaces
            offset = len(text.strip())
            yield k, i, i+offset
            i += offset

print(list(align(text,sgml)))

------------

[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

It's off, possibly because of the punctuation; I can't figure it out.
Maybe you can tweak it?
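
A guess at the cause: align() above never looks at txt, so every extra
space the SGML puts around punctuation (" ( " where the text has " (")
pushes the later offsets to the right. One possible tweak, as a rough
Python 3 sketch rather than a tested solution, is to walk both strings
together and compare only the non-whitespace characters. The name
align_chars and the sample strings are made up for illustration, and it
assumes plain <TAG> ... </TAG> markup with no attributes or entities.

```python
import re

def align_chars(text, sgml):
    """Yield (label, start, end) offsets into text for each SGML element.

    Assumes simple <TAG> ... </TAG> markup without attributes or entities,
    and that text and sgml agree once whitespace is ignored.
    """
    pos = 0       # next unread index in text
    last_end = 0  # index just past the last matched character
    stack = []    # (label, start) for elements that are still open
    # With one capturing group, re.split alternates text chunks (even
    # positions) and tag contents (odd positions).
    for n, piece in enumerate(re.split(r'<([^<>]+)>', sgml)):
        if n % 2 == 0:
            for ch in piece:
                if ch.isspace():
                    continue  # whitespace in the SGML is ignored
                while pos < len(text) and text[pos].isspace():
                    pos += 1
                if pos >= len(text) or text[pos] != ch:
                    raise ValueError('mismatch at text index %d' % pos)
                pos += 1
                last_end = pos
        elif piece.startswith('/'):
            # Closing tag: the element ends after the last matched character.
            label, start = stack.pop()
            yield label, start, last_end
        else:
            # Opening tag: the element starts at the next non-space character.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            stack.append((piece.strip(), pos))

# Shortened sample, made up from the first part of the example above.
sample_text = 'TNF binding induces release of AIP1 (DAB2IP) from\nTNFR1, resulting'
sample_sgml = ('<PROTEIN> TNF </PROTEIN> binding induces release of\n'
               '<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from\n'
               '<PROTEIN> TNFR1 </PROTEIN> , resulting\n')
spans = list(align_chars(sample_text, sample_sgml))
print(spans)
# [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43), ('PROTEIN', 50, 55)]
```

Since this never builds one big regular expression, the 100-group limit
should not come into play, and nested elements are handled by the stack
(inner elements are yielded before outer ones).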

hth

Gerard