aligning a set of word substrings to sentence

Fri Dec 2 10:42:36 EST 2005

Michael Spencer wrote:
> Steven Bethard wrote:
> 
>> I've got a list of word substrings (the "tokens") which I need to 
>> align to a string of text (the "sentence").  The sentence is basically 
>> the concatenation of the token list, with spaces sometimes inserted 
>> between tokens.  I need to determine the start and end offsets of 
>> each token in the sentence.  For example::
>>
>> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>> py> text = '''\
>> ... She's gonna write
>> ... a book?'''
>> py> list(offsets(tokens, text))
>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>
>
[snip]
>
> and then, for an entry in the wacky category, a difflib solution:
>
>  >>> def offsets(tokens, text):
>  ...     from difflib import SequenceMatcher
>  ...     s = SequenceMatcher(None, text, "\t".join(tokens))
>  ...     for start, _, length in s.get_matching_blocks():
>  ...         if length:
>  ...             yield start, start + length
>  ...
>  >>> list(offsets(tokens, text))
>  [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

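That's cool, I've never seen that before.  If you pass in str.isspace 
as the junk function, you can even drop the "if length:" line, at the 
cost of a zero-length (25, 25) terminator block at the end of the 
output::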

py> def offsets(tokens, text):
...     from difflib import SequenceMatcher
...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         yield start, start + length
...
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]
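
The trailing (25, 25) is the dummy block that get_matching_blocks() always puts last, of the form (len(a), len(b), 0). Here is a minimal self-contained sketch that slices that block off instead of testing each length. It assumes the text itself never contains a tab, and I have only checked it against the example above, so treat it as an illustration rather than a general aligner:

```python
from difflib import SequenceMatcher

def offsets(tokens, text):
    # Assumption: text contains no tabs, so the tabs in the joined token
    # string only separate tokens and can never take part in a match.
    matcher = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
    # get_matching_blocks() always ends with a dummy (len(a), len(b), 0)
    # block; slice it off rather than checking every length.
    for start, _, length in matcher.get_matching_blocks()[:-1]:
        yield start, start + length

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(list(offsets(tokens, text)))
```

For the example sentence this yields the same eight offset pairs as the original version with the "if length:" test.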

I think I'm going to have to take a closer look at 
difflib.SequenceMatcher; I have to do things similar to this pretty often...

STeVe