aligning a set of word substrings to sentence

Fri Dec 2 11:34:47 EST 2005

Steven Bethard wrote:
> Michael Spencer wrote:
> 
>> Steven Bethard wrote:
>>
>>> I've got a list of word substrings (the "tokens") which I need to 
>>> align to a string of text (the "sentence").  The sentence is 
>>> basically the concatenation of the token list, with spaces sometimes 
>>> inserted between tokens.  I need to determine the start and end 
>>> offsets of each token in the sentence.  For example::
>>>
>>> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>> py> text = '''\
>>> ... She's gonna write
>>> ... a book?'''
>>> py> list(offsets(tokens, text))
>>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
>>
> [snip]
> 
>>
>> and then, for an entry in the wacky category, a difflib solution:
>>
>>  >>> def offsets(tokens, text):
>>  ...     from difflib import SequenceMatcher
>>  ...     s = SequenceMatcher(None, text, "\t".join(tokens))
>>  ...     for start, _, length in s.get_matching_blocks():
>>  ...         if length:
>>  ...             yield start, start + length
>>  ...
>>  >>> list(offsets(tokens, text))
>>  [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
> 
> 
> That's cool, I've never seen that before.  If you pass in str.isspace, 
> you can even drop the "if length:" line::
> 
> py> from difflib import SequenceMatcher
> py> def offsets(tokens, text):
> ...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
> ...     for start, _, length in s.get_matching_blocks():
> ...         yield start, start + length
> ...
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

Sorry, that should have been::

     list(offsets(tokens, text))[:-1]

since the last block returned by get_matching_blocks() is always the 
zero-length dummy, (len(a), len(b), 0).  Which means you don't really 
need str.isspace either.
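
Putting the correction together, here is a minimal sketch of the difflib 
version with the trailing dummy block sliced off inside the function.  It 
assumes the same tokens and text as in the example above and that the 
text contains no tab characters.  It relies on difflib happening to find 
one matching block per token, which is not guaranteed for arbitrary 
input.

```python
from difflib import SequenceMatcher


def offsets(tokens, text):
    """Yield (start, end) offsets in text for each matching block."""
    # Joining with tabs (assumed absent from text) keeps adjacent tokens
    # from being merged into a single matching block.
    matcher = SequenceMatcher(None, text, '\t'.join(tokens))
    # The final block is always the dummy (len(a), len(b), 0); skip it.
    for start, _, length in matcher.get_matching_blocks()[:-1]:
        yield start, start + length


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
result = list(offsets(tokens, text))
print(result)
```

Since the matching is heuristic, comparing len(result) with len(tokens) 
is a cheap sanity check before trusting the offsets on other input.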

STeVe