aligning a set of word substrings to sentence
Steven Bethard
steven.bethard at gmail.com
Fri Dec 2 11:34:47 EST 2005
Steven Bethard wrote:
> Michael Spencer wrote:
>
>> Steven Bethard wrote:
>>
>>> I've got a list of word substrings (the "tokens") which I need to
>>> align to a string of text (the "sentence"). The sentence is
>>> basically the concatenation of the token list, with spaces sometimes
>>> inserted between tokens. I need to determine the start and end
>>> offsets of each token in the sentence. For example::
>>>
>>> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>> py> text = '''\
>>> ... She's gonna write
>>> ... a book?'''
>>> py> list(offsets(tokens, text))
>>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24,
>>> 25)]
>>>
>>
> [snip]
>
>>
>> and then, for an entry in the wacky category, a difflib solution:
>>
>> >>> def offsets(tokens, text):
>> ...     from difflib import SequenceMatcher
>> ...     s = SequenceMatcher(None, text, "\t".join(tokens))
>> ...     for start, _, length in s.get_matching_blocks():
>> ...         if length:
>> ...             yield start, start + length
>> ...
>> >>> list(offsets(tokens, text))
>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24,
>> 25)]
>
>
> That's cool, I've never seen that before. If you pass in str.isspace,
> you can even drop the "if length:" line::
>
> py> def offsets(tokens, text):
> ...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
> ...     for start, _, length in s.get_matching_blocks():
> ...         yield start, start + length
> ...
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24,
> 25), (25, 25)]
Sorry, that should have been::

    list(offsets(tokens, text))[:-1]

since the last item is always the zero-length one. Which means you
don't really need str.isspace either.
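
Putting the correction together, a minimal self-contained sketch might look like the following. It assumes a Python version where get_matching_blocks() returns a list (so it can be sliced), and it assumes the text contains no tab characters, so the tab separator can never be part of a match. The autojunk=False argument is my addition, not something from the thread; it only matters for long inputs. Like the original, this relies on difflib's matching heuristic and is not guaranteed to align every possible token list.

```python
from difflib import SequenceMatcher


def offsets(tokens, text):
    """Yield (start, end) offsets in text for each matching block."""
    # Assumption: text contains no tabs, so joining the tokens with tabs
    # keeps a match from running across a token boundary.
    matcher = SequenceMatcher(None, text, "\t".join(tokens), autojunk=False)
    # The final block is always the zero-length sentinel
    # (len(text), len(joined tokens), 0), so slice it off instead of
    # testing the length of every block.
    for start, _, length in matcher.get_matching_blocks()[:-1]:
        yield start, start + length


tokens = ["She", "'s", "gon", "na", "write", "a", "book", "?"]
text = "She's gonna write\na book?"
result = list(offsets(tokens, text))
print(result)
```

For the example in this thread, that prints the same eight offset pairs as before, without the trailing (25, 25).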
STeVe