aligning a set of word substrings to sentence

Fri Dec 2 08:26:14 EST 2005

Steven Bethard wrote:

>>> I feel like there should be a simpler solution (maybe with the re
>>> module?) but I can't figure one out.  Any suggestions?
>>
>> using the finditer pattern I just posted in another thread:
>>
>> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>> text = '''\
>> She's gonna write
>> a book?'''
>>
>> import re
>>
>> tokens.sort() # lexical order
>> tokens.reverse() # look for longest match first
>> pattern = "|".join(map(re.escape, tokens))
>> pattern = re.compile(pattern)
>>
>> I get
>>
>> print [m.span() for m in pattern.finditer(text)]
>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>
>> which seems to match your version pretty well.
>
> That's what I was looking for.  Thanks!

except that I misread your problem statement; the RE solution above allows the
tokens to be specified in arbitrary order.  if they're always in the same order
as in the text, you can replace the code with something like:

    # match tokens plus optional whitespace between each token
    pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
    m = re.match(pattern, text)
    result = (m.span(i+1) for i in range(len(tokens)))

which is 6-7 times faster than the previous solution, on my machine.
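
for reference, here is a self-contained sketch of the ordered approach, run on
the sample data from earlier in the thread.  the function name align_tokens and
the ValueError on a failed match are my own assumptions for illustration, not
part of the snippet above, which assumes the match always succeeds:

```python
import re

def align_tokens(tokens, text):
    """Return (start, end) spans for tokens that occur in order in text."""
    # one capturing group per token, with optional whitespace between tokens
    pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
    m = re.match(pattern, text)
    if m is None:
        # hypothetical error handling: the tokens are out of order, or the
        # text contains something other than whitespace between them
        raise ValueError("tokens do not align with text")
    # group numbers start at 1; group 0 is the whole match
    return [m.span(i + 1) for i in range(len(tokens))]

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
spans = align_tokens(tokens, text)
print(spans)
```

this should print the same spans as the finditer version.  note that re.match
anchors at the start of the text, so leading whitespace or any stray characters
between tokens make the match fail rather than being skipped.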

</F>