aligning a set of word substrings to a sentence

Thu Dec 1 14:12:03 EST 2005

I've got a list of word substrings (the "tokens") which I need to align 
to a string of text (the "sentence").  The sentence is basically the 
concatenation of the token list, with whitespace sometimes inserted between 
tokens.  I need to determine the start and end offsets of each token in 
the sentence.  For example::

py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
py> text = '''\
... She's gonna write
... a book?'''
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
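
To spell out what I mean by offsets: each pair is a half-open 
(start, end) slice into the sentence, so slicing the text with it should 
give back the token.  A quick check of that property against the expected 
output above, with the values hard-coded (the names expected and slices 
are just for this illustration):

```python
tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
expected = [(0, 3), (3, 5), (6, 9), (9, 11),
            (12, 17), (18, 19), (20, 24), (24, 25)]

# Each (start, end) pair is a half-open slice into the text, so
# slicing with it should reproduce the corresponding token.
slices = [text[start:end] for start, end in expected]
assert slices == tokens
```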

Here's my current definition of the offsets function::

py> def offsets(tokens, text):
...     start = 0
...     for token in tokens:
...         while text[start].isspace():
...             start += 1
...         text_token = text[start:start+len(token)]
...         assert text_token == token, (text_token, token)
...         yield start, start + len(token)
...         start += len(token)
...
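
For anyone who wants to run this outside the interactive prompt, here is 
roughly the same function as a plain script.  This is only a sketch: the 
ValueError in place of the assert and the bounds check on the 
whitespace loop are assumptions added for this version, so that a 
token list that runs past the end of the text fails with a readable 
error instead of an IndexError.

```python
def offsets(tokens, text):
    """Yield (start, end) offsets of each token in text, skipping whitespace."""
    start = 0
    for token in tokens:
        # Skip whitespace between tokens without running off the end
        # of the text (assumption: added for this sketch).
        while start < len(text) and text[start].isspace():
            start += 1
        end = start + len(token)
        # The text at this position must match the token exactly.
        if text[start:end] != token:
            raise ValueError('expected %r at offset %d, found %r'
                             % (token, start, text[start:end]))
        yield start, end
        start = end


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
result = list(offsets(tokens, text))
```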

I feel like there should be a simpler solution (maybe with the re 
module?) but I can't figure one out.  Any suggestions?

STeVe