aligning a set of word substrings to sentence

Thu Dec 1 19:09:44 EST 2005

Paul McGuire wrote:
> "Steven Bethard" <steven.bethard at gmail.com> wrote in message
> news:dpWdnfW-CeBj1xLeRVn-tw at comcast.com...
> 
>>I've got a list of word substrings (the "tokens") which I need to align
>>to a string of text (the "sentence").  The sentence is basically the
>>concatenation of the token list, with spaces sometimes inserted between
>>tokens.  I need to determine the start and end offsets of each token in
>>the sentence.  For example::
>>
>>py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>py> text = '''\
>>... She's gonna write
>>... a book?'''
>>py> list(offsets(tokens, text))
>>[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
> 
> ===================
> from pyparsing import oneOf
> 
> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> text = '''\
> She's gonna write
> a book?'''
> 
> tokenlist = oneOf( " ".join(tokens) )
> offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]
> 
> print(offsets)
> ===================
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Now that's a pretty solution. Three cheers for pyparsing! :)
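
For comparison, here is a minimal plain-Python sketch of the same alignment without pyparsing. It assumes the tokens occur in the text in order and that only whitespace ever separates them. The offsets() name comes from the example above, but the body is only an illustration under those assumptions:

```python
def offsets(tokens, text):
    """Yield (start, end) offsets for each token, scanning left to right."""
    pos = 0
    for token in tokens:
        start = text.find(token, pos)
        if start < 0:
            raise ValueError("token %r not found after offset %d" % (token, pos))
        # Assumption: only whitespace may sit between consecutive tokens.
        if text[pos:start].strip():
            raise ValueError("unexpected text before token %r" % (token,))
        end = start + len(token)
        yield start, end
        pos = end  # the next search starts where this token ended


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(list(offsets(tokens, text)))
```

Unlike scanString, this version follows the token list in order and raises an error as soon as the text and the tokens disagree, so a mismatch cannot pass unnoticed.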

STeVe