aligning a set of word substrings to sentence

Thu Dec 1 17:54:04 EST 2005

"Steven Bethard" <steven.bethard at gmail.com> wrote in message
news:dpWdnfW-CeBj1xLeRVn-tw at comcast.com...
> I've got a list of word substrings (the "tokens") which I need to align
> to a string of text (the "sentence").  The sentence is basically the
> concatenation of the token list, with spaces sometimes inserted between
> tokens.  I need to determine the start and end offsets of each token in
> the sentence.  For example::
>
> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> py> text = '''\
> ... She's gonna write
> ... a book?'''
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>

Hey, I get the same answer with this:

===================
from pyparsing import oneOf

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

tokenlist = oneOf( " ".join(tokens) )
offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]

print(offsets)
===================
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Of course, pyparsing may be a bit heavyweight to drag into a simple function
like this, and it is certainly not nearly as fast as a regexp.  But it was
such a nice way to show how scanString works.

Pyparsing's "oneOf" helper function takes care of the same longest-match
issues that Fredrik Lundh handles using sort, reverse, etc.  This only works
as long as none of the tokens has an embedded space character, since oneOf
splits the string it is given on whitespace.
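
For comparison, here is a rough sketch of the longest-match idea using only
the re module.  It is not Fredrik's actual code, just an illustration of why
sorting matters: the alternatives are ordered longest first, so a short token
cannot win over a longer one that begins with it.  The name scan_offsets is
made up for this example.  Like the scanString version, it scans the text for
any token rather than checking the tokens in their list order.

```python
import re

def scan_offsets(tokens, text):
    """Return (start, end) for each token occurrence found while scanning text.

    scan_offsets is a hypothetical helper name, not part of any library.
    """
    # Put longer alternatives first: regex alternation takes the first
    # branch that matches, so 'a' listed before 'an' would shadow 'an'.
    alternatives = sorted(set(tokens), key=len, reverse=True)
    # re.escape keeps characters such as '?' from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(tok) for tok in alternatives))
    return [match.span() for match in pattern.finditer(text)]

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(scan_offsets(tokens, text))
```

On the example above this should print the same list of offsets as the
pyparsing version.  Unlike the oneOf call, it takes the tokens as a list, so
embedded spaces in a token are not a problem here.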

-- Paul