aligning a set of word substrings to sentence
Steven Bethard
steven.bethard at gmail.com
Thu Dec 1 14:12:03 EST 2005
I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). The sentence is basically the
concatenation of the token list, with spaces sometimes inserted between
tokens. I need to determine the start and end offsets of each token in
the sentence. For example::
py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
py> text = '''\
... She's gonna write
... a book?'''
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
Here's my current definition of the offsets function::
py> def offsets(tokens, text):
...     start = 0
...     for token in tokens:
...         while text[start].isspace():
...             start += 1
...         text_token = text[start:start+len(token)]
...         assert text_token == token, (text_token, token)
...         yield start, start + len(token)
...         start += len(token)
...
I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?
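A rough sketch of the kind of re-based version I have in mind is below. It is only a sketch: the name ``offsets_re`` is a placeholder, and it assumes that the tokens appear in order and are separated by nothing but optional whitespace. Each token is escaped and wrapped in its own group, the groups are joined with ``\s*``, and the span of each group gives the offsets.

```python
import re

def offsets_re(tokens, text):
    """Return (start, end) offsets of each token in text.

    Assumes text is the tokens in order, with optional whitespace
    between (and before) them. offsets_re is a hypothetical name.
    """
    # One capturing group per token; re.escape keeps characters such
    # as '?' from being treated as regex syntax.
    pattern = r'\s*'.join('(%s)' % re.escape(token) for token in tokens)
    match = re.match(r'\s*' + pattern, text)
    if match is None:
        raise ValueError('tokens do not align with text')
    # Group numbers start at 1; span(i) is the (start, end) of group i.
    return [match.span(i) for i in range(1, len(tokens) + 1)]

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(offsets_re(tokens, text))
```

This builds one group per token, so a very long token list means a very large pattern, and some versions of re limit the number of groups. It also reports a mismatch only as a whole, without saying which token failed, which the assert in the generator version does.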
STeVe