aligning text with space-normalized text
Steven Bethard
steven.bethard at gmail.com
Wed Jun 29 20:50:40 EDT 2005
I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:
py> text = """\
... aaa bb ccc
... dd eee. fff gggg
... hh i.
... jjj kk.
... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:
py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:
py> for s, e in result:
...     print repr(text[s:e])
...
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'
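The relationship between a raw span and its chunk can be checked directly. This is only a minimal sketch: `normalize` is a hypothetical helper name wrapping the ' '.join(chunk.split()) idiom mentioned above, and the raw string is made-up sample data, not the exact text from this post.

```python
def normalize(s):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace.
    return ' '.join(s.split())

# Made-up raw span with irregular whitespace (an assumption for illustration).
raw = 'fff  gggg\nhh  i.'
print(repr(normalize(raw)))
```

Any span that the index-finding code returns should normalize to exactly the chunk it was found for.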
I'm trying to write code to produce the indices. Here's what I have:
py> def get_indices(text, chunks):
...     chunks = iter(chunks)
...     chunk = None
...     for text_index, c in enumerate(text):
...         if c.isspace():
...             continue
...         if chunk is None:
...             chunk = chunks.next().replace(' ', '')
...             chunk_start = text_index
...             chunk_index = 0
...         if c != chunk[chunk_index]:
...             raise Exception('unmatched: %r %r' %
...                             (c, chunk[chunk_index]))
...         else:
...             chunk_index += 1
...             if chunk_index == len(chunk):
...                 yield chunk_start, text_index + 1
...                 chunk = None
...
And it appears to work:
py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True
But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?
Thanks in advance,
STeVe
[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)
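One possible direction, offered only as a sketch and not as a definitive answer: build a regular expression from each chunk that allows any run of whitespace between words, then match it at the current position in the text. The name `get_indices_re` is hypothetical, the code targets Python 3, and the whitespace in the sample `text` is an assumption chosen to agree with the spans in `result` above.

```python
import re

def get_indices_re(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    pos = 0
    for chunk in chunks:
        # Escape each word literally and allow any whitespace run between words.
        body = r'\s+'.join(re.escape(word) for word in chunk.split())
        # Skip leading whitespace, then capture the chunk itself in group 1.
        pattern = re.compile(r'\s*(' + body + r')')
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError('unmatched chunk: %r at position %d' % (chunk, pos))
        yield match.span(1)
        pos = match.end()

# Sample data; the exact whitespace here is an assumption (see above).
text = (
    "   aaa  bb ccc\n"
    "dd eee.  fff  gggg\n"
    "hh  i.\n"
    "   jjj kk.\n"
)
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
print(list(get_indices_re(text, chunks)))
```

Unlike the character-by-character version, this sketch requires whitespace in the text wherever the chunk has a space, so it would reject a text that splits a single chunk word across whitespace.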