aligning text with space-normalized text

Steven Bethard steven.bethard at gmail.com
Wed Jun 29 20:50:40 EDT 2005


I have a string with a bunch of whitespace in it, and a series of chunks 
of that string whose indices I need to find.  However, the chunks have 
been whitespace-normalized, so that multiple spaces and newlines have 
been converted to single spaces as if by ' '.join(chunk.split()).  Some 
example data to clarify my problem:

py> text = """\
...    aaa  bb ccc
... dd eee.  fff gggg
... hh   i.
...    jjj kk.
... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words, 
but the corresponding "chunks" have only single space characters between 
"words".  I'm looking for the (start, end) indices of each chunk in the 
original text, with the end index exclusive as in a slice, so for this 
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the 
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
...     print(repr(text[s:e]))
...
'aaa  bb'
'ccc\ndd eee.'
'fff gggg\nhh   i.'
'jjj'
'kk.'
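
The relationship between the spans and the chunks can be stated as a quick check: normalizing each span of the original text should give back the corresponding chunk. A minimal sketch, written for Python 3; the helper name normalize is only for illustration and is not part of the code in this post:

```python
text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

def normalize(s):
    """Collapse every run of whitespace to one space and trim both ends."""
    return ' '.join(s.split())

# Each span of the original text, once normalized, should equal its chunk.
normalized_spans = [normalize(text[s:e]) for s, e in result]
print(normalized_spans == chunks)
```

Any candidate solution can be validated the same way, without hard-coding the expected indices.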

I'm trying to write code to produce the indices.  Here's what I have:

py> def get_indices(text, chunks):
...     chunks = iter(chunks)
...     chunk = None
...     for text_index, c in enumerate(text):
...         if c.isspace():
...             continue
...         if chunk is None:
...             chunk = next(chunks).replace(' ', '')
...             chunk_start = text_index
...             chunk_index = 0
...         if c != chunk[chunk_index]:
...             raise Exception('unmatched: %r %r' %
...                           (c, chunk[chunk_index]))
...         else:
...             chunk_index += 1
...             if chunk_index == len(chunk):
...                 yield chunk_start, text_index + 1
...                 chunk = None
...

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True
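
Since "appears to work" rests on a single example, here is a self-contained sketch that reruns it and probes two edge cases. It is a Python 3 port of the function above (next(chunks) instead of chunks.next()); the inputs 'ab' and 'abc' are made up for illustration:

```python
def get_indices(text, chunks):
    """Yield (start, end) spans of text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue
        if chunk is None:
            # Begin a new chunk; only its non-space characters are compared.
            chunk = next(chunks).replace(' ', '')
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise Exception('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):
            yield chunk_start, text_index + 1
            chunk = None

text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
spans = list(get_indices(text, chunks))
print(spans)

# Spaces are stripped from the chunk, so word boundaries are not checked:
# the chunk 'a b' is accepted against the text 'ab'.
loose = list(get_indices('ab', ['a b']))
print(loose)

# A differing non-space character triggers the 'unmatched' exception.
try:
    list(get_indices('abc', ['abd']))
except Exception as exc:
    mismatch_message = str(exc)
print(mismatch_message)
```

The second case shows that the comparison ignores where the whitespace falls, which may or may not matter for the data at hand.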

But it seems somewhat inelegant.  Can anyone see an easier/cleaner/more 
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms.  I'm looking for 
subjectively "better" solutions. ;)
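
One possible direction, offered only as a sketch and not as an answer from the thread: turn each chunk into a regular expression that allows any run of whitespace between words, and match the chunks in order. The name get_indices_re and the choice of ValueError are assumptions made for illustration (Python 3):

```python
import re

def get_indices_re(text, chunks):
    """Yield (start, end) spans by matching each chunk in order with a regex."""
    pos = 0
    for chunk in chunks:
        # Match each word literally, with any whitespace run between words.
        words = r'\s+'.join(re.escape(word) for word in chunk.split())
        # Only whitespace may separate one chunk from the next.
        match = re.compile(r'\s*(%s)' % words).match(text, pos)
        if match is None:
            raise ValueError('unmatched chunk: %r' % chunk)
        yield match.span(1)
        pos = match.end(1)

text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
spans_re = list(get_indices_re(text, chunks))
print(spans_re)
```

Unlike the character-by-character version, this sketch requires at least one whitespace character wherever the chunk has a space, so it would reject the text 'ab' for the chunk 'a b'.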


