aligning text with space-normalized text

Peter Otten __peter__ at web.de
Thu Jun 30 03:07:04 EDT 2005


Steven Bethard wrote:

> I have a string with a bunch of whitespace in it, and a series of chunks
> of that string whose indices I need to find.  However, the chunks have
> been whitespace-normalized, so that multiple spaces and newlines have
> been converted to single spaces as if by ' '.join(chunk.split()).  Some

If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

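def indices(text, chunks):
    # Iterate lazily over every run of non-whitespace characters in text.
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        # Consume one match per whitespace-separated word in the chunk.
        lump = [next(lumps) for _ in chunk.split()]
        # The chunk runs from its first word's start to its last word's end.
        yield lump[0].start(), lump[-1].end()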


def main():
    text = """\
   aaa  bb ccc
dd eee.  fff gggg
hh   i.
   jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [
        (3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.
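
Note that indices() never compares the words themselves: it assumes the chunks come in order and together account for consecutive words of the text, so a chunk that does not line up silently yields wrong offsets. If that assumption might not hold for your data, a checked variant along these lines could help. This is only a sketch; the names checked_indices and _word are made up for illustration, and raising ValueError is one possible choice for reporting a mismatch.

```python
import re

_word = re.compile(r"\S+")  # same pattern as _reLump above


def checked_indices(text, chunks):
    """Yield (start, end) for each chunk, checking every word against text."""
    lumps = _word.finditer(text)
    for chunk in chunks:
        words = chunk.split()
        if not words:
            raise ValueError("empty chunk")
        matches = []
        for word in words:
            # next() with a default avoids leaking StopIteration
            # out of the generator when the text runs out.
            match = next(lumps, None)
            if match is None:
                raise ValueError("text exhausted before chunk %r" % chunk)
            if match.group() != word:
                raise ValueError("expected %r at offset %d, found %r"
                                 % (word, match.start(), match.group()))
            matches.append(match)
        yield matches[0].start(), matches[-1].end()
```

For well-formed input it should give the same offsets as indices(); the only difference is that it fails loudly instead of drifting out of sync.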

Peter



