how to split this kind of text into sections

Fri Apr 25 10:14:28 EDT 2014

On 2014-04-25 23:31, Chris Angelico wrote:
> On Fri, Apr 25, 2014 at 11:07 PM, oyster <lepto.python at gmail.com>
> wrote:
> > the above text should be split into a LIST with 3 items, and I
> > also need to know the KEY for LIST is ['I am section', 'let's
> > continue', 'I am using']:
> 
> It's not perfectly clear, but I think I have some idea of what
> you're trying to do. Let me restate what I think you want, and you
> can tell me if it's correct.
> 
> You have a file which consists of a number of lines. Some of those
> lines begin with the string "I am section", others begin "let's
> continue", and others begin "I am using". You want to collect those
> three sets of lines; inside each collection, every line will have
> that same prefix.
> 
> Is that correct? If so, we can certainly help you with that. If not,
> please clarify. :)

My reading of it (and it took me several tries) was that a section
starts wherever two consecutive lines begin with the same N words.
Something like the following regexp:

  ^(\w.{5,}).*\n\1.*

as the delimiter (choosing "6" arbitrarily as the minimum match
length).
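
To show how that regexp behaves, here is a small sketch using
Python's re module. The sample text is made up for illustration, and
the pattern is written with a six-character minimum (\w plus .{5,}).
Because the group is greedy and the backreference forces it to shrink
until the next line starts with it, group 1 ends up holding the
longest shared prefix of each delimiter pair.

```python
import re

# Hypothetical sample text; the section names follow the thread's example.
text = (
    "I am section one\n"
    "I am section two\n"
    "alpha\n"
    "beta\n"
    "let's continue here\n"
    "let's continue there\n"
    "gamma\n"
)

# MULTILINE makes ^ match at the start of every line, not just the string.
DELIM = re.compile(r"^(\w.{5,}).*\n\1.*", re.MULTILINE)

# group(1) is the prefix shared by the two consecutive lines.
prefixes = [m.group(1).rstrip() for m in DELIM.finditer(text)]
print(prefixes)  # ['I am section', "let's continue"]
```

Note that each match consumes both lines of the pair, so three
consecutive lines with the same prefix would only be reported once.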

A naive (and untested) bit of code might look something like

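  from collections import defaultdict

  MIN_LEN = 6  # arbitrary minimum length of a shared prefix

  def overlap(s1, s2):
    """Return the longest common prefix of s1 and s2."""
    chars = []
    for c1, c2 in zip(s1, s2):
      if c1 != c2: break
      chars.append(c1)
    return ''.join(chars)

  prevline = ""
  output = defaultdict(list)
  key = None  # lines before the first delimiter are filed under None
  with open("input.txt") as f:
    for line in f:
      if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
        # Two consecutive lines share a prefix: a new section starts
        # at prevline, so move it out of the old section.
        output[key].pop()
        key = overlap(prevline, line)
        output[key].append(prevline)
      output[key].append(line)
      prevline = line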

There are some edge-cases, such as when multiple sections are
delimited by the same overlap (their lines end up merged under one
key), but this should build up a defaultdict keyed by the delimiters
with the corresponding lines as the values.
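
One way around the repeated-delimiter edge case is to collect
(key, lines) pairs in a list instead of a dict, so two sections with
the same overlap stay separate. The following is only a sketch under
a few assumptions of mine: split_sections is a hypothetical helper
name, delimiter lines are assumed to come in pairs, and
os.path.commonprefix stands in for the hand-rolled overlap(). It
files the first line of each delimiter pair under the new section.

```python
from os.path import commonprefix


def split_sections(lines, min_len=6):
    """Return a list of (key, section_lines) pairs.

    A section starts where two consecutive lines share a prefix of at
    least min_len characters; the shared prefix becomes the key.
    """
    sections = [(None, [])]  # lines before any delimiter get key None
    prev = ""
    for raw in lines:
        line = raw.rstrip("\n")
        shared = commonprefix([prev, line])
        if len(shared) >= min_len:
            # prev was already filed under the previous section;
            # take it back out and open a new section with it.
            sections[-1][1].pop()
            sections.append((shared.rstrip(), [prev]))
        sections[-1][1].append(line)
        prev = line
    # Drop sections left empty (e.g. the None one when the input
    # starts with a delimiter pair).
    return [(key, body) for key, body in sections if body]


sample = [
    "I am section one\n",
    "I am section two\n",
    "alpha\n",
    "let's continue here\n",
    "let's continue there\n",
    "beta\n",
]
for key, body in split_sections(sample):
    print(repr(key), body)
```

Passing an open file works the same way, since the function only
iterates over its argument line by line.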

-tkc