how to split this kind of text into sections

Sat Apr 26 22:49:31 EDT 2014

On Sat, 26 Apr 2014 23:53:14 +0800, oyster wrote:

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called the
> KEY for this section. For every 2 neighbor sections, they have different
> KEYs.
> 
> After these 2 special lines, some paragraph is followed. Paragraph does
> not have any KEYs.
> 
> So, a section = 2 special lines with KEYs at the beginning + some
> paragraph without KEYs
> 
> However there maybe some paragraph before the first section, which I do
> not need and want to drop it
> 
> I need a method to split the whole text into SECTIONs and to know all
> the KEYs

Let me try to describe how I would solve this, in English.

I would look at each pair of lines (1st + 2nd, 2nd + 3rd, 3rd + 4th, 
etc.) looking for a pair of lines with matching prefixes. E.g.:

"This line matches the next"
"This line matches the previous"

do match, because they both start with "This line matches the ".

Question: how many characters in common counts as a match?

"This line matches the next"
"That previous line matches this line"

have a common prefix of "Th", two characters. Is that a match?

So let me start with a function to extract the matching prefix, if there 
is one. It returns '' if there is no match, and the prefix (the KEY) if 
there is one:

def extract_key(line1, line2):
    """Return the key from two matching lines, or '' if not matching."""
    # Assume they need five characters in common.
    if line1[:5] == line2[:5]:
        return line1[:5]
    return ''

I'm pretty much guessing that this is how you decide there's a match. I 
don't know if five characters is too many or too few, or if you need a 
more complicated test. It seems that you want to match as many characters 
as possible. I'll leave you to adjust this function to work exactly as 
needed.
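
If matching as many characters as possible is what's wanted, one possible 
adjustment is sketched below. The name common_key and the min_len 
threshold are my own assumptions, not something from the original 
question; os.path.commonprefix is a standard library function that 
compares strings character by character.

```python
import os.path

def common_key(line1, line2, min_len=5):
    """Return the longest common prefix of two lines, or '' if it is
    shorter than min_len characters (an assumed threshold)."""
    prefix = os.path.commonprefix([line1, line2])
    return prefix if len(prefix) >= min_len else ''

matching = common_key("This line matches the next",
                      "This line matches the previous")
too_short = common_key("This line matches the next",
                       "That previous line matches this line")
```

Here matching is the whole shared prefix "This line matches the ", while 
too_short is '' because "Th" is under the threshold.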

Now we iterate over the text in pairs of lines. We need somewhere to hold 
the lines in each section, so I'm going to use a dict of lists of 
lines. As a bonus, I'm going to collect the ignored lines using a key of 
None. However, I do assume that all keys are unique. It should be easy 
enough to adjust the following to handle non-unique keys. (Use a list of 
lists, rather than a dict, and save the keys in a separate list.)

Lastly, the way it handles lines at the beginning of a section is not 
exactly the way you want it. This puts the *first* line of the section as 
the *last* line of the previous section. I will leave you to sort out 
that problem.

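from collections import OrderedDict

# `text` is assumed to hold the whole input as a single string.
section = []              # lines seen before the first section
sections = OrderedDict()  # maps key -> list of lines, in order found
sections[None] = section
prev_line = ''
for line in text.split('\n'):
    key = extract_key(prev_line, line)
    if key == '':
        # No match, so we're still in the same section as before.
        section.append(line)
    else:
        # Match, so we start a new section. Note that prev_line (the
        # first special line) has already gone into the previous section.
        section = [line]
        sections[key] = section
    prev_line = line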
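
For what it's worth, here is a rough sketch of one way to sort out both 
loose ends: it looks ahead one line so that both special lines land in 
the new section, and it keeps keys in a separate list so that repeated 
keys don't overwrite each other. The names common_key and split_sections 
and the min_len threshold are assumptions of mine. It also assumes that 
two adjacent paragraph lines never share min_len leading characters, 
which may well be false for real text.

```python
import os.path

def common_key(line1, line2, min_len=5):
    """Longest common prefix, or '' if shorter than min_len (assumed)."""
    prefix = os.path.commonprefix([line1, line2])
    return prefix if len(prefix) >= min_len else ''

def split_sections(text, min_len=5):
    """Return (preamble, keys, sections); sections[i] belongs to keys[i]."""
    lines = text.splitlines()
    preamble = []        # lines before the first section (to be dropped)
    keys = []            # one entry per section; duplicates are allowed
    sections = []        # list of lists of lines
    current = preamble
    i = 0
    while i < len(lines):
        # Compare this line with the next one, if there is a next one.
        key = ''
        if i + 1 < len(lines):
            key = common_key(lines[i], lines[i + 1], min_len)
        if key:
            # Both special lines open the new section.
            current = [lines[i], lines[i + 1]]
            keys.append(key)
            sections.append(current)
            i += 2
        else:
            current.append(lines[i])
            i += 1
    return preamble, keys, sections

sample = "\n".join([
    "junk", "AAAAA one", "AAAAA two", "para a",
    "BBBBB one", "BBBBB two", "para b", "AAAAA x", "AAAAA y",
])
preamble, keys, sections = split_sections(sample)
```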

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/