how to split this kind of text into sections

Tim Chase python.list at tim.thechases.com
Sat Apr 26 12:59:56 EDT 2014


On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my
> native language, I don't know whether I can make it clear at last.

Your follow-up reply made much more sense and your written English is
far better than many native speakers'. :-)

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called
> the KEY for this section. For every 2 neighbor sections, they have
> different KEYs.

I suspect you have a minimum number of characters (or words) to
consider, otherwise a single character duplicated at the beginning of
the line would delimit a section, such as

 abcd
 afgh

because they share a leading "a".  The code I provided earlier should
give you what you describe.  I've tweaked and tested it, and included
it below.  Note that I require a minimum overlap of 6 characters
(MIN_LEN).  It also gathers the initial stuff (that you want to
discard) under the empty key, so you can either delete that or ignore
it.
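
To make the threshold concrete, here is a small sketch of the check on
its own.  It swaps in os.path.commonprefix from the standard library
(which compares strings character by character) instead of the
overlap() function in my code, and starts_section() is a hypothetical
helper name used only for illustration:

```python
import os.path

MIN_LEN = 6  # same threshold as in the message; tune it to your data

def shared_prefix(a, b):
    # os.path.commonprefix works character by character, so it accepts
    # arbitrary strings, not only file paths
    return os.path.commonprefix([a, b])

def starts_section(prev, cur, min_len=MIN_LEN):
    # hypothetical helper: True when two consecutive lines share at
    # least min_len leading characters
    return len(shared_prefix(prev, cur)) >= min_len

print(shared_prefix("abcd", "afgh"))                 # a
print(starts_section("abcd", "afgh"))                # False
print(starts_section("chapter one", "chapter two"))  # True
```

With a threshold of 6, the one-character match between "abcd" and
"afgh" no longer counts as the start of a section.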

> I need a method to split the whole text into SECTIONs and to know
> all the KEYs
> 
> I have tried to solve this problem via re module

I don't think the re module will be of much help here.

-tkc


from collections import defaultdict
import itertools as it

MIN_LEN = 6  # minimum number of shared leading characters

def overlap(s1, s2):
    "Given 2 strings, return the initial overlap between them"
    return ''.join(
        c1
        for c1, c2
        in it.takewhile(
            lambda pair: pair[0] == pair[1],
            zip(s1, s2)  # Python 3; use it.izip on Python 2
            )
        )

prevline = ""
key = ""  # the initial (empty) key under which preamble gets stored
output = defaultdict(list)
with open("data.txt") as f:
    for line in f:
        # two consecutive lines sharing MIN_LEN characters start a section
        if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
            key = overlap(prevline, line)
        output[key].append(line)
        prevline = line
for k, v in output.items():
    print(k.center(60, '='))
    print(''.join(v))
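
One thing to be aware of: the loop above only recognises a section
when it reaches the second special line, so the first special line has
already been stored with the previous key.  If both special lines
should belong to the section they open (my reading of the description,
so treat it as an assumption), a variant could look like the sketch
below.  split_sections() is a hypothetical name, it uses
os.path.commonprefix rather than overlap(), and it assumes that only
the two opening lines of a section share a long prefix:

```python
from collections import OrderedDict
from os.path import commonprefix

def split_sections(lines, min_len=6):
    """Group lines by the prefix shared by the two lines opening each section."""
    key = ""                      # preamble is collected under the empty key
    sections = OrderedDict([(key, [])])
    prev = None
    for line in lines:
        shared = commonprefix([prev, line]) if prev is not None else ""
        if len(shared) >= min_len and shared != key:
            # prev was already stored with the old key; move it over
            sections[key].pop()
            key = shared
            sections.setdefault(key, []).append(prev)
        sections[key].append(line)
        prev = line
    return sections

# made-up sample data, only to show the shape of the result
sample = [
    "some preamble\n",
    "alpha-001 first\n",
    "alpha-002 second\n",
    "body of alpha\n",
    "beta--x start\n",
    "beta--y next\n",
    "body of beta\n",
]
result = split_sections(sample)
for k, v in result.items():
    print(repr(k), len(v))
```

The keys are then simply list(result), and the preamble can be dropped
with result.pop("").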







