how to split this kind of text into sections

oyster lepto.python at gmail.com
Sat Apr 26 11:53:14 EDT 2014


First of all, thank you all for your answers. I receive the Python
mailing list as a daily digest, so it is not easy for me to quote your
mails separately.

I will try my best to explain my situation, but English is not my
native language, so I don't know whether I can make it clear.

Every SECTION starts with 2 special lines. These 2 lines are special
because they begin with the same characters (the length is not constant
across sections); these shared characters are called the KEY for the
section. Any 2 neighboring sections have different KEYs.
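
To make the definition of a KEY concrete, here is a minimal sketch that computes the characters two lines share at the beginning. The helper name shared_key is hypothetical; os.path.commonprefix is a standard-library function that compares strings character by character.

```python
from os.path import commonprefix


def shared_key(line1, line2):
    """Return the longest run of characters both lines start with ('' if none)."""
    # commonprefix works character by character on any list of strings,
    # not only on file paths.
    return commonprefix([line1, line2])


print(repr(shared_key("Programming is hard", "Programming is easy")))
# 'Programming is '
```

Note that a character-level prefix can run into the next word: for "I am using python" and "I am using perl" it yields 'I am using p'. Trimming such a result back to the last space would be a further assumption about what a KEY may look like.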

After these 2 special lines come some paragraphs. The paragraphs
do not have any KEYs.

So, a section = 2 special lines with the KEY at the beginning + some
paragraphs without KEYs

However, there may be some paragraphs before the first section, which I
do not need and want to drop.

I need a method to split the whole text into SECTIONs and to know all the KEYs.

I have tried to solve this problem with the re module, but failed. Maybe I
can make myself clearer by showing the regular expression
object
reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?",
re.DOTALL)
which can get the first 2 lines of a section, but fails to get the rest
of the section, which does not have any KEYs at the beginning. The hard
part for me is to express "paragraph does not have KEYs".
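
One possible way to express "paragraph does not have KEYs" in a regular expression is a negative lookahead placed at the start of every paragraph line. The following is only a sketch under assumptions that are not in the description above: line endings are already normalized to "\n", and a KEY is at least 3 characters long (without some minimum, any two lines sharing a single character would count). The names SECTION_RE, key and nxt are illustrative.

```python
import re

# Assumptions (illustrative, not from the post): "\n" line endings and a
# minimum KEY length of 3 characters.
SECTION_RE = re.compile(
    # The 2 special lines: a greedy prefix that is repeated on the next line.
    r"^(?P<key>.{3,}).*\n(?P=key).*(?:\n|\Z)"
    # Paragraph lines: any line that does NOT begin a new pair of special lines.
    r"(?:(?!(?P<nxt>.{3,}).*\n(?P=nxt))(?:.*\n|.+\Z))*",
    re.MULTILINE,
)

text = (
    "a line we do not need\n"
    "I am section axax\n"
    "I am section bbb\n"
    "some paragraph\n"
    "\n"
    "Programming is hard\n"
    "Programming is easy\n"
    "How do you think?\n"
)

matches = list(SECTION_RE.finditer(text))
keys = [m.group("key") for m in matches]    # one KEY per section
sections = [m.group(0) for m in matches]    # full text of each section
print(keys)  # ['I am section ', 'Programming is ']
```

Text before the first pair of special lines is never matched, so it is dropped. The backreferences force a lot of backtracking on every line, so this is likely to be slow on large inputs.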

Even if I can get the first 2 lines, I think a regular expression is
expensive for my text.
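
A cheaper alternative to a regular expression could be a single pass over the lines, comparing each line with the next one. This is a rough sketch, not a definitive solution: the function name split_sections and the min_key_len threshold are hypothetical, and the threshold is an assumption needed to stop short accidental matches from counting as a KEY.

```python
from os.path import commonprefix


def split_sections(text, min_key_len=3):
    """Return (keys, sections) for text made of KEY-prefixed sections.

    Assumption: two consecutive lines start a section when they share a
    prefix of at least min_key_len characters.
    """
    lines = text.splitlines()
    keys, starts = [], []
    i = 0
    while i + 1 < len(lines):
        key = commonprefix([lines[i], lines[i + 1]])
        if len(key) >= min_key_len:
            keys.append(key)
            starts.append(i)
            i += 2  # the second special line cannot open another section
        else:
            i += 1
    # Lines before the first section start are dropped.
    bounds = starts + [len(lines)]
    sections = ["\n".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]
    return keys, sections


keys, sections = split_sections(
    "junk\nI am section axax\nI am section bbb\npara\n\n"
    "Programming is hard\nProgramming is easy\nmore\n"
)
print(keys)  # ['I am section ', 'Programming is ']
```

Each line is compared with its neighbor only once, so the cost grows linearly with the size of the text. The KEYs returned are character-level common prefixes.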

That is all. I hope to get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.....(and here goes many other text)...

I am using python
I am using perl
.....(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

The above text should be split into a LIST with 4 items, and I also
need to know that the KEYs for the LIST are ['I am section ', "let's continue",
'I am using ', 'Programming is ']:
lst=[
'''a line we do not need
I am section axax
I am section bbb
(and here goes many other text)... ''',

'''let's continue to
let's continue, yeah
.....(and here goes many other text)... ''',

'''I am using python
I am using perl
.....(and here goes many other text)... ''',

'''Programming is hard
Programming is easy
How do you thing?
I do’t know'''
]


