deleting texts between patterns

John Machin sjmachin at lexicon.net
Fri May 12 05:57:23 EDT 2006


On 12/05/2006 6:11 PM, Ravi Teja wrote:
> mickle... at hotmail.com wrote:
>> hi
>> say i have a text file
>>
>> line1
[snip]
>> line6
>> abc
>> line8 <---to be deleted
[snip]
>> line13 <---to be deleted
>> xyz
>> line15
[snip]
>> line18
>>
>> I wish to delete lines that are in between 'abc' and 'xyz' and print
>> the rest of the lines. Which is the best way to do it? Should I get
>> everything into a list, get the index of abc and xyz, then pop the
>> elements out? Or are there any other better methods?
>> thanks
> 
> In other words ...
> lines = open('test.txt').readlines()
> for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
>     lines.remove(line)

I don't think that's what you really meant.

 >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
 >>> for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
...     lines.remove(line)
...
 >>> lines
['abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']

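Uh-oh. remove() deletes the first item equal to its argument, wherever
it is, so the copies in front of 'abc\n' were the ones that went.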

Try this:

 >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
 >>> del lines[lines.index('abc\n') + 1:lines.index('xyz\n')]
 >>> lines
['blah', 'fubar', 'abc\n', 'xyz\n', 'xyzzy']
 >>>

Of course wrapping it in try/except would be a good idea, not for the
slicing, which behaves itself and does nothing if the 'abc\n' appears
AFTER the 'xyz\n', but for the index() calls, which raise ValueError if
the sought markers aren't there. It might even be a good idea to do it
carefully one piece at a time: is the abc there? is the xyz there? is
the xyz after the abc? If so, then del lines[index1+1:index2].
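
The piece-at-a-time check could be sketched roughly like this. It is
only an illustration in Python 3 syntax: the function name
delete_between and its return value are made up here, and it looks at
the first occurrence of each marker only.

```python
def delete_between(lines, start='abc\n', end='xyz\n'):
    """Delete, in place, the lines strictly between the first `start`
    line and the first `end` line. Return the number of lines deleted."""
    try:
        index1 = lines.index(start)   # is the abc there?
        index2 = lines.index(end)     # is the xyz there?
    except ValueError:
        return 0                      # a marker is missing: leave the list alone
    if index2 <= index1:
        return 0                      # the xyz is not after the abc
    deleted = index2 - index1 - 1
    del lines[index1 + 1:index2]
    return deleted


lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
print(delete_between(lines))  # 2
print(lines)                  # ['blah', 'fubar', 'abc\n', 'xyz\n', 'xyzzy']
```

Returning a count rather than raising lets the caller decide whether a
missing marker is an error or just "nothing to do".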

I wonder what the OP wants to happen in a case like this:

guff1 xyz guff2 abc guff2 xyz guff3
or this:
guff1 abc guff2 abc guff2 xyz guff3
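
For what it's worth, here is what the plain index()-and-del approach
does with those two inputs, treating each word as a line. This is a
quick sketch in Python 3 syntax; strip_block is a made-up helper name,
and whether either result is what the OP wants is still an open
question.

```python
def strip_block(words):
    """Apply the plain index()-and-del approach to a copy of `words`."""
    result = list(words)
    del result[result.index('abc') + 1:result.index('xyz')]
    return result


case1 = 'guff1 xyz guff2 abc guff2 xyz guff3'.split()
case2 = 'guff1 abc guff2 abc guff2 xyz guff3'.split()

# First xyz comes before the first abc: the slice is empty, nothing goes.
print(strip_block(case1))
# ['guff1', 'xyz', 'guff2', 'abc', 'guff2', 'xyz', 'guff3']

# Deletion runs from the FIRST abc, so the second abc is deleted too.
print(strip_block(case2))
# ['guff1', 'abc', 'xyz', 'guff3']
```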

> for line in lines:
>     print line,
> 
> Regular expressions are better in this case

Famous last words.

> import re
> pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
> print re.sub(pat, '', open('test.txt').read())
> 

I don't think you really meant that either.

 >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
 >>> linestr = "".join(lines)
 >>> linestr
'blahfubarabc\nblahfubarxyz\nxyzzy'
 >>> import re
 >>> pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
 >>> print(re.sub(pat, '', linestr))
blahfubarxyzzy
 >>>

Uh-oh. The pattern matches the markers as well, so they get deleted
along with the text between them.

Try this:

 >>> pat = re.compile('(?<=abc\n).*?(?=xyz\n)', re.DOTALL)
 >>> re.sub(pat, '', linestr)
'blahfubarabc\nxyz\nxyzzy'

... and I can't imagine why you're using the confusing [IMHO] 
undocumented [AFAICT] feature that the first arg of the module-level 
functions like sub and friends can be a compiled regular expression 
object. Why not use this:

 >>> pat.sub('', linestr)
'blahfubarabc\nxyz\nxyzzy'
 >>>

One-liner fanboys might prefer this:

 >>> re.sub('(?s)(?<=abc\n).*?(?=xyz\n)', '', linestr)
'blahfubarabc\nxyz\nxyzzy'
 >>>
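
DOTALL (or its inline spelling, (?s)) starts to matter as soon as more
than one line sits between the markers, because without it '.' stops at
a newline. A small sketch in Python 3 syntax, using made-up sample text
shaped like the OP's file:

```python
import re

# Hypothetical sample: two lines to delete between the markers.
text = 'line6\nabc\nline8\nline9\nxyz\nline15\n'

with_dotall = re.compile(r'(?<=abc\n).*?(?=xyz\n)', re.DOTALL)
without_dotall = re.compile(r'(?<=abc\n).*?(?=xyz\n)')

# With DOTALL, '.' also matches '\n', so the whole block is removed.
print(repr(with_dotall.sub('', text)))
# 'line6\nabc\nxyz\nline15\n'

# Without it, '.' cannot cross the newline after line8, so nothing matches.
print(repr(without_dotall.sub('', text)))
# 'line6\nabc\nline8\nline9\nxyz\nline15\n'
```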

HTH,
John


