delete from pattern to pattern if it contains match

Fri Apr 22 07:10:46 EDT 2016

Peter Otten writes:

> harirammanohar at gmail.com wrote:
>
>> On Thursday, April 21, 2016 at 7:03:00 PM UTC+5:30, Jussi Piitulainen
>> wrote:
>>> harirammanohar at gmail.com writes:
>>> 
>>> > On Monday, April 18, 2016 at 12:38:03 PM UTC+5:30,
>>> > hariram... at gmail.com wrote:
>>> >> HI All,
>>> >> 
>>> >> can you help me out in doing below.
>>> >> 
>>> >> file:
>>> >> <start>
>>> >>  guava
>>> >> fruit
>>> >> <end>
>>> >> <start>
>>> >>  mango
>>> >> fruit
>>> >> <end>
>>> >> <start>
>>> >>  orange
>>> >> fruit
>>> >> <end>
>>> >> 
>>> >> need to delete from start to end if it contains mango in a file...
>>> >> 
>>> >> output should be:
>>> >> 
>>> >> <start>
>>> >>  guava
>>> >> fruit
>>> >> <end>
>>> >> <start>
>>> >>  orange
>>> >> fruit
>>> >> <end>
>>> >> 
>>> >> Thank you
>>> >
>>> > any one can guide me ? why xml tree parsing is not working if i have
>>> > root.tag and root.attrib as mentioned in earlier post...
>>> 
>>> Assuming the real data consists of lines between a start marker and end
>>> marker, a winning plan is to collect a group of lines, deal with it, and
>>> move on.
>>> 
>>> The following code implements something close to the plan. You need to
>>> adapt it a bit to have your own source of lines and to restore the end
>>> marker in the output and to account for your real use case and for
>>> differences in taste and judgment. - The plan is as described above, but
>>> there are many ways to implement it.
>>> 
>>> from io import StringIO
>>> 
>>> text = '''\
>>> <start>
>>>   guava
>>> fruit
>>> <end>
>>> <start>
>>>   mango
>>> fruit
>>> <end>
>>> <start>
>>>   orange
>>> fruit
>>> <end>
>>> '''
>>> 
>>> def records(source):
>>>     current = []
>>>     for line in source:
>>>         if line.startswith('<end>'):
>>>             yield current
>>>             current = []
>>>         else:
>>>             current.append(line)
>>> 
>>> def hasmango(record):
>>>     return any('mango' in it for it in record)
>>> 
>>> for record in records(StringIO(text)):
>>>     hasmango(record) or print(*record)
>> 
>> Hi,
>> 
>> not working....this is the output i am getting...
>> 
>> \
>
> This means that the line
>
>>> text = '''\
>
> has trailing whitespace in your copy of the script.

That's a nuisance. I wish otherwise undefined escape sequences in
strings raised an error, similar to a stray space after a line
continuation character.
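
To see what the stray space does, here is a small sketch that runs two
versions of the assignment from source strings: one with the backslash
directly before the newline, one with a space in between. The names
evaluate, clean_source and stray_source are made up for illustration,
and the warning filter is only an assumption to keep the run quiet on
Python versions that warn about unrecognized escapes.

```python
import warnings

# Two nearly identical assignments, held as source text. In the first the
# backslash is the last character on its line; in the second a space follows.
clean_source = "text = '''\\\n<start>\n'''"
stray_source = "text = '''\\ \n<start>\n'''"

def evaluate(source):
    """Run the assignment and return the resulting value of text."""
    namespace = {}
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence any invalid-escape warning
        exec(source, namespace)
    return namespace["text"]

clean = evaluate(clean_source)
stray = evaluate(stray_source)

print(repr(clean))  # the backslash-newline pair is dropped from the string
print(repr(stray))  # backslash and space survive as a first line of their own
```

In the second case the string silently starts with a line containing
just the backslash and the space, which is the "\" line in the output.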

>>  <start>
>>    guava
>>  fruit
>> 
>> <start>
>>    orange
>>  fruit
>
> Jussi forgot to add the "<end>..." line to the group.

I didn't forget. I meant what I said when I said the OP needs to adapt
the code to (among other things) restore the end marker in the output.
If they can't be bothered to do anything at all, it's their problem.

It was already known that this is not the actual format of the data.

> To fix this change the generator to
>
> def records(source):
>     current = []
>     for line in source:
>         current.append(line)
>         if line.startswith('<end>'):
>             yield current
>             current = []

Oops, I notice that I forgot to start a new record only on encountering
a '<start>' line. That should probably be done, unless the format is
intended to be exactly a sequence of "<start>\n- -\n<end>\n".
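
A minimal sketch of that variant, assuming that lines outside any
<start>...<end> pair should be ignored and that an unfinished record is
dropped when a new '<start>' appears. Both are guesses on my part, since
the real format is unknown. This version keeps both markers in the
record, as in the quoted fix; leaving them out works equally well.

```python
def records(source):
    """Yield each group of lines from a '<start>' line through its '<end>' line."""
    current = None  # None means we are between records
    for line in source:
        if line.startswith('<start>'):
            current = [line]  # begin a fresh record, dropping any unfinished one
        elif current is not None:
            current.append(line)
            if line.startswith('<end>'):
                yield current
                current = None
        # any other line outside a record is skipped

sample = ['junk\n', '<start>\n', '  guava\n', 'fruit\n', '<end>\n',
          'stray\n', '<start>\n', '  mango\n', '<end>\n']
groups = list(records(sample))
for group in groups:
    print(group)
```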

>>>     hasmango(record) or print(*record)
>
> The
>
> print(*record)
>
> inserts spaces between record entries (i. e. at the beginning of all
> lines except the first) and adds a trailing newline.

Yes, I forgot about the space. Sorry about that.
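
For the record, a small sketch of the difference. The helper printed is
made up for this illustration; it only captures what print would write
so that the two results can be compared with repr.

```python
import io

record = ['<start>\n', '  guava\n', 'fruit\n']

def printed(*args, **kwargs):
    """Return what print(*args, **kwargs) would write, as a string."""
    buffer = io.StringIO()
    print(*args, file=buffer, **kwargs)
    return buffer.getvalue()

# Default sep=" " and end="\n": a space before every line but the first,
# and an extra newline after the last one.
default_output = printed(*record)

# Explicit empty delimiters: the lines come out exactly as they went in.
explicit_output = printed(*record, sep="", end="")

print(repr(default_output))
print(repr(explicit_output))
```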

The final newline was intentional. Perhaps I should have added the end
marker there instead (given my preference to not drag it together with
the data lines), like so:

   print(*record, sep = "", end = "<end>\n")

Or so:

   print(*record, sep = "")
   print("<end>")

Or so:

   for line in record:
       print(line.rstrip("\n"))
   else:
       print("<end>")

Or:

   for line in record:
       print(line.rstrip("\n"))
   else:
       if record and not record[-1].strip() == "<end>":
           print("<end>")

But all this is beside the point that to deal with the stated problem
one might want to obtain access to a whole record *first*, then check if
it contains "mango" in the intended way (details missing but at least
"mango\n" as a full line counts as an occurrence), and only *then* print
the whole record (if it doesn't contain "mango").
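
Put together, the plan might look like the sketch below. It assumes
that the end marker stays in the record and that only a line equal to
"mango" after stripping whitespace counts as an occurrence; the real
criterion may well be different. The name without_mango is just for
illustration.

```python
from io import StringIO

text = '''<start>
  guava
fruit
<end>
<start>
  mango
fruit
<end>
<start>
  orange
fruit
<end>
'''

def records(source):
    # Collect lines up to and including each '<end>' line.
    current = []
    for line in source:
        current.append(line)
        if line.startswith('<end>'):
            yield current
            current = []

def hasmango(record):
    # Assumption: only a line that is exactly 'mango', ignoring
    # surrounding whitespace, counts as an occurrence.
    return any(line.strip() == 'mango' for line in record)

def without_mango(source):
    # Decide about each record only after it has been read in full.
    for record in records(source):
        if not hasmango(record):
            yield from record

result = ''.join(without_mango(StringIO(text)))
print(result, end='')
```

An open file can be passed in place of the StringIO object, since
iterating over a text file also produces one line at a time.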

I can think of two other ways - one if the data can be accessed only
once - but they seem more complicated to me. Hm, well, if it's XML, as
stated in another branch of this thread and contrary to the form of the
example data in this branch, there's a third way that may be good, but
here I'm responding to a line-oriented format.

> You can avoid this by specifying the delimiters explicitly:
>
> if not hasmango(record):
>     print(*record, sep="", end="")
>
> Even with these changes the code still looks somewhat brittle...

That depends on the actual data format, and on what really is intended
to trigger the filter. This approach is a complete waste of effort if
there are no guarantees of things being there on their own lines, for
example.

Ok, that "\ " not only looks brittle but actually is brittle. The one
time I used that slash, I now regret doing so. Here's a fixed version.
(Not sure of the significance of the number of spaces that start the
first data line. They seem to have doubled along the way.)

text = '''<start>
  guava
fruit
<end>
<start>
  mango
fruit
<end>
<start>
  orange
fruit
<end>
'''