Deleting a block of text

Alex Martelli alex at magenta.com
Tue Aug 22 11:16:11 EDT 2000


"Salman Sheikh" <ssheikh at writeme.com> wrote in message
news:383017927.966953168714.JavaMail.root at web142-mc.mail.com...
> I have a text file in which I am trying to delete a block of text, from one
> text word all the way through another text word (many lines of text with in
> between).
>
> How would I do that?
> Do I have to create a new file, write output from the input read. Search for
> the first marker and then stop writing until after detecting the end marker,
> and then start writing again to the output file again?

This approach is quite good in that it "scales" -- it will let you handle
HUGE files.  However, if the file is small enough to fit in memory
"decently", you can make your life a great deal simpler, e.g.:

import re

thetext = open('thefile.txt', 'r').read()
# re.DOTALL lets '.' span newlines; the non-greedy '.*?' stops at the
# first end marker instead of running on to the last one in the file
thetext = re.sub(r'\bfirst\b.*?\blast\b', '', thetext, count=1,
                 flags=re.DOTALL)
open('thefile.txt', 'w').write(thetext)

The \b in the regular expression mark word boundaries, to
avoid false matches, since you did say 'from one text word' to
another (so you don't want to match 'firstly' or 'blast').
Note that '.' does not match a newline by default, so for a
block spanning many lines the pattern needs the re.DOTALL flag,
and a non-greedy '.*?' keeps the match from running to the last
occurrence of the end word.  The fourth argument (the count,
here 1) tells re.sub you only want to substitute one occurrence
of the regular expression (by default, all non-overlapping
occurrences are substituted).
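A quick check of that word-boundary behaviour, with made-up marker
words 'first' and 'last' on a one-line sample text:

```python
import re

# \b keeps 'firstly' and 'blast' from matching as markers
text = "keep firstly first DELETE ME last blast keep"
result = re.sub(r'\bfirst\b.*?\blast\b', '', text, count=1)
print(result)  # -> "keep firstly  blast keep"
```

Note how 'firstly' is skipped (no word boundary after 'first') and
the match ends at the standalone 'last', not inside 'blast'.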

This will only work for files 'decently' smaller than your
memory, since you need all the file's text to reside in
memory (in 2 copies at one instant) plus extra memory for
the regular expression engine to work in.  E.g., with a
typical good PC of today, say 128 MB of RAM, I would not
take this approach if I feared thefile.txt could be larger
than a few tens of megabytes.

But, often, the files one processes _are_ smaller than that,
and in this case it's nice to be able to express the needed
editing as simply as this.


> Is this a good approach? Does anyone have any suggestions or know of any
> functions to simplify this?

If your file does not fit in memory, or if it fits too
snugly, then you may have no real alternative to doing
things sequentially.  For speed, you will probably want
to work in large chunks anyway -- several megabytes at
a time.  In the initial state, you'll look for the
start-word in the chunk (e.g. with re, to avoid false
matches thanks to \b); if not found, dump the lot to
the output file.  Be sure to keep an overlap between two
adjacent chunks, lest a trigger word be missed if it
falls at a chunk boundary, of course!

If/when you do find the trigger start-word, you will
also look for the trigger end-word; if both are found,
then just avoid outputting that slice of the buffer,
and move to the final state in which everything that
is read is also output (still in chunks of several
megabytes at a time).  If the start trigger is found,
but not the end trigger, then you have no alternative
to moving to an intermediate state in which the end
trigger is what you're looking for.
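Here is a rough sketch of that state machine.  The marker words
('first'/'last'), the chunk size, and the function name are all made
up for illustration; the OVERLAP trick defers acting on a marker that
sits right at a chunk edge, where the apparent trailing word boundary
might vanish once more data arrives:

```python
import re

START = re.compile(r'\bfirst\b')   # made-up start marker
END = re.compile(r'\blast\b')      # made-up end marker
CHUNK = 4 * 1024 * 1024            # several megabytes at a time
OVERLAP = 64                       # comfortably longer than either marker

def delete_block(src_name, dst_name):
    state = 'copy'                 # 'copy' -> 'skip' -> 'done'
    buf = ''
    with open(src_name) as src, open(dst_name, 'w') as dst:
        eof = False
        while not eof:
            data = src.read(CHUNK)
            eof = not data
            buf += data
            while True:
                if state == 'copy':
                    m = START.search(buf)
                    if m and (eof or m.end() <= len(buf) - OVERLAP):
                        # definite start marker: copy what precedes it,
                        # drop the marker itself, start skipping
                        dst.write(buf[:m.start()])
                        buf = buf[m.end():]
                        state = 'skip'
                        continue
                    # no certain match: copy all but an overlap tail,
                    # in case a marker straddles the chunk boundary
                    keep = m.start() if m else max(len(buf) - OVERLAP, 0)
                    dst.write(buf[:keep])
                    buf = buf[keep:]
                    break
                if state == 'skip':
                    m = END.search(buf)
                    if m and (eof or m.end() <= len(buf) - OVERLAP):
                        buf = buf[m.end():]
                        state = 'done'
                        continue
                    buf = buf[max(len(buf) - OVERLAP, 0):]
                    break
                # state == 'done': everything else goes straight out
                dst.write(buf)
                buf = ''
                break
        if state != 'skip':
            # flush the final tail; an end marker that never shows up
            # leaves state 'skip', i.e. we delete through to EOF
            dst.write(buf)
```

Memory use stays bounded at roughly one chunk, however big the file.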

If you do need to process textfiles of many tens of
megabytes or larger, I don't think there is any real
alternative to this approach (save, maybe, buying more
RAM, if it's only many tens of megabytes and not many
_hundreds_:-).  For tasks of such magnitude, unless
they're one-off, it may be worth making a program more
complex if it buys you performance...


For processing typical small input texts, a few tens
of megabytes at most, see above.  Use the simplest
approach you can get away with, of course:-).


Alex
