[Newbie] Search-and-delete text processing problem...

Bengt Richter bokr at oz.net
Fri Apr 1 22:49:45 EST 2005


On Fri, 1 Apr 2005 17:33:59 -0800, "Todd_Calhoun" <anon at anon.com> wrote:

>I'm trying to learn about text processing in Python, and I'm trying to 
>tackle what should be a simple task.
>
>I have long text files of books with a citation between each paragraph,
Most text files aren't long enough to worry about, but you can avoid
reading in the whole file by just iterating, one line at a time. That is
the way a file object iterates by default, so there's not much to that.
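
A minimal sketch of that line-at-a-time iteration, in current Python 3. The io.StringIO object is only a stand-in (an assumption for the demo) for a real file returned by open("test_f.txt"); both iterate the same way.

```python
import io

# Stand-in for open("test_f.txt"); a real file object iterates identically.
book = io.StringIO(
    "First paragraph.\n"
    "Bill D. Smith, History through the Ages, p.5\n"
    "Second paragraph.\n"
)

line_count = 0
longest = ""
for line in book:              # yields one line per pass, newline included
    line_count += 1
    if len(line) > len(longest):
        longest = line         # only the current and longest lines are kept
```

Nothing here ever builds a list of all lines, which is the point of iterating instead of calling readlines().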

>which might be like "Bill D. Smith, History through the Ages, p.5".
>
>So, I need to search for every line that starts with a certain string (in 
>this example, "Bill D. Smith"), and delete the whole line.
If you want to test what a string starts with, there's a string method for that.
E.g., if line is the string representing one line, line.startswith('Bill') would
return True or False.
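
A few quick checks of how startswith behaves, as a sketch in Python 3. The sample line is the citation from the question; the variable names are just for illustration.

```python
line = "Bill D. Smith, History through the Ages, p.5\n"

exact_match = line.startswith("Bill D. Smith")        # prefix matches
wrong_case = line.startswith("bill")                  # comparison is case sensitive
folded_case = line.lower().startswith("bill")         # lower-case the line first
indented = "  Bill D. Smith".startswith("Bill")       # leading spaces defeat the match
after_strip = "  Bill D. Smith".lstrip().startswith("Bill")
```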
>
>I've tried a couple of different things, but none seem to work.  Here's my 
>latest try.  I apologize in advance for being so clueless.
>
>##########################
>#Text search and delete line tool
>
>theInFile = open("test_f.txt", "r")
>theOutFile = open("test_f_out.txt", "w")
>
>allLines = theInFile.readlines()
This will create a list of lines, all (except perhaps the last, if
the file had no end-of-line character(s) at the very end) with '\n'
as the last character. There are various ways to strip the line ends,
but your use case doesn't appear to require it.
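
For illustration, a small sketch (Python 3) of what readlines() returns. io.StringIO stands in for the opened file, and the sample text deliberately has no newline at the very end.

```python
import io

# Stand-in for theInFile; note the last line has no trailing "\n".
theInFile = io.StringIO("First paragraph.\nBill D. Smith, p.5\nLast line")

allLines = theInFile.readlines()
# Every element keeps its "\n" except the final one.

# One way to strip the line ends, if you ever need to:
stripped = [line.rstrip("\n") for line in allLines]
```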

>
>for line in allLines:
     # line at this point contains each line successively as the loop proceeds,
     # but you don't know where in the sequence you are unless you provide for it,
     # e.g. by using
 for i, line in enumerate(allLines):
>    if line[3] == 'Bill':
The above line is comparing the 4th character of the line (indexing from 0) with 'Bill'
which is never going to be true, and will raise an IndexError if the line is shorter than
4 characters. Not what you want to do.
     if line.startswith('Bill'):  # note that this is case sensitive. Otherwise use line.lower().startswith('bill')
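
To make the difference between indexing and slicing concrete, here is a short sketch in Python 3. The names fourth_char, first_four and short_line are made up for the example.

```python
line = "Bill D. Smith, History through the Ages, p.5\n"

fourth_char = line[3]       # indexing gives ONE character
first_four = line[:4]       # slicing gives the first four characters

short_line = "\n"           # e.g. an empty line in the file
safe_slice = short_line[:4] # slicing a short string never raises

try:
    short_line[3]           # indexing past the end does raise
    raised = False
except IndexError:
    raised = True
```

So line[:4] == 'Bill' would have worked, but startswith says the same thing without having to count characters.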

>        line == ' '
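         Note that == is a comparison, not an assignment, so this statement has no effect. Even line = ' '
         would only rebind the name line and leave allLines unchanged. The index from enumerate lets you
         replace the list element itself, but I doubt you want an invisible space without a line ending
         in place of 'Bill ... \n'
         allLines[i] = '\n'  # makes an actual blank line, but you want to delete it, so this is not going to work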
>

>
>theOutFile.writelines(allLines)

UIAM (untested) you should be able to do the entire job removing lines that start with 'Bill' thus:

 theInFile = open("test_f.txt", "r")
 theOutFile = open("test_f_out.txt", "w")
 theOutFile.writelines(line for line in theInFile if not line.startswith('Bill'))

Or just the line

 open("test_f_out.txt", "w").writelines(L for L in open("test_f.txt") if not L.startswith('Bill'))
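
The versions above leave closing the files to the interpreter. In current Python 3 the same filter can be wrapped in with blocks so both files are closed explicitly. This is only a sketch: the function name drop_lines and the temporary-directory demo are assumptions for illustration, not part of the original question.

```python
import os
import tempfile

def drop_lines(in_path, out_path, prefix):
    """Copy in_path to out_path, skipping lines that start with prefix."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(line for line in src if not line.startswith(prefix))

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "test_f.txt")
    out_path = os.path.join(tmp, "test_f_out.txt")
    with open(in_path, "w") as f:
        f.write("Para one.\n"
                "Bill D. Smith, History through the Ages, p.5\n"
                "Para two.\n")
    drop_lines(in_path, out_path, "Bill")
    with open(out_path) as f:
        result = f.read()
```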

If you need to remove lines starting with any name in a certain list, you can do that too, e.g.,

 delStarts = ['Bill', 'Bob', 'Sue']
 theInFile = open("test_f.txt", "r")
 theOutFile = open("test_f_out.txt", "w")
 for line in theInFile:
     for name in delStarts:
         if line.startswith(name): break
     else: # will happen if there is NO break, so line does not start with any delStarts name
         theOutFile.write(line) # write line out if not starting badly
 
(You could do that with a oneliner too, but it gets silly ;-)
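
For the curious, two compact forms of the same test, sketched in Python 3. The first uses any(); the second relies on startswith accepting a tuple of prefixes, which arrived in Python 2.5, after this thread. The sample lines other than the "Bill D. Smith" citation are invented for the demo.

```python
delStarts = ['Bill', 'Bob', 'Sue']

# Invented sample input; a real run would iterate over the file instead.
lines = [
    "Chapter one text.\n",
    "Bill D. Smith, History through the Ages, p.5\n",
    "Bob Jones, Another Book, p.9\n",
    "More text.\n",
    "Sue Lee, A Third Book, p.12\n",
]

# any() stops at the first matching name, like the break in the loop version.
kept_any = [line for line in lines
            if not any(line.startswith(name) for name in delStarts)]

# startswith also takes a tuple and checks each prefix in turn.
kept_tuple = [line for line in lines
              if not line.startswith(tuple(delStarts))]
```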

If you have a LOT of names to check for, it could pay you to figure out a way to split off the name
from the front of each line, and check whether that is in a set instead of using a delStarts list.
If you do use delStarts, put the most common names at the front.
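
One possible way to do the set lookup, as a sketch in Python 3. It ASSUMES every citation looks like "Name, Title, p.N", so the name is whatever precedes the first comma; the set del_names, the helper is_citation, and all names except "Bill D. Smith" are hypothetical.

```python
# Hypothetical full author names, exactly as they appear before the first comma.
del_names = {"Bill D. Smith", "Bob Jones", "Sue Lee"}

def is_citation(line):
    """True if the text before the first comma is a known author name."""
    name = line.partition(",")[0].strip()   # whole line if there is no comma
    return name in del_names                # set membership: one hash lookup

lines = [
    "Bill D. Smith, History through the Ages, p.5\n",
    "Billiards were popular, it seems.\n",
    "No comma here\n",
]
kept = [line for line in lines if not is_citation(line)]
```

Unlike the prefix test, this matches whole names only, so ordinary text that merely begins with "Bill" is left alone.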

>#########################
>
>I know I could do it in Word fairly easily, but I'd like to learn the Python 
>way to do things.
Have fun.
>
>Thanks for any advice. 
>
HTH (nothing tested, sorry ;-)

Regards,
Bengt Richter


