[Tutor] Re: What is the best way to count the number of lines in a huge file?

Christopher Smith csmith@blakeschool.org
Sat, 08 Sep 2001 12:21:38 -0500


I just got chewed out by Ignacio for writing 

more=f.read(stuff)
while more:
	process(more)
	more=f.read(stuff)

and it was suggested that I write

# suggestion 1
more=None
while not more=='':
	process(more)
	more=f.read(stuff)

I consider myself wounded by a friend :-)  How to handle
this construct in the "one right way" has bothered me for a while,
and it bothered me again as I struggled with what to send in this
morning, Ignacio.  I thought of two other approaches:

# suggestion 2
while 1:
	more=f.read(stuff)
	if more=='':
		break
	process(more)

# suggestion 2b
while 1:
	more=f.read(stuff)
	if more!='':
		process(more)
	else:
		break

and 

# suggestion 3
more='go'
while more!='':
	more=f.read(stuff)
	if more!='':
		process(more)
	
	
I now consider #2 to be the best.  In #1 and #3 you set a flag that
must be something not equal to the terminating value; in both cases
you can clearly see that the loop will start, but there is a chance
of making a mistake in the initialization.  In addition, #3 is a bit
repugnant in that the same test appears twice (and, given what got me
chewed out in the first place, I think this double test is prone to
error).

I prefer proposal #2 (and wouldn't mind comments on it).  Here's what
it has going for it:

	-it's obvious the loop will start
	-it's soon obvious what will stop the loop
	-the stop condition and the request for more data to process
	 each occur only once in the loop
	-it's better than 2b b/c in 2b the "else" branch is too far
	 away from its "if" (if the process code is long)
	-it's better not to use an "else" part, to reduce the amount
	 of indentation that must be done.
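
For anyone who wants to try proposal #2 as it stands, here is a minimal runnable sketch.  The names process_in_chunks, size and pieces are made up for illustration (they stand in for "stuff" and "process" above), and io.StringIO is used only so the example needs no file on disk.

```python
import io

def process_in_chunks(f, size, process):
    """Read f in pieces of at most `size` characters and hand each to process."""
    while True:
        more = f.read(size)
        if more == '':      # read() returns '' once there is nothing left
            break
        process(more)

# Collect the pieces in a list so we can see what the loop handed out.
pieces = []
process_in_chunks(io.StringIO('abcdefg'), 3, pieces.append)
print(pieces)   # ['abc', 'def', 'g']
```

The read and the stop test each appear exactly once, and the processing code sits at a single level of indentation.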

So...here is the updated lineCount incorporating this change and the
suggestion that whether or not to count the trailing line is specified
as an input option rather than being returned as count information.

Thanks for the constructive criticism :-)

/c

####

def lineCount(f,sep,countTrailer=1):
	'''Return a count of the number of occurrences of sep in f.
	By default this routine assumes that sep indicates the end of a line
	and that if the last line doesn't end in sep it should be counted anyway.
	To get a strict count of the occurrences of sep send a 0 for the 3rd
	argument.'''
	#
	# Notes:
	#  whatever is passed in as f must have a 'read' method
	#   that returns '' when there is no more to read.
	#  make chunksize as large as you can on your system
	#
	chunksize=262144
	sepl=len(sep)
	last=''
	count=0
	while 1:
		more=f.read(chunksize)
		if more=='':
			break
		chunk=last+more
		count+=chunk.count(sep)
		last=chunk[-sepl:] #there might be a split sep in here
		if last==sep:      #nope, just a whole one that we already counted
			last=''
	
	if last!='' and countTrailer==1:
		count+=1

	return count
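
To check the handling of a separator split across two reads, here is a sketch of the same logic with the chunk size exposed as a parameter.  That parameter, the name line_count, and the ';;' separator are assumptions added for the demonstration and are not part of the routine above; a tiny chunk size forces the split to happen.

```python
import io

def line_count(f, sep, count_trailer=True, chunksize=262144):
    """Count occurrences of sep in f, reading chunksize characters at a time."""
    sepl = len(sep)
    last = ''
    count = 0
    while True:
        more = f.read(chunksize)
        if more == '':
            break
        chunk = last + more
        count += chunk.count(sep)
        last = chunk[-sepl:]    # may hold the front part of a split sep
        if last == sep:         # a whole sep that was already counted
            last = ''
    if last != '' and count_trailer:
        count += 1              # final line had no trailing sep
    return count

text = 'one;;two;;three'
print(line_count(io.StringIO(text), ';;'))                       # 3
print(line_count(io.StringIO(text), ';;', count_trailer=False))  # 2
print(line_count(io.StringIO(text), ';;', chunksize=4))          # 3
```

With chunksize=4 the first read ends in the middle of a ';;', and the count still comes out the same as when the whole text is read in one go.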