number of different lines in a file
Larry Bates
larry.bates at websafe.com
Thu May 18 18:10:06 EDT 2006
r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
>     f = file(fn)
>     x = f.readline().strip()
>     L = []
>     while x<>'':
>         if x not in L:
>             L = L + [x]
>         x = f.readline().strip()
>     return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?
Sounds like homework, but I'll bite.
def number_distinct(fn):
    hash_dict={}
    total_lines=0
    for line in open(fn, 'r'):
        total_lines+=1
        key=hash(line.strip())
        if hash_dict.has_key(key): continue
        hash_dict[key]=1

    return total_lines, len(hash_dict.keys())

if __name__=="__main__":
    fn='c:\\test.txt'
    total_lines, distinct_lines=number_distinct(fn)
    print "Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines)
-Larry Bates
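
For comparison, here is a minimal sketch of the same idea using a set of the stripped lines themselves rather than a dict keyed on their hash values. It is written for Python 3, which is an assumption (the code in this thread is Python 2), and the name count_distinct_lines is made up for illustration. Membership tests on a set take roughly constant time, whereas "x not in L" on a list scans the whole list, which is the likely reason the quoted version never finishes on a million lines.

```python
def count_distinct_lines(path):
    """Return (total_lines, distinct_lines) for the text file at path.

    Lines are compared after stripping leading and trailing whitespace,
    as in the code earlier in this thread.
    """
    seen = set()   # stripped line contents seen so far
    total = 0
    with open(path, "r") as handle:
        for line in handle:
            total += 1
            # Adding a string that is already in the set is a no-op.
            seen.add(line.strip())
    return total, len(seen)
```

Calling count_distinct_lines('c:\\test.txt') would return the same kind of (total, distinct) pair as number_distinct. The trade-off is memory: the set holds every distinct line, while storing only hash values is smaller but would count two different lines as one if their hashes happened to collide.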