number of different lines in a file

Larry Bates larry.bates at websafe.com
Thu May 18 18:10:06 EDT 2006


r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
> 
> On my PC, this little program just goes to never-never land:
> 
> def number_distinct(fn):
>     f = file(fn)
>     x = f.readline().strip()
>     L = []
>     while x<>'':
>         if x not in L:
>             L = L + [x]
>         x = f.readline().strip()
>     return len(L) 
> 
> Would anyone care to point out improvements? 
> Is there a better algorithm for doing this?

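Sounds like homework, but I'll bite. Your version is slow because
"x not in L" scans the whole list and "L = L + [x]" copies it for
every new line, so the running time grows quadratically with the
number of distinct lines. A hash-based lookup takes roughly constant
time per line: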

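def number_distinct(fn):
    # Keep only the hash of each stripped line to hold memory use down.
    # Caveat: two different lines can share a hash value, so the
    # distinct count may come out slightly low. Store the stripped
    # lines themselves if you need an exact count.
    seen_hashes = set()
    total_lines = 0
    for line in open(fn, 'r'):
        total_lines += 1
        seen_hashes.add(hash(line.strip()))

    return total_lines, len(seen_hashes)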

if __name__=="__main__":
    fn='c:\\test.txt'
    total_lines, distinct_lines=number_distinct(fn)
    print("Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines))
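
If an exact count matters more than memory, one option is to keep the
stripped lines themselves in a set instead of their hashes. The sketch
below is written for Python 3 and is only an illustration: the name
count_lines is my own, and it assumes the distinct lines fit in memory
(for a million 100-character lines that is on the order of a hundred
megabytes plus per-object overhead).

```python
def count_lines(path):
    """Return (total_lines, distinct_lines) for the text file at `path`."""
    seen = set()   # holds each distinct stripped line exactly once
    total = 0
    with open(path, "r") as f:
        for line in f:              # reads the file one line at a time
            total += 1
            seen.add(line.strip())  # set membership is O(1) on average
    return total, len(seen)
```

As in the version above, strip() means that lines differing only in
leading or trailing whitespace are counted as the same line.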


-Larry Bates


