number of different lines in a file

Bill Pursell bill.pursell at gmail.com
Thu May 18 18:27:17 EDT 2006


r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
>     f = file(fn)
>     x = f.readline().strip()
>     L = []
>     while x<>'':
>         if x not in L:
>             L = L + [x]
>         x = f.readline().strip()
>     return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

Have you tried
cat file | sort | uniq | wc -l ?
sort might choke on the large file, and this isn't Python, but it
might work.  You might also try breaking the file into
smaller pieces, maybe based on the first character, and then
processing them separately.  The time killer is probably
the "x not in L" line: L keeps growing, and each test scans the
whole list.  Subdividing the problem up front keeps each list
shorter, so every lookup has less to search.
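
For what it's worth, here is a rough sketch of the split-by-first-character
idea in Python.  It is only an illustration, with a few assumptions of my
own: it groups the lines in memory rather than in separate files, it keeps
each group in a set instead of a list (a swap on my part, so that the
membership test no longer scans everything), it strips only the trailing
newline, and it counts an empty line as a line instead of stopping there.

```python
from collections import defaultdict


def number_distinct(path):
    """Count distinct lines in a text file, grouped by first character."""
    # Maps a first character to the set of distinct lines starting with it.
    buckets = defaultdict(set)
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            # line[:1] is "" for an empty line, so that case gets its own group.
            buckets[line[:1]].add(line)
    # Groups cannot overlap, so the total is just the sum of their sizes.
    return sum(len(group) for group in buckets.values())
```

If memory is not a problem, a single set holding every line gives the same
count; the grouping only matters if you want to handle the pieces one at a
time.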



