number of different lines in a file

r.e.s. r.s at ZZmindspring.com
Thu May 18 20:04:30 EDT 2006


"Tim Chase" <python.list at tim.thechases.com> wrote ...
> 2)  use a python set:
> 
> s = set()
> for line in open("file.in"):
>     s.add(line.strip())
> return len(s)
> 
> 3)  compact #2:
> 
> return len(set([line.strip() for line in open("file.in")]))
> 
> or, if stripping the lines isn't a concern, it can just be
> 
> return len(set(open("file.in")))
> 
> The logic in the set keeps track of ensuring that no 
> duplicates get entered.
> 
> Depending on how many results you *expect*, this could 
> become cumbersome, as you have to have every unique line in 
> memory.  A stream-oriented solution can be kinder on system 
> resources, but would require that the input be sorted first.
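
The stream-oriented approach mentioned above might look roughly like the
following. This is only a sketch: the name count_distinct_sorted and the
sample data are made up for illustration, and it assumes the lines were
already normalized (e.g. stripped) and sorted beforehand, for instance by
an external sort.

```python
import itertools

def count_distinct_sorted(lines):
    """Count distinct lines in an iterable that is ALREADY sorted.

    Assumption: equal lines are adjacent (the input is sorted), so
    only one run of equal lines is looked at at a time and memory
    use does not grow with the number of distinct lines.
    """
    # groupby merges runs of equal adjacent items; on sorted input
    # that yields exactly one group per distinct line.
    return sum(1 for _key, _group in itertools.groupby(lines))

# Hypothetical sample data standing in for a pre-sorted file.
sample = sorted(["pear\n", "apple\n", "pear\n", "fig\n", "apple\n"])
print(count_distinct_sorted(sample))
```

On unsorted input this would overcount, since duplicates that are not
adjacent end up in separate groups; that is the trade-off against set().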

Thank you (and all the others who responded!) -- set() does 
the trick, reducing the job to about a minute.  I may play
later with the other alternatives people mentioned (dict(), 
hash(),...), just out of curiosity.  I take your point about
the "expected number", which in my case was around 0-10 (as
it turned out, there were no dups).   

BTW, the first thing I tried was Fredrik Lundh's program:

def number_distinct(fn):
     return len(set(s.strip() for s in open(fn)))

which worked without the square brackets. Interesting that 
omitting them doesn't seem to matter.
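
As far as I can tell, the brackets can be dropped because the bare
"s.strip() for s in ..." inside the call parentheses is a generator
expression (available since Python 2.4). set() accepts any iterable, so
it consumes the stripped lines one at a time instead of first building a
complete list. A small sketch of the equivalence, using a made-up
in-memory list in place of a real file:

```python
# Hypothetical sample data standing in for the lines of a file.
lines = ["a\n", "b\n", "a \n", "c\n", "b\n"]

# List comprehension: builds the whole list of stripped lines first,
# then hands it to set().
with_brackets = len(set([s.strip() for s in lines]))

# Generator expression: set() pulls stripped lines one at a time,
# so no intermediate list is created.
without_brackets = len(set(s.strip() for s in lines))

# Both forms produce the same count.
print(with_brackets, without_brackets)
```

The result is the same either way; the difference is only that the
generator form avoids holding the intermediate list in memory.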



