number of different lines in a file

Ben Finney bignose+hates-spam at benfinney.id.au
Thu May 18 18:19:26 EDT 2006


"r.e.s." <r.s at ZZmindspring.com> writes:

> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.

I'd generalise it by allowing the caller to pass any iterable of
items. A file object yields its lines when iterated, but so does any
other sequence or iterable. The items only need to be hashable, since
they are used as dictionary keys.

    def count_distinct(seq):
        """ Count the number of distinct items """
        # Map each distinct item to the number of times it was seen
        counts = dict()
        for item in seq:
            if item not in counts:
                counts[item] = 0
            counts[item] += 1
        # One key per distinct item
        return len(counts)
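
Since count_distinct only returns the number of keys, the per-item
tallies are not strictly needed for this particular question. As a
minimal sketch of an equivalent, assuming the items are hashable, the
built-in set type does the same job (the name count_distinct_set is
mine, purely for illustration):

```python
def count_distinct_set(items):
    """Count the number of distinct (hashable) items in any iterable."""
    # A set keeps a single copy of each distinct item,
    # so its length is the number of distinct items.
    return len(set(items))
```

Like the dictionary version, this holds one copy of every distinct
item in memory, so the cost depends on how many distinct lines there
are rather than on the total line count.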

    >>> infile = file('foo.txt')
    >>> for line in infile:
    ...     print line,
    ...
    abc
    def
    ghi
    abc
    ghi
    def
    xyz
    abc
    abc
    def

    >>> infile = file('foo.txt')
    >>> print count_distinct(infile)
    5
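
One thing to keep in mind: iterating a file yields each line with its
line terminator still attached, so a final line lacking a newline
compares unequal to the same text followed by one. If that difference
should not count, here is a sketch that strips the terminator before
comparing. The name count_distinct_lines and the in-memory sample are
assumptions for illustration, not part of the original problem:

```python
import io

def count_distinct_lines(lines):
    """Count distinct lines, ignoring each line's trailing terminator."""
    # Strip only CR/LF at the end, so other whitespace still counts
    return len(set(line.rstrip("\r\n") for line in lines))

# Hypothetical sample: the last line has no trailing newline
sample = io.StringIO(u"abc\ndef\nabc\ndef")
print(count_distinct_lines(sample))
```

Without the rstrip call, 'def\n' and 'def' in that sample would be
counted as two different lines.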

-- 
 \            "A man may be a fool and not know it -- but not if he is |
  `\                                    married."  -- Henry L. Mencken |
_o__)                                                                  |
Ben Finney



