Newbie here... getting a count of repeated instances in a list.

Sat Nov 22 06:23:25 EST 2003

Amy G wrote:

> I started trying to learn python today.  The program I am trying to write

Welcome to the worst programming language ... except all others :-)

> will open a text file containing email addresses and store them in a list.
> Then it will go through them saving only the domain portion of the email.
> After that it will count the number of times the domain occurs, and if
> above a certain threshold, it will add that domain to a list or text
> file, or whatever.  For now I just have it printing to the screen.
> 
> This is my code, and it works and does what I want.  But I want to do
> something with hash object to make this go a whole lot faster.  Any
> suggestions are appreciated a great deal.

I think your code looks alright, just not very idiomatic, as one might
expect on day one. I have tinkered with it a bit and came up with the
version given below. I hope the result is readable enough; I have also
abused the comments to give some hints regarding the language/library that
you might find useful. get_domains() creates a dictionary with domains as
keys and the number of occurrences as values. A dictionary is Python's
built-in hash table, so it is the "hash object" you were asking about.
The result looks like this, e.g.

{"nowhere.com": 10, "elsewhere.edu": 5}
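
As a minimal sketch of that counting idiom on its own, assuming a current
Python 3 interpreter (the full listing further down is written for the
Python 2 of the time): count_domains() and the sample addresses are made up
for illustration, and collections.Counter is a standard-library shortcut
that builds the same mapping.

```python
from collections import Counter

def count_domains(addresses):
    """Map each domain to the number of addresses that use it."""
    counts = {}
    for address in addresses:
        domain = address.split("@", 1)[1].strip()
        # dict.get() supplies 0 the first time a domain is seen
        counts[domain] = counts.get(domain, 0) + 1
    return counts

# hypothetical sample data, not taken from the original post
sample = ["a@nowhere.com", "b@nowhere.com", "c@elsewhere.edu"]
counts = count_domains(sample)

# Counter builds the same domain->frequency mapping in one step
same = Counter(addr.split("@", 1)[1].strip() for addr in sample)
```

Each lookup and update in the dictionary takes roughly constant time, so
the whole count is a single pass over the addresses.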

The code for extracting the domain from a line is factored out in its own
function extract_domain(). Precautions have been taken to make the file
usable both as a library module and a stand-alone script.

import sys

def get_domains(lines):
    "Generate a domain->frequency dict from lines"
    domains = {}
    # enumerate() is the pythonic equivalent for
    # index = 0
    # while index < len(alist):
    #     alist[index]
    #     index += 1
    for lineno, line in enumerate(lines):
        try:
            domain = extract_domain(line)
        except ValueError:
            print >> sys.stderr, "IGNORING line %d: %r" % (
                lineno+1, line.strip())
        else:
            # else in a try ... except ... else statement
            # may look a bit strange at first, but is
            # really useful
            domains[domain] = domains.get(domain, 0) + 1
    return domains

def extract_domain(line):
    "Remove the name part of an email address"
    try:
        return line.split("@", 1)[1].strip()
    except IndexError:
        # in this short example, you could just catch
        # the index error in get_domains; however, in the long run
        # it pays for client code to always see the "right" exception
        raise ValueError("Invalid email address format: %r" % line.strip())

def filter_domains(domains, threshold=10):
    "Domains occurring at least <threshold> times, in alphabetical order"
    # below is a demo of a very popular construct 
    # called "list comprehension"
    result = [domain for domain, freq in domains.iteritems()
              if freq >= threshold]
    # the list.sort() method returns None, so
    # sorting may look a bit clumsy when
    # you first encounter it
    result.sort()
    return result

# The __name__ == "__main__" test is a common idiom in Python.
# The code below is only executed if you run the script from the
# command line, but not if you import it into another module,
# so the functions above can be reused in other contexts.
if __name__ == "__main__":
    # for proper handling of command line args, have a look at
    # the optparse module
    threshold = int(sys.argv[2])
    # the file object is iterable, so in many
    # cases you can avoid an intermediate list
    # of the lines in a file
    source = file(sys.argv[1])
    try:
        domain_histogram = get_domains(source)
    finally:
        # clean up behind you, even if something goes wrong in the
        # try block (finally always runs)
        source.close()

    print "domains with %d or more messages" % threshold
    print "\n".join(filter_domains(domain_histogram, threshold))
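
For the two hints about sorting and command line handling, here is a rough
sketch, again assuming a current Python 3. It uses sorted(), which returns
a new list, and argparse, the standard-library successor to optparse.
Making the threshold an optional --threshold flag is my assumption (the
script above takes it as the second positional argument), and
"addresses.txt" and the sample counts are placeholders.

```python
import argparse

def filter_domains(domains, threshold=10):
    """Domains seen at least `threshold` times, in alphabetical order."""
    # sorted() returns a new list, unlike list.sort(), which returns None
    return sorted(d for d, freq in domains.items() if freq >= threshold)

def parse_args(argv):
    """Parse a filename and an optional --threshold from a list of strings."""
    parser = argparse.ArgumentParser(description="List frequent email domains")
    parser.add_argument("filename", help="text file with one address per line")
    parser.add_argument("--threshold", type=int, default=10,
                        help="minimum number of occurrences")
    return parser.parse_args(argv)

# placeholder arguments; the file is never opened in this sketch
args = parse_args(["addresses.txt", "--threshold", "5"])
frequent = filter_domains(
    {"nowhere.com": 10, "elsewhere.edu": 5, "rare.org": 1},
    args.threshold)
```

In a real script you would pass sys.argv[1:] to parse_args() and get usage
messages and type checking of the threshold for free.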

Peter