Newbie here... getting a count of repeated instances in a list.

Fri Nov 21 22:22:14 EST 2003

> I started trying to learn Python today.  The program I am trying to write
> will open a text file containing email addresses and store them in a list.
> Then it will go through them saving only the domain portion of the email.
> After that it will count the number of times the domain occurs, and if above
> a certain threshold, it will add that domain to a list or text file, or
> whatever.  For now I just have it printing to the screen.
> 
> This is my code, and it works and does what I want.  But I want to do
> something with a hash object to make this go a whole lot faster.  Any
> suggestions are appreciated a great deal.

Well...it doesn't work as written. The value of "d" within
count_domains(), for example, is never set. Avoid the temptation to
rewrite your working code into broken code for email purposes. :)

Here's a cleaned-up version (untested), using hashes as you requested; in
Python they're called dictionaries:

def get_domains(email_list):
    """Form a dictionary of email addresses by domain."""
    domains = {}
    for email in email_list:
        # Strip first so the trailing newline from readlines() doesn't
        # end up in the domain key.
        email = email.strip()
        try:
            domain = email.split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        try:
            domain_list = domains[domain]
        except KeyError:
            domain_list = domains[domain] = []
        domain_list.append(email)
    return domains

def count_domains(domains, threshold):
    """Domains which occur more than <threshold> number of times."""
    threshold_domains = []
    for key, addresses in domains.items():
        if len(addresses) > threshold:
            threshold_domains.append(key)
    return threshold_domains

import sys

infile = open(sys.argv[1], 'r')
mail_list = infile.readlines()
infile.close()
domains = get_domains(mail_list)
counted = count_domains(domains, 10)
print(counted)
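
To check the two functions above without an input file, here is a minimal
sketch that runs the same logic on a short in-memory list. The sample
addresses and the threshold of 1 are made up for illustration. Two things
differ from the listing: each line is stripped before splitting, so the
newline from readlines() stays out of the keys, and dict.setdefault() is
swapped in for the try/except KeyError block purely for brevity.

```python
def get_domains(email_list):
    """Form a dictionary of email addresses by domain."""
    domains = {}
    for email in email_list:
        email = email.strip()
        try:
            domain = email.split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        # setdefault() returns the existing list, or stores a new empty one.
        domains.setdefault(domain, []).append(email)
    return domains

def count_domains(domains, threshold):
    """Domains which occur more than <threshold> number of times."""
    result = []
    for key, addresses in domains.items():
        if len(addresses) > threshold:
            result.append(key)
    return result

# Hypothetical sample data; the trailing newlines mimic readlines().
sample = ['a@example.com\n', 'b@example.com\n', 'c@example.org\n', 'bogus\n']
domains = get_domains(sample)
frequent = count_domains(domains, 1)
print(frequent)
```

Only example.com appears more than once in the sample, so it is the only
domain that passes the threshold; the address without an '@' is filed
under 'No domain'.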

Some commentary:

1) Notice the use of """docstrings""" instead of #comments for function
descriptions.
2) You could reduce count_domains() to a single line with a 'list
comprehension'. Look them up in the Tutorial. Instead of writing:

counted = count_domains(domains, 10)

you could write:

counted = [key for key in domains if len(domains[key]) > 10]

3) Notice the try/except block around the email.split() call, and that I
used 'No domain' for error cases. You might just as well use '', or
None.
4) Since we are not modifying the lists in place, it's much cleaner to
use for: instead of while:.
5) I assumed the step which removed addresses from the list wasn't
necessary anymore. Put it back in if it is.
6) I forgot what point 6 was. I'm sure it'll come to me after I hit
'Send'. :)
7) Finally, if you just want the counting and aren't going to reuse the
dictionary of addresses grouped by domain, you can perform the counting
within the dictionary-formation code. That stores one integer per domain
instead of a list of addresses, which saves a *lot* of memory and some
time. Combining this with point 2 we get:

def domain_count(email_list):
    """Number of addresses for each domain."""
    domains = {}
    for email in email_list:
        try:
            # strip() drops the trailing newline left by readlines().
            domain = email.strip().split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        try:
            domains[domain] += 1
        except KeyError:
            domains[domain] = 1
    return domains

import sys

infile = open(sys.argv[1], 'r')
mail_list = infile.readlines()
infile.close()
counted = [key for key, count in domain_count(mail_list).items() if count > 10]
print(counted)
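
For readers on Python 2.7 or later, the counting loop above can also be
handed to collections.Counter from the standard library. This is a
substitution, not something the code above relies on; the helper name
domain_of, the sample addresses, and the threshold of 1 are made up for
illustration.

```python
from collections import Counter

def domain_of(email):
    """Domain part of an address, or 'No domain' if there is no '@'."""
    try:
        return email.strip().split('@', 1)[1]
    except IndexError:
        return 'No domain'

# Hypothetical sample data; the trailing newlines mimic readlines().
sample = ['a@example.com\n', 'b@example.com\n', 'c@example.org\n', 'bogus\n']

# Counter tallies each domain as it consumes the generator.
counts = Counter(domain_of(email) for email in sample)
frequent = [domain for domain, count in counts.items() if count > 1]
print(frequent)
```

Counter is a dict subclass, so the list comprehension that filters by
count works on it the same way it would on a plain dictionary of counts.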

HTH!

Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org