Newbie here... getting a count of repeated instances in a list.

Robert Brewer fumanchu at
Fri Nov 21 22:22:14 EST 2003

> I started trying to learn python today.  The program I am trying to write
> will open a text file containing email addresses and store them in a list.
> Then it will go through them saving only the domain portion of the email.
> After that it will count the number of times the domain occurs, and if above
> a certain threshold, it will add that domain to a list or text file, or
> whatever.  For now I just have it printing to the screen.
> This is my code, and it works and does what I want.  But I want to do
> something with hash object to make this go a whole lot faster.  Any
> suggestions are appreciated a great deal.

The code you posted doesn't work as written. The value of "d" within
count_domains(), for example, is never set. Avoid the temptation to
rewrite your working code into broken code for email purposes. :)

Here's a cleaned-up version, using hashes as you requested (untested):

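def get_domains(email_list):
    """Form a dictionary of email addresses by domain."""
    domains = {}
    for email in email_list:
        # Everything after the first '@' is the domain.
        try:
            domain = email.split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        # Fetch the list for this domain, creating it on first sight.
        try:
            domain_list = domains[domain]
        except KeyError:
            domain_list = domains[domain] = []
        domain_list.append(email)
    return domains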

def count_domains(domains, threshhold):
    """Domains which occur more than <threshhold> number of times."""
    threshhold_domains = []
    for key, addresses in domains.items():
        if len(addresses) > threshhold:
            threshhold_domains.append(key)
    return threshhold_domains

import sys

file = open(sys.argv[1], 'r')
mail_list = file.readlines()
file.close()
domains = get_domains(mail_list)
counted = count_domains(domains, 10)
print(counted)
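
For anyone who wants to try the group-then-filter idea without an input file, here is a rough Python 3 sketch. It is not the code from this post: the names group_by_domain and domains_over are hypothetical, the sample addresses are made up, dict.setdefault stands in for the try/except KeyError pair, and the strip() call is an added assumption to drop the newline that readlines() leaves on each line.

```python
def group_by_domain(email_list):
    """Map each domain to the list of addresses seen for it."""
    domains = {}
    for email in email_list:
        email = email.strip()  # assumption: trailing newlines are unwanted
        try:
            domain = email.split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        # setdefault returns the existing list or installs a new empty one.
        domains.setdefault(domain, []).append(email)
    return domains


def domains_over(domains, threshold):
    """Domains with more than <threshold> addresses."""
    result = []
    for key, addresses in domains.items():
        if len(addresses) > threshold:
            result.append(key)
    return result


# Made-up sample data standing in for the lines of the input file.
sample = ['a@example.com\n', 'b@example.com\n', 'c@example.org\n', 'oops\n']
grouped = group_by_domain(sample)
print(domains_over(grouped, 1))
```

Only example.com appears more than once in the sample, so it is the only domain that clears a threshold of 1.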

Some commentary:

1) Notice the use of """docstrings""" instead of #comments for function
descriptions.
2) You could reduce count_domains() to a single line with a 'list
comprehension'. Look them up in the Tutorial. Instead of writing:

counted = count_domains(domains, 10)

you could write:

counted = [key for key in domains if len(domains[key]) > 10]

3) Notice the try/except block around the email.split() call, and that I
used 'No domain' for error cases. You might just as well use '' or some
other placeholder.
4) Since we are not modifying the lists in place, it's much cleaner to
use for: instead of while:.
5) I assumed the step which removed addresses from the list wasn't
necessary anymore. Put it back in if it is.
6) I forgot what point 6 was. I'm sure it'll come to me after I hit
'Send'. :)
7) Finally, if you just want the counting and aren't going to reuse the
dictionary of domain-sorted addresses, you can perform the counting
within the dictionary-formation code to save a *lot* of time. Combining
this with point 2 we get:

def domain_count(email_list):
    """Number of addresses for each domain."""
    domains = {}
    for email in email_list:
        # Everything after the first '@' is the domain.
        try:
            domain = email.split('@', 1)[1]
        except IndexError:
            domain = 'No domain'
        # Bump the running count, starting at 1 for a new domain.
        try:
            domains[domain] += 1
        except KeyError:
            domains[domain] = 1
    return domains

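import sys

file = open(sys.argv[1], 'r')
mail_list = file.readlines()
file.close()
# Iterate over (domain, count) pairs; a bare dict yields only its keys.
counted = [key for key, count in domain_count(mail_list).items() if count > 10]
print(counted)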
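
As a variation on point 7, the same count-as-you-go idea might be sketched with collections.Counter. This assumes a Python that ships Counter (2.7 or 3.x, so newer than this post). The helper domain_of, the strip() call, and the sample addresses are illustrative assumptions, not part of the code above.

```python
from collections import Counter


def domain_of(email):
    """Domain part of an address, or 'No domain' if there is no '@'."""
    try:
        return email.strip().split('@', 1)[1]
    except IndexError:
        return 'No domain'


def domain_count(email_list):
    """Number of addresses for each domain."""
    # Counter tallies each domain in a single pass over the addresses.
    return Counter(domain_of(email) for email in email_list)


# Made-up sample data standing in for the lines of the input file.
sample = ['a@example.com\n', 'b@example.com\n', 'c@example.org\n', 'oops\n']
counts = domain_count(sample)
frequent = [domain for domain, count in counts.items() if count > 1]
print(frequent)
```

Counter is a dict subclass, so the list comprehension over .items() works the same way as with the hand-rolled dictionary.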


Robert Brewer
Amor Ministries
fumanchu at
