Newbie here... getting a count of repeated instances in a list.
Peter Otten
__peter__ at web.de
Sat Nov 22 06:23:25 EST 2003
Amy G wrote:
> I started trying to learn python today. The program I am trying to write
Welcome to the worst programming language ... except all others :-)
> will open a text file containing email addresses and store them in a list.
> Then it will go through them saving only the domain portion of the email.
> After that it will count the number of times the domain occurs, and if
> above a certain threshhold, it will add that domain to a list or text
> file, or
> whatever. For now I just have it printing to the screen.
>
> This is my code, and it works and does what I want. But I want to do
> something with hash object to make this go a whole lot faster. Any
> suggestions are appreciated a great deal.
I think your code looks alright, just not very idiomatic, as one might
expect. I have tinkered with it a bit and came up with the version given
below. I hope the result is readable enough, so I have abused the comments
to give some hints regarding the language/library that you might find
useful. get_domains() creates a dictionary with domains as keys and the
number of occurences as values, e. g.
{"nowhere.com": 10, "elswhere.edu": 5}
The code for extracting the domain from a line is factored out in its own
function extract_domain(). Precautions have been taken to make the file
usable both as a library module and a stand-alone script.
import sys
def get_domains(lines):
"Generate a domain->frequency dict from lines"
domains = {}
# enumerate() is the pythonic equivalent for
# index = 0
# while index < len(alist):
# alist[index]
# index += 1
for lineno, line in enumerate(lines):
try:
domain = extract_domain(line)
except ValueError:
print >> sys.stderr, "IGNORING line %d: %r" % (lineno+1,
line.strip())
else:
# else in a try ... except ... else statement
# may look a bit strange at first, but ist
# really useful
domains[domain] = domains.get(domain, 0) + 1
return domains
def extract_domain(line):
"Remove the name part of an emal address"
try:
return line.split("@", 1)[1].strip()
except IndexError:
# in this short example, you could just catch
# the index error in get_domains; however, in the long run
# it pays for client code to always see the "right" exception
raise ValueError("Invalid email address format: %r" % line.strip())
def filter_domains(domains, threshold=10):
"The <threshold> most frequent domains in alphabetical order"
# below is a demo of a very popular construct
# called "list comprehension"
result = [domain for domain, freq in domains.iteritems() if freq >=
threshold]
# the list.sort() method returns None, so
# sorting may look a bit clumsy when
# you first encounter it
result.sort()
return result
# The __name__ == "__main__" test is a common idiom in Python.
# The code below is only executed if you run the script from the
# command line, but not if you import it into another module,
# thus allowing to use the above functions in other contexts.
if __name__ == "__main__":
# for proper handling of command line args, have a look at
# the optparse module
threshold = int(sys.argv[2])
# the file object is iterable, so in many
# cases you can avoid an intermediate list
# of the lines in a file
source = file(sys.argv[1])
try:
domain_histogram = get_domains(source)
finally:
# clean up behind you if something goes wrong in the
# try block
source.close()
print "domains with %d or more messages" % threshold
print "\n".join(filter_domains(domain_histogram, threshold))
Peter
More information about the Python-list
mailing list