Using the nntplib module to count Google Groups users

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Oct 26 23:32:25 EDT 2013


There's been a bit of a discussion about how prevalent Google Groups 
users are in this forum. This is a good opportunity to use one of 
Python's standard library modules to scan through the comp.lang.python 
newsgroup and find out. So here's some code to do so:


import nntplib
import sys
s = nntplib.NNTP('news.internode.on.net')  # footnote [1]
resp, count, first, last, name = s.group('comp.lang.python')
print 'Group', name, 'has', count, 'articles, range', first, 'to', last
print 'Checking the most recent (approx) 5000 messages...'
last = int(last)
count = 0
gg = 0
template = "\rArticle %d: found %d Google Groups headers."
try:
    for id in range(last-5000, last+1):
        try:
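            # In Python 2, head() returns (response, number, id, list);
            # only the last item holds the header lines.
            headers = s.head(str(id))[3]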
        except Exception:
            continue
        count += 1
        for line in headers:
            if "google" in line and "group" in line:
                gg += 1
                sys.stdout.write(template % (id, gg))
                sys.stdout.flush()
                break
except KeyboardInterrupt:
    pass
finally:
    print

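s.quit()
if count:
    print "Google Groups posts: %.2f%% of %d" % (gg*100.0/count, count)
else:
    # Avoid dividing by zero when no article could be fetched.
    print "No articles could be retrieved."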



Footnote [1] For this to work, you will need to be a subscriber with the 
ISP Internode. If you are not, you will need to substitute your ISP's 
news server. (Or your own, if you are running your own news server.)


This is a relatively busy newsgroup, and consequently downloading all the 
headers may take a while, which is why I have limited it to only the most 
recent 5000. I get this output:

Group comp.lang.python has 150071 articles, range 369087 to 519157
Checking the most recent (approx) 5000 messages...
Article 519153: found 957 Google Groups headers.
'205 Transferred 12653216 bytes in 0 articles, 0 groups.  Disconnecting.'
Google Groups posts: 19.14% of 5001


Note that this *definitely* over-counts Google Groups. It also includes 
replies to GG posts, as well as those actually sent via GG. There are 
other false positives as well. But as a rough-and-ready estimate, I think 
it is good evidence that fewer than 1 in 5 posts come from Google Groups, 
so definitely a minority, and by a long way.
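
A possible way to tighten the count, offered as an untested-against-a-live-server 
sketch: look only at the Message-ID header rather than at every header line, so 
that References and In-Reply-To lines in replies no longer match. It assumes 
that posts sent via GG carry a Message-ID containing googlegroups.com, and that 
the header lines arrive as plain strings; sent_via_gg is a made-up helper name, 
not part of nntplib.

```python
def sent_via_gg(header_lines):
    """Return True if the Message-ID header looks like a Google Groups one.

    header_lines is a list of header strings, one header per line.
    The 'googlegroups.com' test is an assumption, not a verified rule.
    """
    for line in header_lines:
        name, sep, value = line.partition(':')
        if sep and name.strip().lower() == 'message-id':
            return 'googlegroups.com' in value.lower()
    # No Message-ID header found at all.
    return False
```

In the loop above, this would stand in for the "google" and "group" substring 
test, called once per article on the list of header lines.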

Naturally this doesn't count lurkers who read via GG but never post. Nor 
does it count distinct users, only distinct posts.

If anyone wants to modify the script to determine the ratio of posters, 
rather than posts, using GG, be my guest. I'd be interested in the 
answer, but not interested enough to actually do the work myself.
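
For anyone taking that up, here is a rough sketch of the counting step only, 
with no network access. It assumes the header lines of each article have 
already been collected as lists of plain strings, identifies a poster by the 
address in the From header, and treats a Message-ID containing 
googlegroups.com as the sign of a GG post. Both choices are assumptions, and 
header_value and poster_ratio are hypothetical names.

```python
from email.utils import parseaddr


def header_value(header_lines, wanted):
    """Return the value of the first header named `wanted` (lowercase), or ''."""
    for line in header_lines:
        name, sep, value = line.partition(':')
        if sep and name.strip().lower() == wanted:
            return value.strip()
    return ''


def poster_ratio(articles):
    """Return (gg_posters, all_posters) for a list of header-line lists."""
    posters = set()
    gg_posters = set()
    for header_lines in articles:
        # parseaddr() returns (real name, address); keep only the address.
        address = parseaddr(header_value(header_lines, 'from'))[1].lower()
        if not address:
            continue  # no usable From header; skip this article
        posters.add(address)
        # Assumption: GG posts have a googlegroups.com Message-ID.
        if 'googlegroups.com' in header_value(header_lines, 'message-id').lower():
            gg_posters.add(address)
    return len(gg_posters), len(posters)
```

Note that anyone who has posted via GG even once is counted as a GG poster 
here, and that people who post under several addresses are counted several 
times.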



-- 
Steven


