December 2002 comp.lang.* stats

Sun Jan 26 11:46:26 EST 2003

> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Peter Hansen

...

> Thanks Aaron.  I'm forced to admit that these numbers *appear* to
> correspond to my purely subjective feeling as to the relative
> popularity, in a very vague way, of these languages.   It will
> be interesting - if you can finish refining the script and then
> "lock it down" - to compare the results over time.

I've been working on a much more comprehensive analysis of Java and Dotnet
developer activity, covering close to 5,000 sources (Usenet groups, mailing
lists, web forums -- a source is one group, list, forum, etc.).  I just
recently threw Python into the mix, mainly because it's what I'm using to
gather the data and do much of the analysis.  This amounts to more than
5,000 messages a day.  It isn't just venues for supporting the language; it
includes open-source projects being created with Java and Python (by
definition, there aren't many true open source projects done with Dotnet,
except for Mono and things like that).

Another quick way to get a sense of relative momentum is to look at
Sourceforge's "software map":
http://sourceforge.net/softwaremap/trove_list.php?form_cat=160 and then
drill down to see the activity levels for the top projects for each
platform.  For example, the VB projects' activity levels drop off much
faster than the Python projects'.  And you could keep digging deeper just at
Sourceforge, measuring what's really going on in each area.

I'm developing a number of metrics out of this data, some of which I'll be
making public.  But this toolkit is mostly for me to use in providing
intelligence (but not, not, NOT e-mail addresses for spamming!) to my
company's clients.

O'Reilly & Associates has been doing this sort of thing for quite a while,
to forecast demand for books about open source software, in particular.  I
did some brainstorming with them a few years ago and later started Opion,
which applied this kind of analysis to stock market discussions, feature
films and other topics.  That's now owned by Intelliseek, which mostly does
consumer market research.

One thing that became clear early when I built the Opion prototype was that
the number of unique participants is far and away the most meaningful basic
statistic -- far more than number of posts.  That was in stock market
discussions, but I haven't seen any reason to believe there's a difference
elsewhere.
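
To make the distinction between the two statistics concrete, here is a
minimal Python sketch that computes both per source.  The sample data and
the name activity_stats are made up for illustration; this is not the Opion
code, just one simple way the counts could be taken.

```python
from collections import Counter

# Hypothetical sample data: one (source, author) pair per post.
posts = [
    ("comp.lang.python", "alice"),
    ("comp.lang.python", "bob"),
    ("comp.lang.python", "alice"),
    ("comp.lang.java", "carol"),
    ("comp.lang.java", "carol"),
    ("comp.lang.java", "carol"),
    ("comp.lang.java", "carol"),
]

def activity_stats(posts):
    """Return {source: (post_count, unique_participant_count)}."""
    post_counts = Counter(source for source, _ in posts)
    authors = {}
    for source, author in posts:
        # A set keeps each author once per source, however often they post.
        authors.setdefault(source, set()).add(author)
    return {s: (post_counts[s], len(authors[s])) for s in post_counts}

stats = activity_stats(posts)
for source, (n_posts, n_people) in sorted(stats.items()):
    print(source, "posts:", n_posts, "unique participants:", n_people)
```

In this toy data the Java group has more posts but only one participant,
which is the kind of case where post counts alone would mislead.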

At Opion and in my current work, I put a big emphasis on identifying the
most influential participants in the discussions through traffic analysis,
link analysis, etc., and giving higher weight to their activities.  Spammers
end up with very low weights because they almost never trigger a meaningful
response from the community... or they cross-post so widely, with no depth
to the resulting discussions, that they're easily identified.  I also use
various anti-spam mechanisms on my server.  Happily, it's not much of an
issue on the mailing lists, where a lot of the action is.
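
A rough sketch of the weighting idea, in Python, under some loudly stated
assumptions: the message tuples, the name participant_weights, and the
formula itself (distinct responders divided by average cross-post breadth)
are all invented for illustration.  The actual traffic and link analysis is
not described here and is surely more involved.

```python
from collections import defaultdict

# Hypothetical messages:
# (author, author being replied to or None, number of groups posted to)
messages = [
    ("alice", None, 1),
    ("bob", "alice", 1),
    ("carol", "alice", 1),
    ("alice", "bob", 1),
    ("spammer", None, 40),
    ("spammer", None, 40),
]

def participant_weights(messages):
    """Toy weight: distinct people who replied / average cross-post breadth."""
    responders = defaultdict(set)   # author -> set of people who replied
    breadth = defaultdict(list)     # author -> groups-per-message values
    for author, replied_to, n_groups in messages:
        breadth[author].append(n_groups)
        if replied_to is not None and replied_to != author:
            responders[replied_to].add(author)
    weights = {}
    for author, groups in breadth.items():
        avg_breadth = sum(groups) / len(groups)
        # No responses, or very wide cross-posting, pushes the weight down.
        weights[author] = len(responders[author]) / avg_breadth
    return weights

weights = participant_weights(messages)
for author, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(author, round(w, 3))
```

With this toy formula, a participant who draws replies from several
distinct people scores highest, while a wide cross-poster who gets no
replies scores zero.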

If any folks here are seriously interested in this area, we may need some
technical help soon, preferably in the South Bay area, where I am.

Nick