[Numpy-discussion] statistics in python

Radim radimrehurek at seznam.cz
Fri Oct 14 11:58:33 EDT 2011


Hi Rense (cross-posting to the numpy mailing list because these guys
are awesome),

On Oct 13, 10:01 pm, Rense Lange <rense.la... at gmail.com> wrote:
> I have potentially millions of tuples <v1,v2,v3 ..., observation> and
> I want to create frequency distributions conditional on the values of
> discrete variables v1, v2, ... (e.g. the sums for boys vs. girls), or
> combinations thereof (poor boys, poor girls, rich boys, rich girls).
> Very few of the v1 x v2 x ... combinations might actually occur. Also,
> it is sometimes necessary to combine different data sets.
>
> Should I just use some DB system (and if so, which one is best within
> Python), or are there sparse matrix methods that are to be preferred?

Yes, that sounds like a job for a database. SQLite is built in (part
of the standard Python library: `import sqlite3`).

Numpy supports structured arrays (records) as well:

>>> import numpy
>>> dt = numpy.dtype([('name', 'U10'), ('wealth', numpy.int32), ('sex', 'U1')])
>>> x = numpy.array([('Mary', 10000, 'F'), ('John', 1000, 'M'), ('unknown', -1, '?')], dtype=dt)
>>> print(x[(x['wealth'] > 5000) & (x['sex'] == 'F')])  # print records for all rich girls
[('Mary', 10000, 'F')]

so perhaps that could also fit the bill. There are no indexes, so every
query scans the whole array, but it's more pleasant to work with, imo.
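
For the conditional sums themselves, here is a rough sketch of one way to
do it with plain numpy. The field names and sample values are made up for
illustration; the idea is to turn each discrete variable into integer
codes with numpy.unique, merge the codes into one key per record, and
aggregate only over the keys that actually occur:

```python
import numpy

# Hypothetical sample data: two discrete variables plus one observation.
dt = numpy.dtype([('wealth', 'U4'), ('sex', 'U1'), ('observation', numpy.int32)])
x = numpy.array([('poor', 'M', 3), ('rich', 'F', 5), ('poor', 'M', 2),
                 ('rich', 'F', 1), ('poor', 'F', 4)], dtype=dt)

# Map each discrete variable to integer codes 0..k-1.
wealth_vals, wealth_codes = numpy.unique(x['wealth'], return_inverse=True)
sex_vals, sex_codes = numpy.unique(x['sex'], return_inverse=True)

# One integer key per record, identifying its (wealth, sex) combination.
combo = wealth_codes * len(sex_vals) + sex_codes

# Keep only combinations that occur, so sparse data stays small.
occurring, inverse = numpy.unique(combo, return_inverse=True)
counts = numpy.bincount(inverse)
sums = numpy.bincount(inverse, weights=x['observation'])

# Decode the keys back into readable labels.
result = {}
for key, n, total in zip(occurring, counts, sums):
    label = (str(wealth_vals[key // len(sex_vals)]), str(sex_vals[key % len(sex_vals)]))
    result[label] = (int(n), float(total))

print(result)
```

Combinations that never occur (here, rich boys) simply do not show up in
`result`, which matters when very few of the v1 x v2 x ... cells are
populated.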

Note that "potentially millions of records" is not particularly big,
so as long as you don't have too many variables, some in-memory db
should be ok and will save you from headaches of dealing with complex
db setups. I have also heard very good things about PyTables, though
I've never used it myself (gensim uses plain float matrices); you can
have a look there: http://pytables.org
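
To make the in-memory db option concrete, here is a minimal sketch with
sqlite3. The table name, column names and sample rows are assumptions for
illustration only; the point is that GROUP BY returns just the
combinations that occur, and combining data sets is one more insert:

```python
import sqlite3

# ':memory:' keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE obs (wealth TEXT, sex TEXT, observation INTEGER)')

# Two hypothetical data sets, loaded into the same table.
first_batch = [('poor', 'M', 3), ('rich', 'F', 5), ('poor', 'M', 2)]
second_batch = [('rich', 'F', 1), ('poor', 'F', 4)]
conn.executemany('INSERT INTO obs VALUES (?, ?, ?)', first_batch)
conn.executemany('INSERT INTO obs VALUES (?, ?, ?)', second_batch)

# Frequency and sum of observations for each (wealth, sex) combination.
query = ('SELECT wealth, sex, COUNT(*), SUM(observation) '
         'FROM obs GROUP BY wealth, sex ORDER BY wealth, sex')
summary = conn.execute(query).fetchall()
conn.close()

for row in summary:
    print(row)
```

Passing a file name instead of ':memory:' gives the same interface backed
by disk, should the data outgrow RAM.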

HTH,
Radim


> > On Oct 12, 10:01 pm, Rense Lange <rense.la... at gmail.com> wrote:
>
> > > Gensim must be storing data very efficiently, and I need to do something
> > > similar for another application. Can you tell me what Python programming
> > > approach was used in Gensim, and are there perhaps particular sections in
> > > the Gensim code that I should be looking at for inspiration and examples?
>
> > that question is too broad. What kind of data and application do you
> > have? What sort of efficiency are you after? (fast access/little disk
> > space/fast load time/...?)
>
> > Radim


