[Numpy-discussion] adding a cut function to numpy

Mon Apr 16 17:27:27 EDT 2012

Hi,

I have a pull request here [1] to add a cut function similar to R's
[2]. It seems there are often requests for similar functionality. It's
something I'm making use of for my own work and would like to use in
statsmodels and in generating instances of pandas' Factor class, but
is this something people would generally find useful enough to warrant
its inclusion in numpy? I think it would be even more useful with an
enum dtype in numpy.

If you aren't familiar with cut, here's a potential use case: going
from a continuous to a categorical variable.

Given a continuous variable

[~/]
[8]: age = np.random.randint(15,70, size=100)

[~/]
[9]: age
[9]:
array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44, 27,
       17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42, 69,
       50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52, 40,
       27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22, 25,
       36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50, 27,
       23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])

Now give me a variable that assigns each person to an age group (each
group's lower bound is exclusive, its upper bound inclusive)

[~/]
[10]: groups = [14, 25, 35, 45, 55, 70]

[~/]
[11]: age_cat = np.cut(age, groups)

[~/]
[12]: age_cat
[12]:
array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
       1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
       3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
       3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
       3, 2, 3, 2, 1, 3, 2, 2])
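
For anyone who wants to try the binning above without the pull request
applied, here is a minimal sketch built on np.digitize. It is not the code
from the pull request: the helper name cut_codes is hypothetical, and it
assumes a numpy version whose digitize accepts the right keyword. It uses
only the first seven ages from the example.

```python
import numpy as np


def cut_codes(values, edges):
    """Hypothetical helper: 1-based bin code per value, bins are (lower, upper]."""
    values = np.asarray(values)
    edges = np.asarray(edges)
    # right=True closes each bin on the right: edges[i-1] < x <= edges[i] -> i
    return np.digitize(values, edges, right=True)


# First seven ages from the example above, with the same group edges.
age = np.array([58, 32, 20, 25, 34, 69, 52])
groups = [14, 25, 35, 45, 55, 70]

age_cat = cut_codes(age, groups)
print(age_cat)  # [5 2 1 1 2 5 4]
```

Values at or below the first edge get code 0 and values above the last
edge get code len(edges); this sketch does not treat them specially.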

Skipper

[1] https://github.com/numpy/numpy/pull/248
[2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html