[Numpy-discussion] Enum/Factor NEP (now with code)

Wes McKinney wesmckinn at gmail.com
Wed Jun 13 18:11:03 EDT 2012


On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven <bryanv at continuum.io> wrote:
> On 6/13/12 1:54 PM, Wes McKinney wrote:
>> OK, I need to spend some time on this as it will directly impact me.
>> Random thoughts here.
>>
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object (note, I'm going to change
>> the name of the Factor class to Categorical before 0.8.0 final per
>> discussion with Nathaniel):
>
> The for-discussion prototype currently supports only strings. I had
> mentioned integral levels in the NEP but wanted to get more feedback
> first. It looks like you are using intervals as levels in things like
> qcut? That would add some complexity. I can think of a couple of
> possible approaches; I will have to try a few of them out to see
> which would make the most sense.
>
>> The API for constructing an enum/factor/categorical array from fixed
>> levels and an array of labels seems somewhat weak to me. A very common
>> scenario is to need to construct a factor from an array of integers
>> with an associated array of levels:
>>
>>
>> In [13]: labels
>> Out[13]:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>         7, 9, 7, 3, 3, 0, 4, 4])
>>
>> In [14]: levels
>> Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>
>> In [15]: Factor(labels, levels)
>> Out[15]:
>> Factor:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>         7, 9, 7, 3, 3, 0, 4, 4])
>> Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
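>>
>> (For concreteness: decoding such a factor back to values is just
>> fancy indexing on the levels, roughly
>>
>> values = np.asarray(levels)[labels]
>>
>> modulo NA handling; a sketch, not the actual pandas code.)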
>
> I originally had a very similar interface in the NEP. I was persuaded by
> Mark that this would be redundant:
>
> In [10]: levels = np.factor(['a', 'b', 'c'])
>          # or: levels = np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
> In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
> Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
>                dtype='factor({'c': 2, 'a': 0, 'b': 1})')
>
> This could also be spelled even more closely to your example as:
>
> labels.astype(levels)
>
> but I have not done much with casting yet, so this currently complains.
> However, would this satisfy your needs (modulo the separate question
> about more general integral or object levels)?
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
> I am just using INT32_MIN at the moment.
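>
> In your labels/levels terms, anything consuming the labels would just
> mask the sentinel first; a rough sketch, not the actual implementation:
>
> NA = np.iinfo(np.int32).min           # the INT32_MIN sentinel
> valid = labels != NA
> values = np.asarray(levels)[labels[valid]]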
>> Are the levels ordered (looks like Nathaniel brought this up already)?
>> It doesn't look like it. That is also necessary. You also need to be
>
> They currently compare based on their value:
>
> In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
> np.factor({'c':0, 'b':1, 'a':2}))
> In [21]: arr
> Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
> dtype='factor({'c': 0, 'a': 2, 'b': 1})')
> In [22]: arr.sort()
> In [23]: arr
> Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
> dtype='factor({'c': 0, 'a': 2, 'b': 1})')
>
>
>> able to sort the levels (which is a relabeling; I have lots of code in
>> use for this). In the context of groupby in pandas, when processing a
>> key (array of values) into a factor to be used for aggregating some
>> data, you have the option of returning an object whose levels are as
>> observed in the data, or of sorting them. Sorting can obviously be
>> very expensive depending on the number of groups in the data
>> (http://wesmckinney.com/blog/?p=437). Example:
>>
>> from pandas import DataFrame
>> from pandas.util.testing import rands
>> import numpy as np
>>
>> df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10,
>>             'data' : np.random.randn(1000000)})
>>
>> In [32]: timeit df.groupby('key').sum()
>> 1 loops, best of 3: 374 ms per loop
>>
>> In [33]: timeit df.groupby('key', sort=False).sum()
>> 10 loops, best of 3: 185 ms per loop
>>
>> The "factorization time" for the `key` column dominates the runtime;
>> the factor is computed once then reused if you keep the GroupBy object
>> around:
>>
>> In [36]: timeit grouped.sum()
>> 100 loops, best of 3: 6.05 ms per loop
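>>
>> To be concrete, that factorization step is essentially one hash pass
>> that assigns each previously-unseen key the next integer code. A
>> rough sketch (the real implementation uses khash, not a dict):
>>
>> def factorize(values):
>>     table, levels = {}, []
>>     labels = np.empty(len(values), dtype=np.int64)
>>     for i, v in enumerate(values):
>>         if v not in table:
>>             table[v] = len(levels)
>>             levels.append(v)
>>         labels[i] = table[v]
>>     return labels, levels  # levels in order of appearance
>>
>> With sort=True you additionally sort the levels and remap the labels.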
> Just some numbers for comparison. Factorization times:
>
> In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
> In [42]: levels = np.factor(lets)
> In [43]: data = [lets[i] for i in np.random.randint(0, len(lets), 1000000)]
> In [44]: %timeit np.array(data, levels)
> 10 loops, best of 3: 137 ms per loop
>
> And retrieving group indices/summing:
>
> In [8]: %timeit arr=='a'
> 1000 loops, best of 3: 1.52 ms per loop
> In [10]: vals = np.random.randn(1000000)
> In [20]: inds = [arr==x for x in lets]
> In [23]: %timeit for ind in inds: vals[ind].sum()
> 10 loops, best of 3: 48.3 ms per loop

(FYI you're comparing an O(NK) algorithm with an O(N) algorithm for
small K: the `arr == x` loop scans all N elements once per level, while
a single hash-based pass visits each element only once.)

> On my laptop your grouped.sum() took 22ms, so this is off by roughly
> a factor of two. But we should compare it on the same hardware,
> and with the same level data types. There is no doubt room for
> improvement, though.
>
> It would not be too bad to add some groupby functionality on top of
> this. I still need to add a mechanism for accessing and iterating over
> the levels.
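>
> For example, given access to the integer level codes, a one-pass
> grouped sum is just a weighted bincount. A sketch, where `codes` is a
> hypothetical array of level indices (not exposed yet):
>
> counts = np.bincount(codes, minlength=len(lets))
> sums = np.bincount(codes, weights=vals, minlength=len(lets))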
>
>> As another example of why ordered factors matter, consider a
>> quantile-cut function (google the "cut" function in R) I wrote recently:
>>
>>
>> In [40]: arr = Series(np.random.randn(1000000))
>>
>> In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])
>>
>> In [43]: arr.groupby(cats).describe().unstack(0)
>> Out[43]:
>>         (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
>> count    250000.000000      250000.000000     250000.000000   250000.000000
>> mean         -1.270623          -0.323092          0.326325        1.271519
>> std           0.491317           0.193254          0.193044        0.490611
>> min          -4.839798          -0.673224          0.001992        0.677177
>> 25%          -1.533021          -0.487450          0.158736        0.888502
>> 50%          -1.150136          -0.317501          0.320352        1.150480
>> 75%          -0.887974          -0.155197          0.490456        1.534709
>> max          -0.673224           0.001990          0.677176        4.913536
>>
>> If you don't have ordered levels, then the quantiles might come out in
>> the wrong order depending on how the strings sort or fall out of the
>> hash table (lexically, for example, '(-0.673, 0.00199]' sorts before
>> '(-4.85, -0.673]').
> We do have ordered levels. :) Now, there's currently no way to get a
> list of the levels, in order, but that should be trivial to add.
>
>> Nathaniel: my experience (see blog posting above for a bit more) is
>> that khash really crushes PyDict for two reasons: you can use it with
>> primitive types and avoid boxing, and you can preallocate. Its memory
>> footprint with large hash tables is also a fraction of PyDict's. The
>> Python memory allocator is not the problem: if you create millions of
>> Python objects, expect the RAM usage of the Python process to balloon
>> absurdly.
>>
>> Anyway, this is exciting work, assuming we get the API right and hit
>> all the use cases. On top of all this I am _very_ performance
>> sensitive, so you'll have to be pretty aggressive with benchmarking
>> things. I have concerns about ceding control over critical
>> functionality that I need for pandas (which has become a large and
>> very important library these days for a lot of people), but as long as
>> the pieces in NumPy are suitably mature and robust for me to switch to
>> them eventually, that would be great.
>>
>> I'll do my best to stay involved in the discussion, though I'm
>> juggling a lot of things these days (e.g. I have the PyData book
>> deadline approaching like a freight train).
>>
>> - Wes