[Numpy-discussion] Enum/Factor NEP (now with code)

Nathaniel Smith njs at pobox.com
Wed Jun 13 12:23:11 EDT 2012


On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
>> I'm inclined to say therefore that we should just drop the "open type"
>> idea, since it adds complexity but doesn't seem to actually solve the
>> problem it's designed for.
>
> If one wants to have an "open", hassle-free enum, an alternative would
> be to cryptographically hash the enum string. I'd trust 64 bits of hash
> for this purpose.
>
> The obvious disadvantage is the extra space used, but it'd be a bit more
> hassle-free compared to regular enums; you'd never have to fix the set
> of enum strings and they'd always be directly comparable across
> different arrays. HDF libraries etc. could compress it at the storage
> layer, storing the enum mapping in the metadata.
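
Just to make sure we're talking about the same thing, I take the
proposal to be roughly the following (the enum_hash helper and the
choice of sha256 here are only illustrative, not part of the proposal):

import hashlib
import struct

import numpy as np

def enum_hash(label):
    # Truncate a cryptographic hash of the label to 64 bits, on the
    # assumption that distinct labels will never collide.
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return struct.unpack("<Q", digest[:8])[0]

labels = ["red", "green", "blue", "green"]
arr = np.array([enum_hash(s) for s in labels], dtype=np.uint64)
# Equal labels hash to equal values, so arrays built independently are
# directly comparable without agreeing on a shared string<->int table.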

You'd trust 64 bits to be collision-free for all strings ever stored
in numpy, eternally? I wouldn't. Anyway, if the goal is to store an
arbitrary set of strings in 64 bits apiece, then there is no downside
to just using an object array + interning (like pandas does now), and
this *is* guaranteed to be collision-free. Maybe it would be useful to
have a "heap string" dtype, but that'd be something different.

AFAIK all the cases where an explicit categorical type adds value over
this are the ones where having an explicit set of levels is useful.
Representing HDF5 enums or R factors requires a way to specify
arbitrary string<->integer mappings, and there are algorithms (e.g. in
charlton) that are much more efficient if they can determine the set of
possible levels directly, without scanning the whole array.
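
To illustrate, once the level set is explicit the mapping is known up
front (a purely hypothetical sketch, not a proposed API):

import numpy as np

# Explicit string<->integer mapping, as for an HDF5 enum or an R factor.
levels = {"low": 0, "medium": 1, "high": 2}
codes = np.array([levels[s] for s in ["low", "high", "high", "medium"]],
                 dtype=np.int8)
# Code that knows the level set can, e.g., allocate one dummy column per
# level for a design matrix without scanning the data for distinct values.
dummies = np.zeros((len(codes), len(levels)), dtype=np.int8)
dummies[np.arange(len(codes)), codes] = 1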

-N


