[Numpy-discussion] String type again.

Stephan Hoyer shoyer at gmail.com
Tue Jul 15 14:21:39 EDT 2014


On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel <olivier.grisel at ensta.org>
wrote:

> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray at mac.com>:
> > I've been toying with the idea of creating an array type for interned
> > strings.  In many applications dealing with large arrays of variable size
> > strings, the strings come from a relatively short set of names.  Arrays of
> > interned strings can be manipulated very efficiently because in many
> > respects they are just like arrays of integers.
>
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
>

Pandas has a new "categorical" type (just merged into master) which is
pretty similar to interned strings:
https://github.com/pydata/pandas/pull/7217
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Of course, it would be ideal for numpy itself to natively support
categoricals and variable-length strings.
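
To make the "just like arrays of integers" point concrete, here is a rough
sketch in plain NumPy. It is not the pandas implementation and not a proposed
dtype; the sample data and variable names are made up for illustration. It
emulates interning by storing each distinct string once in a lookup table and
keeping only integer codes per record:

```python
import numpy as np

# Hypothetical sample data: many records, few distinct string values.
names = np.array(["red", "green", "red", "blue", "green", "red"])

# np.unique returns the sorted distinct values; with return_inverse=True it
# also returns, for each original element, its integer index into that table.
table, codes = np.unique(names, return_inverse=True)

# Look up the code for one value once, then compare integers elementwise.
red_code = int(np.flatnonzero(table == "red")[0])
is_red = codes == red_code

# Counting occurrences per category is an integer histogram over the codes.
counts = np.bincount(codes, minlength=len(table))

# The original strings are recovered by indexing the table with the codes.
restored = table[codes]
```

The per-record storage is one small integer regardless of string length,
which is where the efficiency comes from when the set of names is short.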

Best,
Stephan