[Numpy-discussion] proposal: smaller representation of string arrays

Julian Taylor jtaylor.debian at googlemail.com
Thu Apr 20 15:27:17 EDT 2017


On 20.04.2017 20:53, Robert Kern wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> <jtaylor.debian at googlemail.com>
> wrote:
> 
>> Do you have comments on how to go forward, in particular in regards to
>> new dtype vs modify np.unicode?
> 
> Can we restate the use cases explicitly? I feel like we ended up with
> the current sub-optimal situation because we never really laid out the
> use cases. We just felt like we needed bytestring and unicode dtypes,
> more out of completionism than anything, and we made a bunch of
> assumptions just to get each one done. I think there may be broad
> agreement that many of those assumptions are "wrong", but it would be
> good to reference that against concretely-stated use cases.

We ended up in this situation because we did not take the opportunity to
break compatibility when Python 3 support was added.
We should have made the string dtype an encoded byte type (ASCII or
latin1) in Python 3 instead of null-terminated unencoded bytes, which do
not make much practical sense.

So the use case is very simple: give users of the string dtype a
migration path that does not involve converting to full UTF-32 Unicode.
A latin1-encoded bytes dtype would allow that.
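
To make the size argument concrete, here is a minimal sketch using only
Python's built-in codecs rather than NumPy itself. The sample words and the
helper name fixed_width_size are made up for illustration, and the sketch
assumes every element is padded to the width of the longest string, as a
fixed-width dtype would do.

```python
def fixed_width_size(words, encoding):
    """Bytes needed to store `words` as fixed-width records in `encoding`.

    Hypothetical helper: every element is padded to the longest encoded word.
    """
    encoded = [w.encode(encoding) for w in words]
    width = max(len(b) for b in encoded)
    return width * len(words)


# Sample data (illustrative only); "caf\u00e9" is representable in latin1.
words = ["numpy", "caf\u00e9", "dtype"]

latin1_bytes = fixed_width_size(words, "latin-1")    # 1 byte per character
utf32_bytes = fixed_width_size(words, "utf-32-le")   # 4 bytes per character, no BOM

print(latin1_bytes, utf32_bytes)
```

For text that fits in latin1, the UTF-32 layout is four times as large, which
is the cost a latin1-encoded dtype would avoid.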

As we already have the infrastructure, this same dtype can support more
than just latin1 with minimal effort. For the fixed-size encodings that
Python supports, it is literally a matter of adding an enum entry, two new
switch clauses, a little bit of dtype string parsing, and test cases.
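
The "enum entry plus two switch clauses" shape can be sketched in Python as
follows. This is not NumPy's C implementation; the names FixedEncoding,
CHAR_WIDTH, store and load are hypothetical and only illustrate how one
fixed-width element type could be parameterized by encoding.

```python
from enum import Enum


class FixedEncoding(Enum):
    # Hypothetical enum: supporting another fixed-width codec adds one entry.
    ASCII = "ascii"
    LATIN1 = "latin-1"
    UTF32 = "utf-32-le"


# Bytes per character for each fixed-width encoding.
CHAR_WIDTH = {
    FixedEncoding.ASCII: 1,
    FixedEncoding.LATIN1: 1,
    FixedEncoding.UTF32: 4,
}


def store(text, nchars, enc):
    """Encode `text` into a null-padded buffer holding `nchars` characters."""
    raw = text.encode(enc.value)
    size = nchars * CHAR_WIDTH[enc]
    if len(raw) > size:
        raise ValueError("string does not fit in the element")
    return raw.ljust(size, b"\x00")


def load(buf, enc):
    """Decode a buffer written by `store`, dropping trailing null characters."""
    width = CHAR_WIDTH[enc]
    end = len(buf)
    # Strip padding one whole character at a time so UTF-32 stays aligned.
    while end >= width and buf[end - width:end] == b"\x00" * width:
        end -= width
    return buf[:end].decode(enc.value)


latin1_buf = store("caf\u00e9", 6, FixedEncoding.LATIN1)
utf32_buf = store("caf\u00e9", 6, FixedEncoding.UTF32)
print(len(latin1_buf), len(utf32_buf))
print(load(latin1_buf, FixedEncoding.LATIN1) == load(utf32_buf, FixedEncoding.UTF32))
```

The two functions stand in for the two switch clauses (one for writing an
element, one for reading it back). Variable-width encodings such as UTF-8 are
deliberately left out, since the character count no longer determines the
element size.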


Having some form of variable-length string handling would be nice, but
that is another topic altogether.
Building in support just for variable-length strings seems like overkill,
as the string dtype is not that important and object arrays should work
reasonably well for this use case already.
