[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Thu Apr 20 16:00:48 EDT 2017


On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
>
> On 20.04.2017 20:53, Robert Kern wrote:
> > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> > <jtaylor.debian at googlemail.com>
> > wrote:
> >
> >> Do you have comments on how to go forward, in particular with regard to
> >> adding a new dtype vs. modifying np.unicode?
> >
> > Can we restate the use cases explicitly? I feel like we ended up with
> > the current sub-optimal situation because we never really laid out the
> > use cases. We just felt like we needed bytestring and unicode dtypes,
> > more out of completionism than anything, and we made a bunch of
> > assumptions just to get each one done. I think there may be broad
> > agreement that many of those assumptions are "wrong", but it would be
> > good to reference that against concretely-stated use cases.
>
> We ended up in this situation because we did not take the opportunity to
> break compatibility when Python 3 support was added.

Oh, the root cause I'm thinking of long predates Python 3, or even numpy
1.0. There never was an explicitly fleshed out use case for unicode arrays
other than "Python has unicode strings, so we should have a string dtype
that supports it". Hence the "we only support UCS4" implementation; it's
not like anyone *wants* UCS4 or interoperates with UCS4, but it does
represent all possible Unicode strings. The Python 3 transition merely
exacerbated the problem by making Unicode strings the primary string type
to work with. I don't really want to ameliorate the exacerbation without
addressing the root problem, which is worth solving.
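
To make the cost of the UCS4-only layout concrete, here is a minimal sketch
using only the standard library. It is an illustration, not numpy's
implementation: it assumes the "utf-32-le" codec as a stand-in for a fixed
4-bytes-per-code-point representation, and the helper name storage_cost is
hypothetical.

```python
def storage_cost(text):
    """Return (ucs4_bytes, utf8_bytes) needed to store `text`.

    Hypothetical helper for illustration only.
    """
    # UTF-32 without a BOM: always 4 bytes per code point.
    ucs4 = text.encode("utf-32-le")
    # UTF-8: 1 to 4 bytes per code point, 1 byte for ASCII.
    utf8 = text.encode("utf-8")
    return len(ucs4), len(utf8)


if __name__ == "__main__":
    # Escapes are used so the code points are unambiguous.
    for text in ["numpy", "na\u00efve", "\u65e5\u672c\u8a9e"]:
        ucs4_bytes, utf8_bytes = storage_cost(text)
        print(f"{len(text)} code points: "
              f"{ucs4_bytes} bytes as UCS4, {utf8_bytes} bytes as UTF-8")
```

For pure-ASCII data the fixed 4-byte layout takes four times the space of
UTF-8; for the three-character CJK sample the difference is only 12 bytes
versus 9.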

I will put this down as a marker use case: Support HDF5's fixed-width UTF-8
arrays.
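
As a rough sketch of what a fixed-width UTF-8 field involves, the following
packs a string into a fixed number of bytes and reads it back. The
assumptions are mine, not from HDF5 or numpy: the field is padded with NUL
bytes, an over-long string is rejected rather than truncated, and the
function names pack_fixed_utf8 and unpack_fixed_utf8 are hypothetical.

```python
def pack_fixed_utf8(text, width):
    """Encode `text` as UTF-8 into exactly `width` bytes, NUL-padded.

    The width is counted in bytes, not characters, so the number of
    characters that fit depends on the text itself.
    """
    data = text.encode("utf-8")
    if len(data) > width:
        # Truncating at a byte boundary could split a multi-byte
        # character, so this sketch refuses instead of truncating.
        raise ValueError(
            f"{len(data)} encoded bytes do not fit in a {width}-byte field")
    return data.ljust(width, b"\x00")


def unpack_fixed_utf8(field):
    """Decode a NUL-padded fixed-width UTF-8 field back to a str."""
    return field.rstrip(b"\x00").decode("utf-8")


if __name__ == "__main__":
    field = pack_fixed_utf8("na\u00efve", 8)
    print(len(field), repr(unpack_fixed_utf8(field)))
```

One consequence of this layout is that strings ending in NUL characters
cannot be round-tripped, since trailing NULs are indistinguishable from
padding.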

--
Robert Kern