[Numpy-discussion] proposal: smaller representation of string arrays
Chris Barker
chris.barker at noaa.gov
Mon Apr 24 14:21:48 EDT 2017
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:
>> BTW -- maybe we should keep the pathological use-case in mind: really
>> short strings. I think we are all thinking in terms of longer strings,
>> maybe a name field, where you might assign 32 bytes or so -- then someone
>> has an accented character in their name, and then gets 30 or 31 characters --
>> no big deal.
>>
>
> I wouldn't call it a pathological use case, it doesn't seem so uncommon to
> have large datasets of short strings.
>
It's pathological for using a variable-length encoding.
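To make that concrete, here is a minimal sketch, assuming UTF-8 as the variable-length encoding (the helper name and sample strings are made up for illustration). With a 32-byte name field, one accented character costs a single character of capacity; with a 5-character field, the worst case needs four times the bytes of the ASCII case:

```python
def utf8_len(s: str) -> int:
    """Number of bytes needed to store s as UTF-8."""
    return len(s.encode("utf-8"))

# A 32-byte "name" field: pure ASCII fits 32 characters.
ascii_name = "a" * 32
ascii_name_bytes = utf8_len(ascii_name)

# One accented character (U+00E9, 2 bytes in UTF-8) plus 30 ASCII
# characters also fills 32 bytes, so only 31 characters fit.
accented_name = "\u00e9" + "a" * 30
accented_name_bytes = utf8_len(accented_name)

# A short field of 5 characters: 5 bytes for ASCII, but a field that must
# hold ANY 5 code points has to reserve 4 bytes for each of them.
short_ascii_bytes = utf8_len("abcde")
short_worst_bytes = utf8_len("\U0001F600" * 5)
```

So a field sized for the worst case wastes most of its space on short ASCII data, while a field sized for the ASCII case cannot promise to hold 5 arbitrary characters.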
> I personally deal with a database of hundreds of billions of 2 to 5
> character ASCII strings. This has been a significant blocker to Python 3
> adoption in my world.
>
I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...
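As a rough sketch of the sizes involved (the field width and function name are hypothetical, chosen to match the 2 to 5 character example), a fixed-width field costs the same number of bytes per element whatever the strings contain:

```python
FIELD_CHARS = 5  # hypothetical field width, in characters

def fixed_width_size(n_chars: int, bytes_per_char: int) -> int:
    """Bytes per element for a fixed-width field."""
    return n_chars * bytes_per_char

one_byte_size = fixed_width_size(FIELD_CHARS, 1)  # e.g. ASCII or Latin-1
ucs4_size = fixed_width_size(FIELD_CHARS, 4)      # UCS-4 / UTF-32

# Cross-check against Python's own codecs ("utf-32-le" writes no BOM).
latin1_encoded = "abcde".encode("latin-1")
utf32_encoded = "abcde".encode("utf-32-le")
```

These per-element sizes match what NumPy's existing `S5` (one byte per character) and `U5` (four bytes per character) dtypes use.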
> BTW, for those new to the list or with a short memory, this topic has been
> discussed fairly extensively at least 3 times before. Hopefully the
> *fourth* time will be the charm!
>
yes, let's hope so!
The big difference now is that Julian seems to be committed to actually
making it happen!
Thanks Julian!
Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.
I have strong opinions, but would still rather see any of the ideas on the
table implemented than nothing.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov