[Numpy-discussion] proposal: smaller representation of string arrays

Francesc Alted faltet at gmail.com
Thu Apr 27 07:38:06 EDT 2017


2017-04-27 13:27 GMT+02:00 Neal Becker <ndbecker2 at gmail.com>:

> So while compression+UCS-4 might be OK for an out-of-core representation,
> what about in-core?  blosc+UCS-4?  I don't think that works for mmap, does
> it?
>

Correct, the real problem is mmap for an out-of-core HDF5 representation,
I presume.

For in-memory use, there are several compressed data containers, such as:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional data
containers)
https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others).
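
For instance, a minimal sketch with zarr (the exact figures will depend on
the data and the codec):

    import numpy as np
    import zarr

    a = np.array(['hello', 'wörld'] * 100000, dtype='U10')  # UCS-4: 40 bytes/item
    z = zarr.array(a, chunks=10000)   # Blosc-compressed chunks by default
    print(z.nbytes, z.nbytes_stored)  # stored size is a small fraction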



>
> On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <faltet at gmail.com> wrote:
>
>> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer <shoyer at gmail.com>:
>>
>>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>
>>>> It's worth noting that neither of the major HDF5 bindings supports
>>>> Unicode arrays, despite user requests for years. The sticking point
>>>> seems to be the difference between HDF5's view of a Unicode string
>>>> array (sized by the number of bytes of UTF-8 data) and numpy's current
>>>> view (sized, because of UCS-4, by the number of
>>>> characters/codepoints). So there are HDF5 files out there that none of
>>>> our HDF5 bindings can read, and it is impossible to write certain data
>>>> efficiently.
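>>>>
>>>> A quick illustration of the mismatch (a sketch, not code from either
>>>> binding): a numpy 'U' itemsize is fixed by the codepoint count, while
>>>> the UTF-8 size depends on the actual contents:
>>>>
>>>>     import numpy as np
>>>>
>>>>     a = np.array(['hello', 'héllo'], dtype='U5')
>>>>     print(a.itemsize)                           # 20: 5 codepoints * 4 bytes (UCS-4)
>>>>     print([len(s.encode('utf-8')) for s in a])  # [5, 6]: sized by UTF-8 bytes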
>>>>
>>>>
>>>> I would really like to hear more from the authors of these libraries
>>>> about what exactly it is they feel they're missing. Is it that they want
>>>> numpy to enforce the length limit early, to catch errors when the array
>>>> is modified instead of when they go to write it to the file? Is it that
>>>> they really want an O(1) way to look at an array and know the maximum
>>>> number of bytes needed to represent it in UTF-8? Is it that UTF-8 <->
>>>> UTF-32 conversion is really annoying, and files that need it are rare,
>>>> so they haven't had the motivation to implement it? My impression is
>>>> similar to Julian's: you *could* implement HDF5 fixed-length UTF-8 <->
>>>> numpy 'U' arrays with a few dozen lines of code, which is nothing
>>>> compared to all the other hoops these libraries are already jumping
>>>> through, so if this is really the roadblock then I must be missing
>>>> something.
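>>>>
>>>> For concreteness, a rough sketch of the kind of round trip meant here
>>>> (illustrative helper names, padding to the widest encoding; not the
>>>> actual code of h5py or PyTables):
>>>>
>>>>     import numpy as np
>>>>
>>>>     def u_to_fixed_utf8(arr):
>>>>         # Encode each UCS-4 element; pad to the widest UTF-8 encoding.
>>>>         encoded = [s.encode('utf-8') for s in arr.ravel()]
>>>>         width = max([len(b) for b in encoded] + [1])
>>>>         return np.array(encoded, dtype='S%d' % width).reshape(arr.shape)
>>>>
>>>>     def fixed_utf8_to_u(arr):
>>>>         # Decode the null-padded UTF-8 bytes back into a 'U' array.
>>>>         decoded = [b.decode('utf-8') for b in arr.ravel()]
>>>>         width = max([len(s) for s in decoded] + [1])
>>>>         return np.array(decoded, dtype='U%d' % width).reshape(arr.shape)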
>>>>
>>>
>>> I actually agree with you. I think it's mostly a matter of convenience
>>> that h5py matched up HDF5 dtypes with numpy dtypes:
>>> fixed width ASCII -> np.string_/bytes
>>> variable length ASCII -> object arrays of np.string_/bytes
>>> variable length UTF-8 -> object arrays of unicode
>>>
>>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>>> there's not an easy fix.
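>>>
>>> Concretely, the variable length cases go through h5py's special
>>> dtypes. A minimal sketch (details vary across h5py versions):
>>>
>>>     import h5py
>>>
>>>     # variable length UTF-8 strings, read back as object arrays
>>>     dt = h5py.special_dtype(vlen=str)
>>>     with h5py.File('strings.h5', 'w') as f:
>>>         ds = f.create_dataset('names', (2,), dtype=dt)
>>>         ds[:] = ['a', 'bé']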
>>>
>>> We absolutely could fix h5py by mapping everything to object arrays of
>>> Python unicode strings, as has been discussed (
>>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
>>> would be a workable but non-ideal solution, since numpy currently has
>>> no fixed width UTF-8 dtype.
>>>
>>> For fixed width ASCII arrays, this would mean increased convenience for
>>> Python 3 users, at the price of decreased convenience for Python 2 users
>>> (arrays now contain boxed Python objects), unless we made the h5py behavior
>>> dependent on the version of Python. Hence, we're back here, waiting for
>>> better dtypes for encoded strings.
>>>
>>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>>> handling ASCII arrays as strings) and UTF-8 with length equal to the number
>>> of bytes.
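>>>
>>> To illustrate the surrogateescape part (plain Python 3 behavior, not a
>>> numpy feature yet): arbitrary bytes can round-trip through a str
>>> without loss:
>>>
>>>     raw = b'caf\xe9'                             # not valid ASCII
>>>     s = raw.decode('ascii', 'surrogateescape')   # 0xe9 -> lone surrogate
>>>     assert s.encode('ascii', 'surrogateescape') == raw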
>>>
>>
>> Well, I'll say upfront that I have not read this discussion in full, but
>> apparently some opinions from developers of HDF5 Python packages would
>> be welcome here, so here I go :)
>>
>> As a long-time developer of one of the Python HDF5 packages (PyTables),
>> I have always been of the opinion that plain ASCII (for byte strings)
>> and UCS-4 (for Unicode) would be the appropriate dtypes for storing
>> large amounts of data, especially for disk storage (but also for
>> compressed in-memory containers).  My rationale is that, although UCS-4
>> may require far more space, compression reduces that to essentially the
>> space required by compressed UTF-8 (I won't go into detail, but
>> basically this is possible by using the shuffle filter).
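>>
>> (A minimal sketch of that point, using the python-blosc package; the
>> exact figures depend on the data:
>>
>>     import numpy as np
>>     import blosc
>>
>>     a = np.array(['hello world'] * 100000, dtype='U16')  # UCS-4: 64 bytes/item
>>     raw = a.tobytes()
>>     # typesize=4 plus shuffle groups the mostly-zero high bytes of each
>>     # UCS-4 code unit together, so they compress away almost entirely.
>>     packed = blosc.compress(raw, typesize=4, shuffle=blosc.SHUFFLE)
>>     print(len(raw), len(packed))
>>
>> so compressed UCS-4 ends up close to what compressed UTF-8 would take.)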
>>
>> I remember advocating for UCS-4 adoption in the HDF5 library many years
>> ago (2007?), but I had no success and UTF-8 was chosen as the best
>> candidate.  So the boat of HDF5 using UTF-8 sailed many years ago, and
>> I don't think there is any going back (not even to add UCS-4 support,
>> although I continue to think that would be a good idea).  So I suppose
>> that, if HDF5 is an important format for NumPy users (and I think it
>> is), a solution for representing Unicode characters using UTF-8 in
>> NumPy would be desirable (even at the risk of making the implementation
>> more complex).
>>
>> Francesc
>>>>
>>>


-- 
Francesc Alted