[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 15:07:47 EDT 2017


On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
>
> On 26.04.2017 19:08, Robert Kern wrote:
> > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> > <jtaylor.debian at googlemail.com> wrote:
> >
> >> Indeed,
> >> Most of this discussion is irrelevant to numpy.
> >> Numpy only really deals with the in-memory storage of strings. And in
> >> that it is limited to fixed-length strings (in bytes/codepoints).
> >> How you get your messy strings into numpy arrays is not very relevant
> >> to the discussion of a smaller representation of strings.
> >> You couldn't get messy strings into numpy without first sorting them out
> >> yourself before, and you won't be able to afterwards.
> >> Numpy will offer a set of encodings, the user chooses which one is best
> >> for the use case, and if the user screws it up, it is not numpy's
> >> problem.
> >>
> >> You currently only have a few ways to even construct string arrays:
> >> - array construction and loops
> >> - genfromtxt (which is again just a loop)
> >> - memory mapping, which I seriously doubt anyone actually does for the S
> >> and U dtype
> >
> > I fear that you decided that the discussion was irrelevant and thus did
> > not read it rather than reading it to decide that it was not relevant.
> > Because several of us have shown that, yes indeed, we do memory-map
> > string arrays.
> >
> > You can add to this list C APIs, like that of libhdf5, that need to
> > communicate (Unicode) string arrays.
> >
> > Look, I know I can be tedious, but *please* go back and read this
> > discussion. We have concrete use cases outlined. We can give you more
> > details if you need them. We all feel the pain of the rushed, inadequate
> > implementation of the U dtype. But each of our pains is a little bit
> > different; you obviously aren't experiencing the same pains that I am.
>
> I have read every mail and it has been a large waste of time. Everything
> has been said already many times in the last few years.
> Even if you do memory map string arrays, I have not seen a concrete use
> case for it in the mails beyond "would be nice to have", without any
> backing in actual code, though I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:


https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code for the full variety of
proprietary file formats that I've written readers for, but I can assure
you that I have memory-mapped many string arrays in my time, usually
embedded as columns in structured arrays. It is not "nice to have"; it is
"have done many times and needs better support".
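
To make the pattern concrete, here's a minimal sketch with synthetic data
(plain numpy as it exists today, obviously not one of the proprietary
formats): a fixed-width string column embedded in a structured dtype,
written out and memory-mapped back:

  import numpy as np

  # a structured dtype with a fixed-width string column, as found in many
  # binary table formats
  dt = np.dtype([('name', 'S8'), ('value', '<f8')])

  # write a small synthetic file ...
  rec = np.array([(b'alpha', 1.0), (b'beta', 2.0)], dtype=dt)
  rec.tofile('example.dat')

  # ... and memory-map it back; the string column is accessed in place,
  # with no copy and no per-element decoding
  mm = np.memmap('example.dat', dtype=dt, mode='r')
  print(mm['name'])   # [b'alpha' b'beta']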

> In any case it is still irrelevant. My proposal only _adds_ additional
> cases that can be mmapped. It does not prevent you from doing what you
> have been doing before.

You are the one who keeps worrying about the additional complexity, both in
the code and in the mental burden on our users, of adding new overlapping
dtypes and solutions, and you're not wrong about that. I think it behooves
us to consider whether there are solutions that solve multiple related
problems at once instead of adding new dtypes piecemeal to solve individual
problems.

> >> Having a new dtype changes nothing here. You still need to create numpy
> >> arrays from python strings which are well defined and clean.
> >> If you put something in that doesn't encode, you get an encoding error.
> >> No oddities like surrogate escapes are needed; numpy arrays are not
> >> interfaces to operating systems, nor does numpy need to _add_ support
> >> for historical oddities beyond what it already has.
> >> If you want to represent bytes exactly as they came in, don't use a text
> >> dtype (which includes the S dtype); use i1 instead.
> >
> > Thomas Aldcroft has demonstrated the problem with this approach. numpy
> > arrays are often interfaces to files that have tons of historical
> > oddities.
>
> This does not matter for numpy; the text dtype is well defined as bytes
> with a specific encoding and null padding.

You cannot dismiss something as "not mattering for *numpy*" just because
your new, *proposed* text dtype doesn't support it.

You seem to have fixed on a course of action and are defining everyone
else's use cases as out-of-scope because your course of action doesn't
support them. That's backwards. Define the use cases first, determine the
requirements, then build a solution that meets those requirements. We
skipped those steps before, and that's why we're all feeling the pain.

> If you have an historical
> oddity that does not fit, do not use the text dtype but use a pure byte
> array instead.

That's his status quo, and he finds it unworkable. Now, I have proposed a
way out of that by supporting ASCII-surrogateescape as a specific encoding.
It's not an ISO standard encoding, but the surrogateescape mechanism seems
to be what the Python world has settled on for such situations. Would you
support that with your parameterized-encoding text dtype?
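
For concreteness, this is plain Python 3 behavior, no numpy involved: bytes
that don't decode as ASCII are smuggled through as lone surrogates and
round-trip back to the exact original bytes:

  >>> raw = b'caf\xe9 \xff'                  # bytes that are not valid ASCII
  >>> s = raw.decode('ascii', 'surrogateescape')
  >>> s
  'caf\udce9 \udcff'
  >>> s.encode('ascii', 'surrogateescape') == raw   # lossless round trip
  True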

> >> Concerning variable-sized strings, this is simply not going to happen.
> >> Nobody is going to rewrite numpy to support it, especially not just for
> >> something as unimportant as strings.
> >> The best you are going to get (or rather, already have) is object arrays.
> >> It makes no sense to discuss it unless someone comes up with an actual
> >> proposal and the willingness to code it.
> >
> > No one has suggested such a thing. At most, we've talked about
> > specializing object arrays.
> >
> >> What is a relevant discussion is whether we really need a more compact
> >> but limited representation of text than 4-byte utf32 at all.
> >> Its use case is for the most part just python3 porting and saving
> >> some memory in some ASCII-heavy cases, e.g. astronomy.
> >> It is not that significant anymore, as porting to python3 has mostly
> >> already happened via the ugly byte workaround, and memory saving is
> >> probably not as significant in the context of numpy, which is already
> >> heavy on memory usage.
> >>
> >> My initial approach was to not add a new dtype but to make unicode
> >> parametrizable, which would have meant almost no cluttering of numpy's
> >> internals and would have kept the api more or less consistent, making
> >> this a relatively simple addition of minor functionality for people
> >> that want it.
> >> But adding a completely new, partially redundant dtype for this use case
> >> may be too large a change to the api. Having two partially redundant
> >> string types may confuse users more than our current status quo of a
> >> single string type (U).
> >>
> >> Discussing whether we want to support truncated utf8 has some merit, as
> >> it is a decision whether to give the users an even larger gun to shoot
> >> themselves in the foot with.
> >> But I'd like to focus first on the 1-byte type to add a symmetric API
> >> for python2 and python3.
> >> utf8 can always be added later should we deem it a good idea.
> >
> > What is your current proposal? A string dtype parameterized with the
> > encoding (initially supporting the latin-1 that you desire and maybe
> > adding utf-8 later)? Or a latin-1-specific dtype such that we will have
> > to add a second utf-8 dtype at a later date?
>
> My proposal is a single new parameterizable dtype. Adding a separate
> dtype for each encoding seems unnecessary to me given that numpy
> already supports parameterizable types.
> For example, datetime is very similar: it is basically encoded integers,
> with multiple encodings (= units) supported.

Okay great. What encodings are you intending to support? You seem to be
pushing against supporting UTF-8.
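
For reference, this is what the datetime64 analogy looks like in current
numpy: the unit is a parameter of the dtype rather than a separate dtype
per unit. Presumably the new text dtype would carry the encoding the same
way; the commented-out spellings at the end are purely hypothetical and are
exactly the kind of thing I'm asking you to pin down:

  >>> import numpy as np
  >>> np.array(['2017-04-26'], dtype='datetime64[D]')
  array(['2017-04-26'], dtype='datetime64[D]')
  >>> np.array(['2017-04-26T15:07'], dtype='datetime64[m]')
  array(['2017-04-26T15:07'], dtype='datetime64[m]')
  >>> # hypothetical spellings for an encoding-parameterized text dtype --
  >>> # not real numpy syntax, shown only to make the question concrete:
  >>> # np.array(['voilà'], dtype='text8[latin-1]')
  >>> # np.array(['voilà'], dtype='text8[utf-8]')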

> > If you're not going to support arbitrary encodings right off the bat,
> > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first
> > as they seem to knock off more use cases straight away.
>
> Please list the use cases in the context of numpy usage. hdf5 is the
> most obvious, but how exactly would hdf5 use a utf8 array in the actual
> implementation?

File reading:

The user requests data from a fixed-width UTF-8 Dataset. E.g. h5py:

  >>> a = h5['/some_utf8_array'][:]

h5py looks at the Dataset's shape (with the fixed width defined in bytes)
and allocates a numpy UTF-8 array, giving the dtype the same byte width as
specified by the Dataset. h5py fills in the data quickly in bulk using
libhdf5's efficient APIs for such data movement. The user now has a numpy
array whose scalars come out/go in as `unicode/str` objects.
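
A rough sketch of what that read path might look like if such a dtype
existed. The 'S<n>[utf-8]' dtype spelling is hypothetical; the h5py calls
(dset.id.get_type().get_size(), Dataset.read_direct) are the real low-level
API:

  import numpy as np
  import h5py

  def read_fixed_utf8(dset):
      # byte width of the fixed-length HDF5 string type (real h5py low-level API)
      nbytes = dset.id.get_type().get_size()
      # HYPOTHETICAL dtype spelling: fixed width in bytes, encoding utf-8
      dt = np.dtype('S%d[utf-8]' % nbytes)
      out = np.empty(dset.shape, dtype=dt)
      # one bulk copy straight into the array buffer; no per-element decode,
      # because the in-memory representation is already UTF-8 bytes
      dset.read_direct(out)
      return out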

File writing:

The user needs to create a string Dataset with Unicode characters. A
fixed-width UTF-8 Dataset is preferred (in this case) over HDF5
variable-width Datasets because the latter are not compressible, and the
strings are all reasonably close in size. The user's in-memory data may or
may not be in a UTF-8 array (it might be in an object array of
`unicode/str` string objects or a U-dtype array), but h5py can use numpy's
conversion machinery to turn it into a numpy UTF-8 array (much like it can
accept lists of floats and cast them to a float64 array). It can look at the
UTF-8 array's shape and itemsize to create the corresponding Dataset, and
then pass the array to libhdf5's efficient APIs for copying arrays of data
into a Dataset.
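
And a similarly rough sketch of the write path, under the same hypothetical
dtype spelling; h5py.string_dtype is the (newer) h5py helper for declaring
a fixed-length UTF-8 string type in the file:

  import numpy as np
  import h5py

  def write_fixed_utf8(h5file, name, strings, width=64):
      # numpy's casting machinery turns str objects / object arrays / U arrays
      # into the HYPOTHETICAL fixed-width UTF-8 dtype, much as it casts a list
      # of floats to float64
      arr = np.asarray(strings, dtype='S%d[utf-8]' % width)
      # create a fixed-width UTF-8 Dataset whose stored width matches itemsize,
      # then hand the buffer to libhdf5 in one bulk copy
      dset = h5file.create_dataset(
          name, shape=arr.shape,
          dtype=h5py.string_dtype(encoding='utf-8', length=arr.itemsize))
      dset[...] = arr
      return dset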

> What you save by having utf8 in the numpy array is replacing a decoding
> and encoding step with a null-padding stripping step.
> That doesn't seem very worthwhile compared to all the other overheads
> involved.

It's worthwhile enough that, for lack of it, neither of the major HDF5
bindings supports Unicode arrays, despite user requests for years. The
sticking point seems to be the difference between HDF5's view of a Unicode
string array (defined in size by the bytes of UTF-8 data) and numpy's
current view of a Unicode string array (because of UCS-4, defined by the
number of characters/codepoints/whatever). So there are HDF5 files out
there that none of our HDF5 bindings can read, and it is impossible to
write certain data efficiently.
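
To spell out that size mismatch with nothing but today's Python and numpy:

  >>> import numpy as np
  >>> a = np.array(['voilà'], dtype='U5')
  >>> a.itemsize                       # numpy: 5 codepoints * 4 bytes (UCS-4)
  20
  >>> len('voilà'.encode('utf-8'))     # HDF5: the same string is 6 bytes of UTF-8
  6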

--
Robert Kern