[Numpy-discussion] proposal: smaller representation of string arrays

Sebastian Berg sebastian at sipsolutions.net
Wed Apr 26 14:38:09 EDT 2017


On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote:
> On 26.04.2017 19:08, Robert Kern wrote:
> > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> > <jtaylor.debian at googlemail.com> wrote:
> > 
> > > Indeed, most of this discussion is irrelevant to numpy.
> > > Numpy only really deals with the in-memory storage of strings, and
> > > in that it is limited to fixed-length strings (in bytes/codepoints).
> > > How you get your messy strings into numpy arrays is not very
> > > relevant to the discussion of a smaller representation of strings.
> > > You couldn't get messy strings into numpy before without first
> > > sorting them out yourself, and you won't be able to afterwards
> > > either.
> > > Numpy will offer a set of encodings, the user chooses which one is
> > > best for the use case, and if the user screws it up, it is not
> > > numpy's problem.
> > > 
> > > You currently only have a few ways to even construct string arrays:
> > > - array construction and loops
> > > - genfromtxt (which is again just a loop)
> > > - memory mapping, which I seriously doubt anyone actually does for
> > >   the S and U dtypes
> > 
> > I fear that you decided that the discussion was irrelevant and thus
> > did not read it, rather than reading it to decide that it was not
> > relevant. Because several of us have shown that, yes indeed, we do
> > memory-map string arrays.
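> > A minimal sketch of the kind of thing we do (the file name and the
> > field width here are just placeholders):
> > 
> >     import numpy as np
> >     # map a fixed-width byte-string column straight from disk,
> >     # no copy and no decode step
> >     names = np.memmap('catalog.dat', dtype='S16', mode='r')
> >     print(names[:5])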
> > 
> > You can add to this list C APIs, like that of libhdf5, that need to
> > communicate (Unicode) string arrays.
> > 
> > Look, I know I can be tedious, but *please* go back and read this
> > discussion. We have concrete use cases outlined. We can give you more
> > details if you need them. We all feel the pain of the rushed,
> > inadequate implementation of the U dtype. But each of our pains is a
> > little bit different; you obviously aren't experiencing the same
> > pains that I am.
> 
> I have read every mail and it has been a large waste of time;
> everything has been said already many times in the last few years.
> As for memory-mapping string arrays, I have not seen a concrete use
> case in the mails beyond "would be nice to have", without any backing
> in actual code, though I may have missed it.
> In any case it is still irrelevant. My proposal only _adds_ additional
> cases that can be mmapped. It does not prevent you from doing what you
> have been doing before.
> 
> > 
> > > Having a new dtype changes nothing here. You still need to create
> > > numpy arrays from python strings, which are well defined and clean.
> > > If you put something in that doesn't encode, you get an encoding
> > > error.
> > > No oddities like surrogate escapes are needed; numpy arrays are not
> > > interfaces to operating systems, nor does numpy need to _add_
> > > support for historical oddities beyond what it already has.
> > > If you want to represent bytes exactly as they came in, don't use a
> > > text dtype (which includes the S dtype); use i1 instead.
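> > > For example, roughly (the file name is just a placeholder):
> > > 
> > >     import numpy as np
> > >     # read the exact bytes with no text semantics attached
> > >     raw = np.fromfile('weird_catalog.dat', dtype='i1')
> > >     # reinterpret as fixed 8-byte records only once you know what
> > >     # the data is (assumes the file size is a multiple of 8 bytes)
> > >     as_bytes = raw.view('S8')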
> > 
> > Thomas Aldcroft has demonstrated the problem with this approach.
> > numpy arrays are often interfaces to files that have tons of
> > historical oddities.
> 
> This does not matter for numpy; the text dtype is well defined as
> bytes with a specific encoding and null padding. If you have a
> historical oddity that does not fit, do not use the text dtype but a
> pure byte array instead.
> 
> > 
> > > Concerning variable-sized strings, this is simply not going to
> > > happen. Nobody is going to rewrite numpy to support it, especially
> > > not just for something as unimportant as strings.
> > > The best you are going to get (or rather, already have) is object
> > > arrays. It makes no sense to discuss it unless someone comes up
> > > with an actual proposal and the willingness to code it.
> > 
> > No one has suggested such a thing. At most, we've talked about
> > specializing object arrays.
> > 
> > > What is a relevant discussion is whether we really need a more
> > > compact but limited representation of text than 4-byte utf32 at
> > > all.
> > > Its use case is for the most part just python3 porting and saving
> > > some memory in some ascii-heavy cases, e.g. astronomy.
> > > It is not that significant anymore, as porting to python3 has
> > > mostly already happened via the ugly byte workaround, and memory
> > > saving is probably not that significant in the context of numpy,
> > > which is already heavy on memory usage.
> > > 
> > > My initial approach was to not add a new dtype but to make unicode
> > > parametrizable, which would have meant almost no cluttering of
> > > numpy's internals and would have kept the api more or less
> > > consistent, making this a relatively simple addition of minor
> > > functionality for the people that want it.
> > > But adding a completely new, partially redundant dtype for this use
> > > case may be too large a change to the api. Having two partially
> > > redundant string types may confuse users more than our current
> > > status quo of a single string type (U).
> > > 
> > > Discussing whether we want to support truncated utf8 has some
> > > merit, as it is a decision whether to give users an even larger gun
> > > to shoot themselves in the foot with.
> > > But I'd like to focus first on the 1-byte type, to add a symmetric
> > > API for python2 and python3.
> > > utf8 can always be added later should we deem it a good idea.
> > 
> > What is your current proposal? A string dtype parameterized with the
> > encoding (initially supporting the latin-1 that you desire and maybe
> > adding utf-8 later)? Or a latin-1-specific dtype such that we will
> > have to add a second utf-8 dtype at a later date?
> 
> My proposal is a single new parameterizable dtype. Adding one dtype
> per encoding seems unnecessary to me given that numpy already supports
> parameterizable types.
> Datetime is very similar, for example: it is basically encoded
> integers, with multiple supported encodings (the units).
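> Roughly, by analogy (the string dtype spelling below is hypothetical,
> nothing is decided):
> 
>     import numpy as np
>     # datetime64 is a single dtype parameterized by its unit
>     a = np.array(['2017-04-26'], dtype='datetime64[D]')
>     b = np.array(['2017-04-26T14:38'], dtype='datetime64[m]')
>     # a parameterized text dtype could work the same way, e.g.
>     # np.array(['abc'], dtype='S8[latin1]')  # hypothetical spelling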
> 
> > 
> > If you're not going to support arbitrary encodings right off the
> > bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape
> > first, as they seem to knock off more use cases straight away.
> > 
> 
> 
> Please list the use cases in the context of numpy usage. hdf5 is the
> most obvious, but how exactly would hdf5 use a utf8 array in the
> actual implementation?
> 
> What you save by having utf8 in the numpy array is replacing a
> decode/encode step with a null-padding-stripping step.
> That doesn't seem very worthwhile compared to all the other overheads
> involved.
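> Roughly, per element the difference is only something like this
> (pure-python sketch, not how hdf5 actually implements it):
> 
>     raw = b'caf\xc3\xa9\x00\x00\x00'  # fixed-width utf8 field from a file
>     # with the U dtype: strip padding and decode to UCS4
>     # (and re-encode on the way back out)
>     as_ucs4 = raw.rstrip(b'\x00').decode('utf-8')
>     # with a utf8 dtype: just strip the null padding, bytes stay as-is
>     as_utf8 = raw.rstrip(b'\x00')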

I remember talking with a colleague about something like that. The
annoying thing there was that if you strip the zero bytes from a
zero-padded string, some encodings (UTF-16) may need one of those zero
bytes to work right. (I think she got around it by weird trickery,
inverting the endianness or so and thus putting the zero bytes first.)
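To illustrate the problem (just a quick sketch of what I mean):

    s = 'abc'
    le = s.encode('utf-16-le')  # b'a\x00b\x00c\x00'
    try:
        le.rstrip(b'\x00').decode('utf-16-le')
    except UnicodeDecodeError:
        print('stripping broke the last code unit')
    be = s.encode('utf-16-be')  # b'\x00a\x00b\x00c'
    print(be.rstrip(b'\x00').decode('utf-16-be'))  # fine, trailing byte not zero
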
Maybe I will ask her whether this discussion is interesting to her.
Though I think it might have been something like "make everything in
hdf5/something similar work" without any actual use case, I don't know.

I have not read the whole thread, but I think a fixed-byte type with a
settable encoding would make sense. I personally wonder whether storing
the length might make sense too, even if that removes direct memory
mapping; but as you said, you can still memmap the bytes and then
probably just cast back and forth.
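Something along these lines, I imagine (the file name and the widths
are made up):

    import numpy as np
    # memmap the raw fixed-width bytes, convert to and from text on demand
    raw = np.memmap('strings.bin', dtype='S16', mode='r')
    text = np.char.decode(raw, 'latin1')   # -> U dtype array
    back = np.char.encode(text, 'latin1')  # -> S dtype array again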

Sorry if there is zero actual input here :)

- Sebastian


