[Numpy-discussion] proposal: smaller representation of string arrays

Charles R Harris charlesr.harris at gmail.com
Wed Apr 26 11:19:13 EDT 2017


On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:

> On 26.04.2017 03:55, josef.pktd at gmail.com wrote:
> > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
> > <charlesr.harris at gmail.com> wrote:
> >>
> >>
> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.kern at gmail.com> wrote:
> >>>
> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
> >>> <chris.barker at noaa.gov> wrote:
> >>>
> >>>>> Presumably you're getting byte strings (with unknown encoding).
> >>>>
> >>>> No -- this is for creating and using mostly ascii string data with
> >>>> python and numpy.
> >>>>
> >>>> Unknown encoding bytes belong in byte arrays -- they are not text.
> >>>
> >>> You are welcome to try to convince Thomas of that. That is the status
> >>> quo for him, but he is finding that difficult to work with.
> >>>
> >>>> I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii,
> >>>> with a few extra characters" data. With all the sloppiness over the
> >>>> years, there are way too many files like that.
> >>>
> >>> That sloppiness that you mention is precisely the "unknown encoding"
> >>> problem. Your previous advocacy has also touched on using latin-1 to
> >>> decode existing files with unknown encodings as well. If you want to
> >>> advocate for using latin-1 only for the creation of new data, maybe
> >>> stop talking about existing files? :-)
> >>>
> >>>> Note: the primary use-case I have in mind is working with ascii text
> >>>> in numpy arrays efficiently -- folks have called for that. All I'm
> >>>> saying is use Latin-1 instead of ascii -- that buys you some useful
> >>>> extra characters.
> >>>
> >>> For that use case, the alternative in play isn't ASCII, it's UTF-8,
> >>> which buys you a whole bunch of useful extra characters. ;-)
> >>>
> >>> There are several use cases being brought forth here. Some involve
> >>> file reading, some involve file writing, and some involve in-memory
> >>> manipulation. Whatever change we make is going to impinge somehow on
> >>> all of the use cases. If all we do is add a latin-1 dtype for people
> >>> to use to create new in-memory data, then someone is going to use it
> >>> to read existing data in unknown or ambiguous encodings.
> >>
> >>
> >>
> >> The maximum length of a UTF-8 character is 4 bytes, so we could use
> >> that to size arrays by character length. The advantage over UTF-32 is
> >> that it is easily compressible, probably by a factor of 4 in many
> >> cases. That doesn't solve the in-memory problem, but it does have some
> >> advantages on disk as well as making for easy display. We could
> >> compress it ourselves after encoding by truncation.
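> >>
> >> A rough sketch of that sizing idea in pure python (the helper name
> >> here is just illustrative):
> >>
> >> import numpy as np
> >>
> >> def to_fixed_utf8(strings, n_chars):
> >>     # reserve the worst case of 4 bytes per character; numpy
> >>     # zero-pads the S dtype, which is what compresses so well
> >>     buf = np.zeros(len(strings), dtype='S%d' % (4 * n_chars))
> >>     for i, s in enumerate(strings):
> >>         # truncate to n_chars *characters* before encoding so a
> >>         # multi-byte sequence is never cut in half
> >>         buf[i] = s[:n_chars].encode('utf-8')
> >>     return buf
> >>
> >> arr = to_fixed_utf8([u'abc', u'\xe5\xf8\xf1'], 8)  # itemsize 32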
> >>
> >> Note that for terminal display we will want something supported by the
> >> system, which is another problem altogether. Let me break the problem
> >> down into four categories:
> >>
> >> Storage -- hdf5, .npy, fits, etc.
> >> Display -- ?
> >> Modification -- editing
> >> Parsing -- fits, etc.
> >>
> >> There is probably no one solution that is optimal for all of those.
> >>
> >> Chuck
> >
> > quoting Julian
> >
> > '''
> > I probably should have formulated my goal with the proposal a bit
> > better: I am not very interested in a repetition of the
> > which-encoding-to-use debate.
> > In the end, what will be done allows any encoding via a dtype with
> > metadata, like datetime.
> > This allows any codec (including truncated utf8) to be added easily (if
> > python supports it) and allows sidestepping the debate.
> >
> > My main concern is whether it should be a new dtype or a modification
> > of the unicode dtype, though the backward-compatibility argument is
> > strongly in favour of adding a new dtype that makes the np.unicode
> > type redundant.
> > '''
> >
> > I don't quite understand why this discussion is going in the direction
> > of either one dtype XOR the other.
> >
> > The parameterized 1-byte encoding that Julian initially mentioned
> > sounds useful to me.
> >
> > (I'm not sure I will use it much, but then I also don't use float16.)
> >
> > Josef
>
> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in-memory storage of strings, and in
> that it is limited to fixed-length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting them out
> yourself before, and you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses whichever one is
> best for the use case, and if the user screws it up, it is not numpy's
> problem.
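>
> Just to make the mechanism concrete: a codec tag can already be hung off
> a dtype via metadata today. This is only a sketch of the idea, not a
> proposed api, and numpy itself does not interpret the tag:
>
> import numpy as np
>
> # recent numpy accepts a metadata dict when constructing a dtype
> dt = np.dtype('S8', metadata={'encoding': 'latin-1'})
> a = np.array([u'abc'.encode('latin-1')], dtype=dt)
> print(a.dtype.metadata)  # {'encoding': 'latin-1'}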
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping, which I seriously doubt anyone actually does for the S
>   and U dtypes
>
> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings, which are well defined and clean.
> If you put in something that doesn't encode, you get an encoding error.
> No oddities like surrogate escapes are needed; numpy arrays are not
> interfaces to operating systems, nor does numpy need to _add_ support
> for historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in, don't use a text
> dtype (and that includes the S dtype); use i1.
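>
> Both behaviours can be seen today (on python3 the S dtype encodes text
> as ascii; this is just a sketch of the status quo):
>
> import numpy as np
>
> # text that does not encode fails loudly:
> try:
>     np.array([u'sp\xe4m'], dtype='S8')
> except UnicodeEncodeError as e:
>     print(e)  # 'ascii' codec can't encode character ...
>
> # raw bytes of unknown provenance belong in a plain byte array:
> raw = np.frombuffer(b'sp\xe4m', dtype='i1')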
>
> Concerning variable-sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> The best you are going to get (or better, already have) is object
> arrays. It makes no sense to discuss it unless someone comes up with an
> actual proposal and the willingness to code it.
>
>
> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its use case is for the most part just python3 porting and saving some
> memory in some ascii-heavy cases, e.g. astronomy.
> It is not that significant anymore, as porting to python3 has mostly
> already happened via the ugly byte workaround, and memory saving is
> probably not as significant in the context of numpy, which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable, which would have meant almost no cluttering of numpy's
> internals and would have kept the api more or less consistent, making
> this a relatively simple addition of minor functionality for people
> that want it.
> But adding a completely new, partially redundant dtype for this use
> case may be too large a change to the api. Having two partially
> redundant string types may confuse users more than our current status
> quo of a single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit, as
> it is a decision about whether to give users an even larger gun to
> shoot themselves in the foot with.
> But I'd like to focus first on the 1-byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added later, should we deem it a good idea.
>

I think we can implement viewers for strings as ndarray subclasses. Then
one could do `my_string_array.view(latin_1)`, and so on. Essentially that
just changes the default encoding of the 'S' array. That could also work
for uint8 arrays if needed.
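
A minimal sketch of such a viewer (the class name latin_1 is just
illustrative, not an existing numpy object):

import numpy as np

class latin_1(np.ndarray):
    # view an 'S' array, but decode scalars as latin-1 on access
    def __getitem__(self, index):
        item = super(latin_1, self).__getitem__(index)
        if isinstance(item, bytes):
            return item.decode('latin-1')
        return item

s = np.array([b'abc', b'\xe9t\xe9'], dtype='S3')
v = s.view(latin_1)
print(v[1])  # 'été'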

Chuck