[Numpy-discussion] proposal: smaller representation of string arrays

Neal Becker ndbecker2 at gmail.com
Thu Apr 20 13:36:42 EDT 2017


I'm no unicode expert, but can't we truncate unicode strings so that only
valid characters are included?
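
Something like that could be done, e.g. a rough sketch in plain python
(not tied to any particular new dtype): encode, cut at the byte budget,
and let errors='ignore' drop any partial code point left at the end:

    def truncate_utf8(s, nbytes):
        # keep at most nbytes of utf-8; a multi-byte character split by
        # the cut is discarded rather than left behind as invalid bytes
        return s.encode('utf-8')[:nbytes].decode('utf-8', errors='ignore')

    truncate_utf8('café', 4)   # -> 'caf' ('é' is 2 bytes and got split)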

On Thu, Apr 20, 2017 at 1:32 PM Chris Barker <chris.barker at noaa.gov> wrote:

> On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald
> <peridot.faceted at gmail.com> wrote:
>
>> Is there any reason not to support all Unicode encodings that python
>> does, with the same names and semantics? This would surely be the simplest
>> to understand.
>>
>
> I think it should support all fixed-length encodings, but not the
> non-fixed length ones -- they just don't fit well into the numpy data model.
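>
> (For scale, the current fixed-width dtypes: 'U' is UCS-4, four bytes per
> character, and 'S' is one byte per character --
>
> >>> import numpy as np
> >>> np.array(['abc'], dtype='U3').itemsize
> 12
> >>> np.array([b'abc'], dtype='S3').itemsize
> 3
>
> -- which is exactly why a one-byte-per-character encoding is appealing.)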
>
>
>> Also, if latin1 is going to be the only practical 8-bit encoding,
>> maybe check with some non-Western users to make sure it's not going to
>> wreck their lives? I'd have selected ASCII as an encoding to treat
>> specially, if any, because Unicode already does that and the consequences
>> are familiar. (I'm used to writing and reading French without accents
>> because it's passed through ASCII, for example.)
>>
>
> latin-1 (or latin-9) only makes things better than ASCII -- it buys most
> of the accented characters for the European languages and some symbols
> that are nice to have (I use the degree symbol a lot...). And it is
> ASCII-compatible -- so there is NO reason to choose ASCII over Latin-*.
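>
> For example, latin-1 gets you the degree symbol in one byte, and ascii
> bytes mean the same thing in both:
>
> >>> '10°'.encode('latin-1')
> b'10\xb0'
> >>> b'abc'.decode('latin-1') == b'abc'.decode('ascii')
> True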
>
> Which does no good for non-latin languages -- so we need to hear from the
> community: is there a substantial demand for a non-latin
> one-byte-per-character encoding?
>
>
>> Variable-length encodings, of which UTF-8 is obviously the one that makes
>> good handling essential, are indeed more complicated. But is it strictly
>> necessary that string arrays hold fixed-length *strings*, or can the
>> encoding length be fixed instead? That is, currently if you try to assign a
>> longer string than will fit, the string is truncated to the number of
>> characters in the data type.
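>>
>> (That is what the current 'U' dtype does on assignment, e.g.:
>>
>> >>> import numpy as np
>> >>> a = np.zeros(1, dtype='U5')
>> >>> a[0] = 'hello world'
>> >>> str(a[0])
>> 'hello'
>>
>> -- five characters kept, the rest silently dropped.)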
>>
>
> we could do that, yes, but an improperly truncated "string" becomes
> invalid -- it just seems like a recipe for bugs that won't be found in
> testing.
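>
> For instance, a naive byte-level cut of utf-8 leaves bytes that won't
> even decode:
>
> >>> 'café'.encode('utf-8')[:4]
> b'caf\xc3'
> >>> b'caf\xc3'.decode('utf-8')
> Traceback (most recent call last):
>   ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3:
> unexpected end of data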
>
> memory is cheap, compression is fast -- we really shouldn't get hung up
> on this!
>
> Note: if you are storing a LOT of text (though I have no idea why you
> would use numpy for that anyway), then the memory size might matter, but
> then semi-arbitrary truncation would probably matter, too.
>
> I expect most text storage in numpy arrays is things like names of
> datasets, ids, etc. -- not massive amounts of text -- so storage space
> really isn't critical. But having an id or something unexpectedly
> truncated could be bad.
>
> I think practical experience has shown us that people do not handle
> "mostly fixed length but once in a while not" text well -- see the
> nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more
> often than utf-16, so errors are far more likely to be found in tests
> (why would you use utf-8 if all your data are in ascii?). But still --
> why invite hard-to-test-for errors?
>
> Final point -- as Julian suggests, one reason to support utf-8 is for
> interoperability with other systems. But that makes errors more of an
> issue: if data doesn't pass through the numpy truncation machinery,
> invalid data could easily end up in a numpy array.
>
> -CHB
>
>> it would allow UTF-8 to be used just the way it usually is -- as an
>> encoding that's almost 8-bit.
>>
>
> ouch! that perception is the route to way too many errors! it is by no
> means almost 8-bit, unless your data are almost ascii -- in which case, use
> latin-1 for pity's sake!
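>
> For example:
>
> >>> len('café'), len('café'.encode('utf-8'))
> (4, 5)
> >>> len('日本語'), len('日本語'.encode('utf-8'))
> (3, 9)
>
> one byte per character only holds as long as you stay in ascii.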
>
> This highlights my point though -- if we support UTF-8, people WILL use
> it, and only test it with mostly-ascii text, and not find the bugs that
> will crop up later.
>
> All this said, it seems to me that the important use cases for string
>> arrays involve interaction with existing binary formats, so people who have
>> to deal with such data should have the final say. (My own closest approach
>> to this is the FITS format, which is restricted by the standard to ASCII.)
>>
>
> yup -- not sure we'll get much guidance here, though -- netcdf does not
> solve this problem well, either.
>
> But if you are pulling, say, a utf-8 encoded string out of a netcdf file,
> it's probably better to pull it out as bytes and pass it through the
> python decoding/encoding machinery than to paste the bytes straight into
> a numpy array and hope that the encoding and truncation are correct.
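>
> Something along these lines (np.char.decode does exist for the decode
> step; the byte values here are just an illustration):
>
> >>> import numpy as np
> >>> raw = np.array([b'caf\xc3\xa9'], dtype='S5')  # bytes as read from file
> >>> np.char.decode(raw, 'utf-8')
> array(['café'], dtype='<U4')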
>
> -CHB
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov