[Numpy-discussion] String type again.

Fri Jul 18 11:10:53 EDT 2014

On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>> Thinking more about it, the easiest thing to do might be to make the S dtype
>> a UTF-8 encoding. Most of the machinery to deal with that is already in
>> place. That change might affect some users though, and we might need to do
>> some work to make it backwards compatible with python 2.
>
> I'd be very concerned about backcompat for existing code that uses
> e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is
> this file format reading code:
>    https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123
> The file format says there are 128 bytes there, and their
> interpretation depends on other fields in the header -- but in one
> case, for "large montages", there's an encoding where every 3 bytes
> represents 4 characters using an ad hoc 6-bit character set:
>    https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
>
> Perhaps this case could be handled better by using a u8 subarray or
> something (that code also goes to some efforts to work around nul
> padding), and that particular project hasn't been ported to py3 yet so
> technically wouldn't be affected if we changed the meaning of "S" on
> py3. But it does seem useful to have a "fixed length bytes" dtype even
> in py3, and if we declare that to be "S" then it avoids breaking any
> existing code depending on it...
>

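To make the nul padding point concrete, here is a minimal sketch of how the same bytes come back through an S field and through an unsigned 8-bit subarray. The 4-byte buffer and the field name "payload" are made up for illustration and have nothing to do with the rerpy code; np.uint8 is used because the dtype string "u8" actually means a 64-bit unsigned integer in numpy.

```python
import numpy as np

# Hypothetical 4-byte field: two data bytes followed by nul padding.
raw = b"ab\x00\x00"

# Viewed as a fixed-width S field, trailing nul bytes are dropped when
# the item is converted back to a Python object.
s_view = np.frombuffer(raw, dtype="S4")
s_items = s_view.tolist()

# Viewed as a subarray of unsigned 8-bit integers, every byte is kept.
# The field name "payload" is an assumption made for this sketch.
rec = np.dtype([("payload", np.uint8, (4,))])
u8_view = np.frombuffer(raw, dtype=rec)
u8_bytes = u8_view["payload"][0].tobytes()

print(s_items)
print(u8_bytes)
```

So a uint8 subarray keeps all the bytes of the field, while S silently strips trailing nuls when items are read back.
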
We break code either way.
Either we keep S as bytes in python3, which breaks applications that
use S as a string type,
or we change S to a string type in python3, which breaks applications
that treat S as a byte type.
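
As a rough illustration of the first kind of breakage, this is what code that uses S as a string type runs into on python3 with the current behavior. The array contents are made up, and the explicit decode step is just one common workaround, not a recommendation.

```python
import numpy as np

# ASCII text stored with the S dtype; the values are illustrative only.
names = np.array(["alpha", "beta"], dtype="S5")

# On python3 the items come back as bytes, not str.
items = names.tolist()

# One workaround: decode explicitly into a unicode (U) array.
text = np.char.decode(names, "ascii")

print(items)
print(text.dtype.kind, text.tolist())
```

Code written for python2 that expects str items from an S array has to add a decode step like this, or switch to the U dtype, when ported.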

Unfortunately, when adding python3 support we missed the opportunity to
fix the exact same bytes/text boundary issue that is the main
reason why python3 exists in the first place.
We should have made porting to numpy3 an intentionally(!) backward
incompatible change, just like python itself did.

Now we are stuck with deciding which option breaks less.
On the one hand, S being bytes in python3 is somewhat established by
now and lots of workarounds are already in place.
On the other hand, I think code that relies on S being bytes is in the
minority and python3 usage is probably still insignificant in this
area. Unfortunately, getting actual numbers rather than wild guesses on
this is probably not easy.