[Numpy-discussion] String type again.

Julian Taylor jtaylor.debian at googlemail.com
Fri Jul 18 11:10:53 EDT 2014


On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>> Thinking more about it, the easiest thing to do might be to make the S dtype
>> a UTF-8 encoding. Most of the machinery to deal with that is already in
>> place. That change might affect some users though, and we might need to do
>> some work to make it backwards compatible with python 2.
>
> I'd be very concerned about backcompat for existing code that uses
> e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is
> this file format reading code:
>    https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123
> The file format says there are 128 bytes there, and their
> interpretation depends on other fields in the header -- but in one
> case, for "large montages", there's an encoding where every 3 bytes
> represents 4 characters using an ad hoc 6-bit character set:
>    https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
>
> Perhaps this case could be handled better by using a u8 subarray or
> something (that code also goes to some efforts to work around nul
> padding), and that particular project hasn't been ported to py3 yet so
> technically wouldn't be affected if we changed the meaning of "S" on
> py3. But it does seem useful to have a "fixed length bytes" dtype even
> in py3, and if we declare that to be "S" then it avoids breaking any
> existing code depending on it...
>

We break code either way.
Either we break applications that use S as a string type, because it
is now bytes in python3,
or we break applications that treat S as a bytes type, by changing it
to a string in python3.
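
To make the two cases concrete, here is a minimal sketch, assuming
Python 3 and a current NumPy. The array names and contents are made up
for illustration, and the "u8 subarray" from the quoted mail is read
here as unsigned 8-bit integers (np.uint8), which is my assumption.

```python
import numpy as np

# Case 1: code that uses "S" for text. On Python 3 the elements come
# back as bytes, so using them as str needs an explicit decode.
labels = np.array([b"Cz", b"Fz"], dtype="S2")   # hypothetical labels
first = labels[0]
first_is_bytes = isinstance(first, bytes)       # bytes on Python 3
first_is_str = isinstance(first, str)           # not str on Python 3
first_as_text = first.decode("ascii")

# Case 2: code that uses "S" for raw bytes, such as a fixed-size header
# field. Element access drops trailing NUL bytes, while a uint8
# subarray field (assumed reading of "u8") keeps every byte.
raw = b"ab\x00\x00"                             # stand-in for a header field
as_s = np.frombuffer(raw, dtype="S4")[0]
as_u8 = np.frombuffer(raw, dtype=[("blob", np.uint8, (4,))])["blob"][0]
round_trip = as_u8.tobytes()
```

Changing what S means affects one of these two usages no matter which
way it goes.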

Unfortunately, when adding python3 support we missed the opportunity to
fix the exact same bytes/text boundary issue that is the main reason
python3 exists in the first place.
We should have made porting to numpy3 an intentionally(!) backward
incompatible change, just like python itself did.

Now we are stuck with deciding which option breaks less.
On the one hand, that S is bytes in python3 is somewhat established by
now and lots of workarounds are already in place.
On the other hand, I think code that relies on S being bytes is in the
minority and python3 usage is probably still insignificant in this
area. Unfortunately getting actual numbers and not wild guesses on
this is probably not easy.
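
As an illustration of the kind of workaround meant above (my assumption
about typical code, not a survey of real projects): Python 3 code that
wants text out of an S array usually decodes it explicitly, for example
with np.char.decode. The array contents are made up.

```python
import numpy as np

stored = np.array([b"alpha", b"beta"], dtype="S5")   # bytes on Python 3

# Explicitly decode to a unicode ("U") array before treating it as text.
text = np.char.decode(stored, "utf-8")
kind = text.dtype.kind
matches = (text == "beta").tolist()
```

Code written this way already assumes S holds bytes, so it is this
kind of code that a change of S to a string type in python3 would hit.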


