[Numpy-discussion] Py3 merge

Mon Dec 7 11:54:11 EST 2009

Pauli Virtanen wrote:
> Mon, 2009-12-07 at 09:50 -0500, Michael Droettboom wrote:
>   
>> Pauli Virtanen wrote:
>>     
> [clip]
>   
>>> The character 'B' is already used by unsigned bytes -- I wonder if it's easy
>>> to support 'B123' and plain 'B' at the same time, or whether we have to
>>> pick a different letter for "byte strings". 'y' would be free...
>>>       
>> It seems to me the motivation to change the 'S' dtype to something else 
>> is to make things clearer with respect to the new conventions of Python 
>> 3.  (Where str -> bytes, and unicode -> str). In that sense, I'm not 
>> sure there's any advantage going from "S" to "y" (particularly without 
>> doing "U" to "S"), whereas there's a strong backward-compatibility 
>> advantage to keep it as "S", though admittedly it's confusing to someone 
>> who doesn't know the pre-Python 3 history.
>>     
>
> I think a better plan is to deprecate "U" instead of "S".
>
> Also, I'm not completely convinced that staying with "S" == bytes has a
> strong backward-compatibility advantage:
>
> 	array(['foo']).dtype == 'U'
>
> and this will break code in several places. Also, for instance,
>
> 	array(['foo', 'bar'], dtype='S3')
>
> will result in encoding errors. We probably don't want to start
> implicitly casting Unicode to bytes, since Py3 does not do that either.
> The only places where the dtype characters are used, AFAIK, are in repr
> and in the dtype kwarg -- they are not used in pickles etc.
>
> One can actually argue that changing 'U' to 'S' is more
> backward-compatible:
>
> 	array(['foo', 'bar'], dtype='S3')
>
> would still be valid code. Of course, the semantics change, but this
> occurs on the Python side anyway when moving to Py3.
>
> The simplest way to get more insight would be to try to convert some
> string-using Py2 code to work on Py3.
>   
Ok -- I think I can see that argument.  Our use case is to define 
structured arrays to read and write binary files, which means we will 
have to change our dtypes from 'S8' to 'B8' in this case, or risk having 
the fields be the wrong size.  It's very rare for our code to create 
arrays using string literals, so this problem hadn't occurred to me.  I 
think 'U' will have to change to 'S', and users defining structured 
arrays will just have to make this change.
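
To make the field-size concern concrete, here is a minimal sketch. It is 
written against present-day NumPy on Python 3, which postdates this thread 
and in which 'S' still means bytes and 'U' means Unicode, so it illustrates 
the risk rather than any of the renaming proposals above. The record layout 
and field names are hypothetical.

```python
import numpy as np

# Hypothetical on-disk record: an 8-byte name field followed by a float64.
record = np.dtype([("name", "S8"), ("value", "<f8")])

rows = np.array([(b"alpha", 1.5), (b"beta", 2.5)], dtype=record)
raw = rows.tobytes()                     # the bytes that would go to the file
back = np.frombuffer(raw, dtype=record)  # what a reader would reconstruct

# A byte-string field takes one byte per character, while a Unicode field
# takes four (UCS-4), so swapping one for the other changes the layout.
s8_size = np.dtype("S8").itemsize
u8_size = np.dtype("U8").itemsize
```

If the meaning of a dtype character changed underneath a definition like 
this, the per-record size would no longer match the file format.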
>   
>> I'm not sure your suggestion of making 'B' and 'B123' both work seems 
>> like a good one because of the semantic differences between numbers and 
>> strings. Would np.array(['a', 'b']) have a repr of [97, 98] or ['a', 
>> 'b']?  Sorting them would also not necessarily do the right thing.
>>     
>
> I think the point would be that 'B' and 'B1' would be treated as
> completely separate dtypes with different typenums -- they'd look
> similar only in the dtype character API (which is not so large) but not
> internally. np.array([b'a', b'b']).dtype would be 'B1'. Might be a bit
> confusing, though.
>   
I see.  I didn't quite understand what you were suggesting.  I suppose 
that's not a bad compromise.  Would the "kind" attribute be different 
between bytes and byte strings?  I worry about code that does something 
like:

  if x.dtype.kind == 'B':
    ...

...which is not great usage (issubclass(x.dtype.type, np.byte) would be 
better), but one sees it in user code in the wild (and even in Numpy 
itself) now and then.
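
As a rough sketch of the difference between the two styles of check, the 
following uses type-based tests instead of comparing dtype characters. It 
assumes present-day NumPy, where byte strings have kind 'S' and unsigned 
bytes have kind 'u' and character 'B'; the helper names are made up for this 
example.

```python
import numpy as np

def is_byte_string(arr):
    """Type-based check: True for fixed-width byte-string arrays."""
    return issubclass(arr.dtype.type, np.bytes_)

def is_unsigned_byte(arr):
    """Type-based check: True for arrays of unsigned 8-bit integers."""
    return issubclass(arr.dtype.type, np.uint8)

strings = np.array([b"a", b"b"])              # dtype 'S1'
numbers = np.array([97, 98], dtype=np.uint8)  # dtype character 'B'

# Character-based information, shown for comparison with the checks above.
string_kind = strings.dtype.kind
number_kind = numbers.dtype.kind
number_char = numbers.dtype.char
```

Checks written against the scalar type keep working if the letters are 
reassigned, whereas comparisons against a kind or character string would 
silently stop matching.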

Mike

-- 
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA