[Numpy-discussion] Setting custom dtypes and 1.14
josef.pktd at gmail.com
Tue Jan 30 22:09:11 EST 2018
On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhaldane at gmail.com>
wrote:
> On 01/30/2018 04:54 PM, josef.pktd at gmail.com wrote:
> >
> >
> > On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane at gmail.com>
> > wrote:
> >
> > On 01/30/2018 01:33 PM, josef.pktd at gmail.com wrote:
> > > AFAICS, one problem is that the padded view didn't come with the
> > > matching downstream usage support: the pack function as mentioned,
> > > an alternative way to convert to a standard ndarray, copy doesn't
> > > get rid of the padding, and so on.
> > >
> > > e.g. another mailing list thread I just found with the same problem
> > > http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
> > >
> > > quoting Ralf:
> > > Question: is that really the recommended way to get an (N, 2) size
> > > float array from two columns of a larger record array? If so, why
> > > isn't there a better way? If you'd want to write to that (N, 2)
> > > array you have to append a copy, making it even uglier. Also, then
> > > there really should be tests for views in test_records.py.
> > >
> > >
> > > This "better way" never showed up, AFAIK. And it looks like we
> > > came back to this problem every few years.
> > >
> > > Josef
> >
> > Since we are at least pushing off this change to a later release
> > (1.15?), we have some time to prepare/catch up.
> >
> > What can we add to numpy.lib.recfunctions to make the multi-field
> > copy->view change smoother? We have discussed at least two functions:
> >
> > * repack_fields - rearrange the memory layout of a structured array
> > to add/remove padding between fields
> >
> > * structured_to_unstructured - turns an n-D structured array into an
> > (n+1)-D unstructured ndarray, whose dtype is the highest common type
> > of all the fields. May want the inverse function too.
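
A minimal sketch of what the repack_fields idea looks like in practice, assuming a NumPy release that ships `numpy.lib.recfunctions.repack_fields` and in which multi-field indexing returns a padded view. The array `a` is the one from the example later in this thread; the names `sub`, `packed`, and `flat` are only illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Multi-field selection. Under the view semantics discussed in this thread,
# the result keeps the parent's memory layout, so the bytes of the dropped
# field 'a' remain as padding.
sub = a[['b', 'c']]

# repack_fields copies the data into a layout without padding between fields.
packed = rf.repack_fields(sub)
packed_itemsize = packed.dtype.itemsize  # two 'f8' fields, 8 bytes each

# The packed copy can be reinterpreted as plain float64 data.
flat = packed.view('f8').reshape(-1, 2)
```

The packed result is a copy, so writing to `flat` would not modify `a`.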
> >
> >
> > The only sticky point with statsmodels is to have an equivalent of
> > a[['b', 'c']].view(('f8', 2)).
> >
> > Highest common dtype might be object. The main use case for this is to
> > select some elements of a specific dtype and then use them as a
> > standard, homogeneous ndarray. In our case and other cases that I have
> > seen it is mainly to select a subset of the floating point numbers.
> > Another case of this might be to combine two strings into one,
> > a[['b', 'c']].view(('S8')), if b is S5 and c is S3, but I don't think
> > I used this in serious code.
>
> I implemented and put up a draft of these functions in
> https://github.com/numpy/numpy/pull/10411
Comments based on reading the last commit
>
>
> I think they satisfy all your cases: code like
>
> >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
> >>> a[['b', 'c']].view(('f8', 2))
>
> becomes:
>
> >>> import numpy.lib.recfunctions as rf
> >>> rf.structured_to_unstructured(a[['b', 'c']])
> array([[1., 1.],
>        [1., 1.],
>        [1., 1.]])
>
> The highest common dtype is usually not "Object", since I use
> `np.result_type` to determine the output type. So two fields of 'S5' and
> 'S3' result in an 'S5' array.
>
>
structured_to_unstructured looks good to me
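
For reference, a small sketch of the type promotion described above, assuming the functions are available as `numpy.lib.recfunctions.structured_to_unstructured` (as in the draft PR) and that the output dtype follows `np.result_type`. The array `mixed` and the other variable names are hypothetical, chosen only for illustration.

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Select two float fields and get a plain (N, 2) float array.
floats = rf.structured_to_unstructured(a[['b', 'c']])

# Fields of different numeric types are promoted to a common type.
mixed = np.zeros(2, dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
promoted = rf.structured_to_unstructured(mixed)

# The promotion rule for two byte-string fields, checked directly.
string_type = np.result_type(np.dtype('S5'), np.dtype('S3'))
```

Here `floats` and `promoted` both come out as float64, and `string_type` is the longer of the two string types rather than object.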
>
> >
> > for inverse function: I guess it is still possible to view any standard
> > homogeneous ndarray with a structured dtype as long as the itemsize
> > matches.
>
> The inverse is implemented too. And it even supports varied field
> dtypes, nested fields, and subarrays, as you can see in the docstring
> examples.
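
A short round-trip sketch of the inverse, assuming it is exposed as `numpy.lib.recfunctions.unstructured_to_structured` with a `dtype` keyword. The field names 'b' and 'c' and the sample data are made up for illustration.

```python
import numpy as np
import numpy.lib.recfunctions as rf

# A plain (N, 2) float array, one column per future field.
arr = np.arange(6, dtype='f8').reshape(3, 2)

# Target structured dtype: the last axis of `arr` is split into fields.
dt = np.dtype([('b', 'f8'), ('c', 'f8')])
s = rf.unstructured_to_structured(arr, dtype=dt)

# Converting back should recover the original unstructured array.
back = rf.structured_to_unstructured(s)
```

After the conversion, `s['b']` holds the first column of `arr` and `s['c']` the second.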
>
>
> > Browsing through old mailing list threads, I saw that adding multiple
> > fields or concatenating two arrays with structured dtypes into an array
> > with a single combined dtype was missing and I guess still is. (IIRC
> > this is the use case where we now take the pandas detour in statsmodels.)
> >
> > We might also consider
> >
> > * apply_along_fields(arr, method) - applies the method along the
> > "field" axis, equivalent to something like
> > method(structured_to_unstructured(arr), axis=-1)
> >
> >
> > If this works on a padded view of an existing array, then this would be
> > an improvement over the current version of having to extract and copy
> > the relevant fields of an existing structured dtype or loop over
> > different numeric dtypes, ints, floats.
> >
> > In general there will need to be a way to apply `method` only to
> > selected columns, or columns of a matching dtype. (e.g. We don't want
> > the sum or mean of a string.)
> > (e.g. we use ptp() on numeric fields to check if there is already a
> > constant column in the array or dataframe)
>
> Means over selected columns are accounted for using multi-field
> indexing. For example:
>
> >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
> ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
>
> >>> rf.apply_along_fields(np.mean, b)
> array([ 2.66666667, 5.33333333, 8.66666667, 11. ])
>
> >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
> array([ 3. , 5.5, 9. , 11. ])
>
Actually, I would have expected apply_along_columns, i.e. a reduction over
all observations for each field.
This might need an axis argument.
However, in its current form it is less practical than doing it ourselves
with structured_to_unstructured, because every call makes a copy of all
elements.
e.g.
rf.apply_along_fields(np.mean, b[['x', 'z']])
rf.apply_along_fields(np.std, b[['x', 'z']])
would do the same structured_to_unstructured copy of all array elements
twice.
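
A sketch of the do-it-ourselves alternative, assuming `structured_to_unstructured` from the draft PR: convert the selected fields once, then reuse the plain array for several reductions along either axis. The array `b` is the one from the example above; the other variable names are illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# One conversion (and one copy) of the selected fields, shape (4, 2).
xz = rf.structured_to_unstructured(b[['x', 'z']])

# Reduce over observations, one value per field ("along columns").
col_mean = xz.mean(axis=0)
col_std = xz.std(axis=0)
col_ptp = np.ptp(xz, axis=0)  # range per field, e.g. to spot a constant column

# Reduce over fields, one value per observation.
row_mean = xz.mean(axis=-1)
```

`row_mean` corresponds to the apply_along_fields(np.mean, ...) result quoted above, while the `axis=0` reductions give the per-field statistics.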
Josef
>
>
> This is unaffected by the 1.14 to 1.15 changes.
>
> Allan
>
> >
> >
> >
> >
> >
> > I think these are pretty minimal and shouldn't be too hard to
> > implement.
> >
> >
> > AFAICS, it would cover the statsmodels usage.
> >
> >
> > Josef
> >
> >
> >
> >
> > Allan
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> > <https://mail.python.org/mailman/listinfo/numpy-discussion>