[Numpy-discussion] when did column_stack become C-contiguous?

Mon Oct 19 21:51:10 EDT 2015

On Mon, Oct 19, 2015 at 9:15 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Mon, Oct 19, 2015 at 5:55 AM,  <josef.pktd at gmail.com> wrote:
> >
> >
> > On Mon, Oct 19, 2015 at 2:14 AM, Nathaniel Smith <njs at pobox.com> wrote:
> >>
> >> On Sun, Oct 18, 2015 at 9:35 PM,  <josef.pktd at gmail.com> wrote:
> >> >>>> np.column_stack((np.ones(10), np.ones(10))).flags
> >> >   C_CONTIGUOUS : True
> >> >   F_CONTIGUOUS : False
> >> >
> >> >>>> np.__version__
> >> > '1.9.2rc1'
> >> >
> >> >
> >> > on my notebook which has numpy 1.6.1 it is f_contiguous
> >> >
> >> >
> >> > I was just trying to optimize a loop over variable adjustment in
> >> > regression,
> >> > and found out that we lost fortran contiguity.
> >> >
> >> > I always thought column_stack is for fortran usage (linalg)
> >> >
> >> > What's the alternative?
> >> > column_stack was one of my favorite commands, and I always assumed we
> >> > have
> >> > in statsmodels the right memory layout to call the linalg libraries.
> >> >
> >> > ("assumed" means we don't have timing nor unit tests for it.)
> >>
> >> In general practice no numpy functions make any guarantee about memory
> >> layout, unless that's explicitly a documented part of their contract
> >> (e.g. 'ascontiguous', or some functions that take an order= argument
> >> -- I say "some" b/c there are functions like 'reshape' that take an
> >> argument called order= that doesn't actually refer to memory layout).
> >> This isn't so much an official policy as just a fact of life -- if
> >> no-one has any idea that the someone is depending on some memory
> >> layout detail then there's no way to realize that we've broken
> >> something. (But it is a good policy IMO.)
> >
> >
> > I understand that in general.
> >
> > However, I always thought column_stack is a array creation function which
> > have guaranteed memory layout. And since it's stacking by columns I
> thought
> > that order is always Fortran.
> > And the fact that it doesn't have an order keyword yet, I thought is
> just a
> > missing extension.
>
> I guess I don't know what to say except that I'm sorry to hear that
> and sorry that no-one noticed until several releases later.
>

Were there more contiguity changes in 0.10?
I just saw a large number of test errors and failures in statespace models
which are heavily based on cython code where it's not just a question of
performance.

I don't know yet what's going on, but I just saw that we have some explicit
tests for fortran contiguity which just started to fail.

>
> >> If this kind of problem gets caught during a pre-release cycle then we
> >> generally do try to fix it, because we try not to break code, but if
> >> it's been broken for 2 full releases then there's no much we can do --
> >> we can't go back in time to fix it so it sounds like you're stuck
> >> working around the problem no matter what (unless you want to refuse
> >> to support 1.9.0 through 1.10.1, which I assume you don't... worst
> >> case, you just have to do a global search replace of np.column_stack
> >> with statsmodels.utils.column_stack_f, right?).
> >>
> >> And the regression issue seems like the only real argument for
> >> changing it back -- we'd never guarantee f-contiguity here if starting
> >> from a blank slate, I think?
> >
> >
> > When the cat is out of the bag, the down stream developer writes
> > compatibility code or helper functions.
> >
> > I will do that at at least the parts I know are intentionally designed
> for F
> > memory order.
> >
> > ---
> >
> > statsmodels doesn't really check or consistently optimize the memory
> order,
> > except in some cython functions.
> > But, I thought we should be doing quite well with getting Fortran ordered
> > arrays. I only paid attention where we have more extensive loops
> internally.
> >
> > Nathniel, Does patsy guarantee memory layout (F-contiguous) when creating
> > design matrices?
>
> I never thought about it :-). So: no, it looks like right now patsy
> usually returns C-order matrices (or really, whatever np.empty or
> np.repeat returns), and there aren't any particular guarantees that
> this will continue to be the case in the future.
>
> Is returning matrices in F-contiguous layout really important? Should
> there be a return_type="fortran_matrix" option or something like that?
>

I don't know, yet. My intuition was that it would be better because we feed
the arrays directly to pinv/SVD or QR which, I think, require by default
Fortran contiguous.

However, my intuition might not be correct, and it might not make much
difference in a single OLS estimation.

There are a few critical loops in variable selection that I'm planning to
investigate to see how much it matters.
Memory optimization was never high in our priority compared to expanding
the functionality overall, but reading the Julia mailing list is starting
to worry me a bit. :)

(I'm even starting to see the reason for multiple dispatch.)

Josef

>
> -n
>
> --
> Nathaniel J. Smith -- http://vorpus.org
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151019/92903085/attachment.html>