[scikit-learn] Why is subset invariance necessary for transform()?

Charles Pehlivanian pehlivaniancharles at gmail.com
Sat Jan 25 10:52:36 EST 2020


To summarize: for MDS and SpectralEmbedding, it looks like there is no
transform method that will satisfy both

  1. fit(X).transform(X) == fit_transform(X)
  2. transform(X)[i:i+1] == transform(X[i:i+1])

That's because the current fit_transform doesn't factor nicely into those
two steps. Its last step extracts the top eigenvectors of a modified Gram
matrix, and those depend on the whole batch at once.
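
For a concrete illustration (a minimal sketch using PCA, where both
properties hold; MDS and SpectralEmbedding expose no transform() at all,
which is exactly the gap here):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).rand(20, 5)
    pca = PCA(n_components=2).fit(X)

    # 1. fit(X).transform(X) agrees with fit_transform(X)
    assert np.allclose(pca.transform(X),
                       PCA(n_components=2).fit_transform(X))

    # 2. subset invariance: a row transformed alone matches the same
    #    row transformed as part of the batch
    assert np.allclose(pca.transform(X)[3:4], pca.transform(X[3:4]))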

For PCA, kernel PCA, and LLE, fit_transform is something like: center the
data, compute U, S, V = SVD(X), then project the data onto a submatrix of
V. That last step is a matrix multiplication, and the last step in the
corresponding transform methods is an np.dot(...). That factors nicely.
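
Roughly like this (a sketch of the PCA case, not the scikit-learn source):
fit() learns fixed parameters, and transform() acts on each row
independently, so subset invariance falls out for free:

    import numpy as np

    def fit(X, n_components):
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:n_components]           # fixed parameters

    def transform(X, mean, components):
        return np.dot(X - mean, components.T)    # row-wise matrix product

    X = np.random.RandomState(0).rand(20, 5)
    mean, components = fit(X, 2)
    assert np.allclose(transform(X, mean, components)[3:4],
                       transform(X[3:4], mean, components))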

There could be a transform_batch method for MDS that would satisfy 1.;
transform could then call transform_batch row-wise to satisfy 2. But no
single method will satisfy both; a sketch of the split is below.
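
Something like this (a hypothetical sketch; transform_batch and the
placement solver are assumptions, not existing scikit-learn API):

    import numpy as np

    class BatchMDS:
        def transform_batch(self, X):
            # Solve the full constrained placement problem for all of X
            # at once: optimal, but not subset-invariant.
            return self._solve_batch_placement(X)

        def transform(self, X):
            # Row-wise wrapper: subset-invariant by construction, but
            # each point is placed without regard to the others.
            return np.vstack([self.transform_batch(X[i:i + 1])
                              for i in range(X.shape[0])])

        def _solve_batch_placement(self, X):
            # Placeholder for the constrained least-squares placement
            # discussed below.
            raise NotImplementedError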

I don't know if there is appetite for that separation, or for the
modification of the unit tests it would involve.

Charles

On Tue, Jan 21, 2020 at 9:19 PM Charles Pehlivanian <
pehlivaniancharles at gmail.com> wrote:

>     This is what I thought we usually do. It looks like you said we are
> doing a greedy transform.
>     I'm not sure I follow that. In particular for spectral embedding for
> example there is a pretty way to describe
>     the transform and that's what we're doing. You could also look at
> doing transductive learning but that's
>     not really the standard formulation, is it?
>
> Batch transform becomes greedy if one does:
>
>         for i in range(X.shape[0]):
>             X_new[i] = self.transform(X[i:i + 1])
>
> I said that LLE uses a greedy algorithm. The algorithm implemented is
> pointwise. It may be that that's the only possible approach (in which case
> it's not greedy), but I don't think so: it looks like the spectral
> embedding, LLE, and MDS transforms all have batch versions. So I probably
> shouldn't call it greedy. Taking a *true* batch transform and enclosing it
> in a loop like the one above is what I'm calling greedy. I'm honestly not
> sure whether the LLE transform qualifies.
>
> Spectral embedding - agreed, the method you refer to is implemented in
> fit_transform(). How would you apply it to out-of-sample (oos) points?
>
> Non-distributable, non-subset-invariant, optimal batch transform
>     Can you give an example of that?
>
> Most of the manifold learners can be expressed as solutions to
> eigenvalue/eigenvector problems. For an MDS batch transform, form a new
> constrained double-centered distance matrix and solve a constrained
> least-squares problem that mimics the SVD solution to the eigenvalue
> problem. They're all like this: least-squares estimates for some
> constrained eigenvalue problem. The question is whether you want to solve
> the full problem, or solve it for each point, adding one row and
> optimizing each time, ... that latter approach would be subset-invariant,
> though.
>
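> For classical MDS specifically, the pointwise (subset-invariant) flavor
> can be sketched with a Gower / Nystrom-style extension. This is an
> illustration of the idea only - scikit-learn's MDS is SMACOF-based and
> has no transform():
>
>     import numpy as np
>     from scipy.spatial.distance import cdist
>
>     def fit_classical_mds(X, n_components):
>         D2 = cdist(X, X) ** 2
>         n = D2.shape[0]
>         J = np.eye(n) - np.ones((n, n)) / n
>         B = -0.5 * J @ D2 @ J              # double-centered Gram matrix
>         vals, vecs = np.linalg.eigh(B)
>         top = np.argsort(vals)[::-1][:n_components]
>         return D2, vals[top], vecs[:, top]
>
>     def transform_oos(X_train, D2, vals, vecs, X_new):
>         # Each new row depends only on its own distances to the training
>         # set plus fixed training statistics - hence subset invariance.
>         d2 = cdist(X_new, X_train) ** 2
>         b = -0.5 * (d2 - d2.mean(axis=1, keepdims=True)
>                     - D2.mean(axis=0) + D2.mean())
>         return b @ vecs / np.sqrt(vals)
>
>     X = np.random.RandomState(0).rand(30, 4)
>     D2, vals, vecs = fit_classical_mds(X, 2)
>     # On the training rows this reproduces the fitted embedding:
>     assert np.allclose(transform_oos(X, D2, vals, vecs, X),
>                        vecs * np.sqrt(vals))
>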
> For this offline/batch approach to an oos transform, the only way I see to
> make it pass tests is to enclose it in a loop as above. That's what I see
> at least.
>
>
> On Tue, Jan 21, 2020 at 8:35 PM Andreas Mueller <t3kcit at gmail.com> wrote:
>
>>
>>
>> On 1/21/20 8:23 PM, Charles Pehlivanian wrote:
>>
>> I understand - I'm somewhat conflating the idea of a data sample with the
>> test set. My view assumes there is a sample space of samples, which might
>> require rethinking the cross-validation setup...
>>
>> I also think that part of this relies on the distinction between online
>> and offline algorithms. For offline fits, a batch transform
>> (non-subset-invariant) is preferred. For a transformer that can only be
>> used in an online sense, or is primarily used that way, keep the
>> invariant.
>>
>>
>> I see 3 options here - all I can say is that I don't vote for the first
>>
>> + No transform method on the manifold learners, so no cross-validation
>>
>> This is what I thought we usually do. It looks like you said we are doing
>> a greedy transform.
>> I'm not sure I follow that. In particular for spectral embedding for
>> example there is a pretty way to describe
>> the transform and that's what we're doing. You could also look at doing
>> transductive learning but that's
>> not really the standard formulation, is it?
>>
>> + Pointwise, distributable, subset-invariant, suboptimal greedy transform
>>
>> + Non-distributable, non-subset-invariant, optimal batch transform
>>
>> Can you give an example of that?
>>
>> -Charles
>>
>> On Mon, Jan 20, 2020 at 21:24:52, Joel Nothman <joel.nothman at gmail.com> wrote:
>>
>> I think allowing subset invariance to not hold is making stronger
>> assumptions than we usually do about what it means to have a "test set".
>> Having a transformation like this that relies on test set statistics
>> implies that the test set is more than just selected samples, but rather
>> that a large collection of samples is available at one time, and that it is
>> in some sense sufficient or complete (no more samples are available that
>> would give a better fit). So in a predictive modelling context you might
>> have to set up your cross validation splits with this in mind.
>>
>> In terms of API, the subset invariance constraint allows us to assume that
>> the transformation can be distributed or parallelized over samples. I'm not
>> sure whether we have exploited that assumption within scikit-learn or
>> whether related projects do so.
>>
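>> Concretely, that assumption would license something like the following
>> (a sketch only; the joblib usage is illustrative, not a pointer to an
>> existing scikit-learn code path):
>>
>>     import numpy as np
>>     from joblib import Parallel, delayed
>>     from sklearn.decomposition import PCA
>>
>>     X = np.random.RandomState(0).rand(100, 5)
>>     pca = PCA(n_components=2).fit(X)
>>
>>     # Transform disjoint chunks independently and stack the results;
>>     # subset invariance guarantees this matches one big call.
>>     chunks = np.array_split(X, 4)
>>     parts = Parallel(n_jobs=2)(delayed(pca.transform)(c)
>>                                for c in chunks)
>>     assert np.allclose(np.vstack(parts), pca.transform(X))
>>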
>> I see the benefit of using such transformations in a prediction Pipeline,
>> and really appreciate this challenge to our assumptions of what "transform"
>> means.
>>
>> Joel
>>
>> On Tue, 21 Jan 2020, 11:50 am Charles Pehlivanian <pehlivaniancharles at gmail.com> wrote:
>>
>> > Not all data transformers have a transform method. For those that do,
>> > subset invariance is assumed, as expressed in
>> > check_methods_subset_invariance(). It must be the case that
>> > T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for
>> > classic projections - PCA, kernel PCA, etc. - but not for some manifold
>> > learning transformers - MDS, SpectralEmbedding, etc. For those, an
>> > optimal placement of the data in space is a constrained optimization
>> > that may take into account the centroid of the dataset, etc.
>> >
>> > The manifold learners have "batch" oos transform() methods that aren't
>> > implemented, and wouldn't pass that test. Instead, those that do have
>> > one - LocallyLinearEmbedding - use a pointwise version, essentially
>> > replacing a batch fit with a suboptimal greedy one [for
>> > LocallyLinearEmbedding]:
>> >
>> >     for i in range(X.shape[0]):
>> >         X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])
>> >
>> > Where to implement the batch transform() methods for MDS,
>> > SpectralEmbedding, LocallyLinearEmbedding, etc.?
>> >
>> > Another verb? Both batch and pointwise versions? The latter is easy to
>> > implement once the batch version exists. Relax the test conditions?
>> > transform() is necessary for oos testing, so necessary for cross
>> > validation. The batch versions should be preferred, although as it
>> > stands, the pointwise versions are.
>> >
>> > Thanks
>> > Charles Pehlivanian