[scikit-learn] Any plans on generalizing Pipeline and transformers?

Manuel Castejón Limas manuel.castejon at gmail.com
Fri Dec 22 06:09:55 EST 2017

I'm currently thinking of a computational graph which can then be wrapped
as a pipeline-like object ... I'll try to make a toy example solving my

On 20 Dec 2017 16:33, "Manuel Castejón Limas" <manuel.castejon at gmail.com> wrote:

> Thank you all for your interest!
> To clarify the case, allow me to summarize the spirit of what I'd like to
> put into the pipeline with this sequence of steps:
> #%%
> import pandas as pd
> import numpy as np
> import matplotlib.pyplot as plt
> from sklearn.cluster import DBSCAN
> from sklearn.mixture import GaussianMixture
> from sklearn.model_selection import train_test_split
> np.random.seed(seed=42)
> """
> Data preparation
> """
> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
> data = pd.read_csv(URL, usecols=['V1','V2'])
> X, y = data[['V1']], data[['V2']]
> (data_train, data_test,
>  X_train, X_test,
>  y_train, y_test) = train_test_split(data, X, y)
> """
> Parameters setup
> """
> dbscan__eps = 0.06
> mclust__n_components = 3
> paella__noise_label = -1
> paella__max_it = 20
> paella__regular_size = 400
> paella__minimum_size = 100
> paella__width_r = 0.99
> paella__n_neighbors = 5
> paella__power = 30
> paella__random_state = None
> #%%
> """
> DBSCAN clustering to detect noise suspects (label == -1)
> """
> dbscan_input = data_train
> dbscan_clustering = DBSCAN(eps = dbscan__eps)
> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=np.int64(dbscan_output == -1))
> #%%
> """
> GaussianMixture fitted with filtered data_train in order to help locate
> the ellipsoids
> but predict is applied to the whole data_train set.
> """
> mclust_input = data_train[dbscan_output != -1]
> mclust_clustering = GaussianMixture(n_components = mclust__n_components)
> mclust_clustering.fit(mclust_input)
> mclust_output = mclust_clustering.predict(data_train)
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=mclust_output)
> #%%
> """
> mclust and dbscan results are combined.
> """
> clustering_output = mclust_output.copy()
> clustering_output[dbscan_output == -1] =  -1
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=clustering_output)
> #%%
> """
> Good old Paella paper:
> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
> The Paella algorithm calculates sample_weight to be used by the final step
> regressor
> (Yes, it is an outlier detection algorithm but we are focusing now on this
> interesting collateral result). I am currently aggressively changing the
> code in order to make it fit somehow with the pipeline
> """
> from paella import Paella
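> # clustering_output is a NumPy array computed on data_train, so wrap it in a
> # Series aligned with data_train first (the column name is an assumption).
> paella_input = pd.concat(
>     [data_train,
>      pd.Series(clustering_output, index=data_train.index, name="cluster")],
>     axis=1)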
> paella_run = Paella(noise_label = paella__noise_label,
>                     max_it = paella__max_it,
>                     regular_size = paella__regular_size,
>                     minimum_size = paella__minimum_size,
>                     width_r = paella__width_r,
>                     n_neighbors = paella__n_neighbors,
>                     power = paella__power,
>                     random_state = paella__random_state)
> paella_output = paella_run.fit_predict(paella_input, y_train)
> # paella_output is a vector with sample_weight
> #%%
> """
> Here we fit a regressor using sample_weight=paella_output
> """
> from sklearn.linear_model import LinearRegression
> regressor_input=X_train
> lm = LinearRegression()
> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
> regressor_output = lm.predict(X_train)
> #...
> In this example we can see that:
> - A particular step might need results produced not necessarily from the
> immediately previous step.
> - The X parameter is not sequentially transformed. Sometimes we might need
> to go back to the output of an earlier step.
> - y is sometimes the target and sometimes not. For the regressor it is
> indeed the target, but for the Paella algorithm the prediction is a vector
> of sample weights.
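
As a point of comparison, here is a minimal sketch of what the current Pipeline API does allow: weights computed outside the pipeline can be forwarded to one step with the `step__parameter` naming of `Pipeline.fit`. The synthetic data and the hand-made `sample_weight` vector are assumptions that only stand in for `paella_output`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Corrupt the last 20 targets, then give exactly those samples zero weight.
# These weights are a hand-made stand-in for paella_output (an assumption).
y[-20:] += 50.0
sample_weight = np.ones(200)
sample_weight[-20:] = 0.0

pipe = Pipeline([("scale", StandardScaler()), ("lm", LinearRegression())])
# "lm__sample_weight" routes the weights to the fit method of the "lm" step.
pipe.fit(X, y, lm__sample_weight=sample_weight)

# Change in the prediction when x goes from 0 to 1; close to the true slope
# because the corrupted samples carry no weight.
slope = (pipe.predict(np.array([[1.0]])) - pipe.predict(np.array([[0.0]])))[0]
```

The weights still have to be produced before `fit` is called, which is the limitation discussed here: a step inside the pipeline cannot hand them to a later step.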
> All in all the conclusion is that the chain of processes is not as linear
> as imposed by the current API. I guess that all these difficulties could be
> solved by:
> - Passing a dictionary through the different steps containing the partial
> results that the following steps will need.
> - As a Christmas gift :-) , a reference to the pipeline itself inserted
> in that dictionary could provide access to the internal state of the
> previous steps should it be needed.
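
The dictionary idea could be sketched roughly as follows. This is not an existing scikit-learn API: `run_steps`, the step functions, the dictionary keys and the synthetic data are all hypothetical names chosen for illustration. Each step reads whatever earlier results it needs from a shared dictionary and stores its own output there.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture


def run_steps(steps, context):
    """Run (name, function) pairs in order; each result is stored under its name."""
    for name, func in steps:
        context[name] = func(context)
    return context


def dbscan_step(ctx):
    # Labels equal to -1 mark the noise suspects.
    return DBSCAN(eps=0.3).fit_predict(ctx["data"])


def mclust_step(ctx):
    # Fit on the non-noise rows only, but predict on the whole data set.
    keep = ctx["dbscan"] != -1
    gm = GaussianMixture(n_components=2, random_state=0).fit(ctx["data"][keep])
    return gm.predict(ctx["data"])


def combine_step(ctx):
    # Uses the outputs of two earlier steps, not only the previous one.
    labels = ctx["mclust"].copy()
    labels[ctx["dbscan"] == -1] = -1
    return labels


# Synthetic data (an assumption): two tight blobs plus two isolated points.
rng = np.random.RandomState(0)
blobs = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
outliers = np.array([[10.0, 10.0], [-10.0, 5.0]])

context = run_steps(
    [("dbscan", dbscan_step), ("mclust", mclust_step), ("combine", combine_step)],
    {"data": np.vstack([blobs, outliers])},
)
```

After the run, `context["combine"]` holds the merged labels, and every intermediate result stays available to any later step under its own key.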
> Another interesting case study with similar needs would be a regressor
> using a previous clustering step in order to fit one model per cluster. In
> such a case, the clustering results would be needed during the fitting.
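
A minimal sketch of that second case, assuming a clusterer that offers `predict` (KMeans here) and a hypothetical wrapper class named `PerClusterRegressor`. Neither the class nor the synthetic data comes from scikit-learn or from the script above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


class PerClusterRegressor:
    """Hypothetical wrapper: cluster X, then fit one regressor per cluster."""

    def __init__(self, clusterer, regressor):
        self.clusterer = clusterer
        self.regressor = regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.clusterer_ = clone(self.clusterer).fit(X)
        labels = self.clusterer_.predict(X)
        # The clustering result is needed here, during fitting.
        self.models_ = {
            label: clone(self.regressor).fit(X[labels == label], y[labels == label])
            for label in np.unique(labels)
        }
        return self

    def predict(self, X):
        X = np.asarray(X)
        labels = self.clusterer_.predict(X)
        y_pred = np.empty(X.shape[0])
        for label, model in self.models_.items():
            mask = labels == label
            if mask.any():
                y_pred[mask] = model.predict(X[mask])
        return y_pred


# Synthetic data (an assumption): two separated groups with different slopes.
x_low = np.linspace(0.0, 1.0, 30)
x_high = np.linspace(10.0, 11.0, 30)
X = np.concatenate([x_low, x_high]).reshape(-1, 1)
y = np.concatenate([2.0 * x_low, -3.0 * x_high + 50.0])

model = PerClusterRegressor(
    KMeans(n_clusters=2, n_init=10, random_state=0), LinearRegression()
).fit(X, y)
pred = model.predict(np.array([[0.5], [10.5]]))
```

Each query point is routed to the regressor of the cluster it falls into, so the two groups are predicted with their own slopes.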
> Thanks for your interest!
> Manolo