[SciPy-User] interpolation question

josef.pktd at gmail.com
Wed Sep 23 16:42:33 EDT 2009


On Wed, Sep 23, 2009 at 12:10 PM, Oosthoek, J.H.P. (Jelmer)
<jelmer.oosthoek at tno.nl> wrote:
> Dear Zach and Josef,
>
> Thanks for your help! I think this will be more than enough for me to get
> this working :)
>
> Thanks!
>
> Jelmer
> ________________________________
> From: josef.pktd at gmail.com
> Sent: Wed 9/23/2009 16:17
> To: SciPy Users List
> Subject: Re: [SciPy-User] interpolation question
>
> On Wed, Sep 23, 2009 at 9:53 AM, Zachary Pincus <zachary.pincus at yale.edu>
> wrote:
>>> Is it possible using scipy to fill the gaps in one list using
>>> another list?
>>>
>>> I have two lists, one with time values and one with meter values.
>>> Both lists are the same length and their indexes are linked
>>> (timelist[i] and depthlist[i] belong to the same location). The
>>> timevalue list has gaps (nodata values).
>>>
>>> The shape of both lists is very similar. What I would like to do is
>>> to use the depthlist shape to fill in the gaps of the timelist. Is
>>> there a method within scipy which does the trick? I first tried
>>> simple regression but that didn't result in a smooth line. I would
>>> like to be able to use the values bordering each gap as control
>>> points.
>>
>>
>> First: do these (depth, time) pairs, sorted by depth, describe a
>> proper function (i.e. no depth with two associated times)? If not,
>> you'll need to figure out a parametric representation of the curve:
>> e.g. use the indices to form (i, depth) and (i, time) pairs and
>> parameterize a 2D curve.
>>
>> Assuming the former case, the first thing you'll need to do is make
>> three lists: depth_in with the depth values which have associated
>> times, time_in with those time values, and depth_out, with the depth
>> values that have no associated times.
>>
>> times_out = numpy.interp(depth_out, depth_in, time_in) will then
>> linearly interpolate the time values for the depths in the depth_out
>> list.
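>>
>> For example, a minimal sketch, assuming the nodata entries in the time
>> list are encoded as NaN (the actual nodata marker in your lists is an
>> assumption here):
>>
>> import numpy
>> time = numpy.asarray(timelist, dtype=float)
>> depth = numpy.asarray(depthlist, dtype=float)
>> gap = numpy.isnan(time)           # True where time has no data
>> depth_in, time_in = depth[~gap], time[~gap]
>> depth_out = depth[gap]
>> # numpy.interp expects depth_in sorted ascending
>> order = numpy.argsort(depth_in)
>> time[gap] = numpy.interp(depth_out, depth_in[order], time_in[order])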
>>
>> You can also use the spline interpolation routines in
>> scipy.interpolate. There are some convenient wrapper classes, but the
>> ones I use most often are the raw fitpack routines:
>>
>> # smoothing: upper bound on the sum of squared distances between the
>> # fitted curve and the (depth, time) input points. Can be zero, but
>> # then the spline may be prone to ringing.
>> # order: degree of the spline; 1 is linear, 3 (cubic) is recommended
>> # in general (splrep requires 1 <= k <= 5).
>> tck = scipy.interpolate.splrep(depth_in, time_in, k=order, s=smoothing)
>> times_out = scipy.interpolate.splev(depth_out, tck)
>>
>> You'll probably want to look at what times the spline gives back for
>> the depths in depth_in, to gauge whether the curve is being
>> over-smoothed.
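>>
>> For instance, something like
>>
>> residuals = time_in - scipy.interpolate.splev(depth_in, tck)
>>
>> where large residuals at the known depths suggest s is set too high.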
>>
>> In the parametric case, you'll want to either do two rounds of linear
>> interpolation for the (i, depth) and (i, time) curves, or use
>> scipy.interpolate.splprep.
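>>
>> A rough sketch of the splprep route (untested; the 200 evaluation
>> points are an arbitrary choice):
>>
>> import numpy
>> from scipy import interpolate
>> # fit a parametric spline through the (depth, time) curve
>> tck, u = interpolate.splprep([depth_in, time_in], k=3, s=smoothing)
>> # splprep normalizes the parameter to [0, 1]; evaluate on a fine grid
>> depth_new, time_new = interpolate.splev(numpy.linspace(0, 1, 200), tck)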
>>
>> Also there is scipy.interpolate.Rbf which uses radial basis functions
>> to interpolate scattered data points. This might be a very smooth way
>> of interpolating your data as well.
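>>
>> A minimal sketch with Rbf (the default basis is multiquadric; a
>> smooth value > 0 relaxes exact interpolation, and the 0.1 here is
>> just a placeholder):
>>
>> from scipy import interpolate
>> rbf = interpolate.Rbf(depth_in, time_in, smooth=0.1)
>> times_out = rbf(depth_out)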
>>
>> Zach
>>
>
> If you want to use the regression approach instead of interpolation, then
> I would try kernel regression, which is locally much smoother than
> linear regression, but might in the end give results very similar to
> scipy.interpolate.Rbf. Except for some toy examples, though, I haven't
> used it.

If you are interested, you could try out the attached code. I wrote this some
time ago and posted it to the mailing list, but I don't think it has seen any
real use. Eventually, I would like to include something like this in
scikits.statsmodels.

Josef

>
> Josef
>
-------------- next part --------------
'''Kernel Ridge Regression for local non-parametric regression'''


import numpy as np
from scipy import spatial as ssp
from numpy.testing import assert_equal
import matplotlib.pyplot as plt

def plt_closeall(n=10):
    '''close a number of open matplotlib windows'''
    for i in range(n):
        plt.close()

def kernel_rbf(x, y, scale=1, **kwds):
    '''gaussian (rbf) kernel matrix between rows of x and rows of y'''
    # squared euclidean distances between all row pairs
    dist = ssp.minkowski_distance_p(x[:, np.newaxis, :], y[np.newaxis, :, :], 2)
    return np.exp(-0.5 / scale * dist)

def kernel_euclid(x, y, p=2, **kwds):
    '''p-norm distance matrix between rows of x and rows of y'''
    return ssp.minkowski_distance(x[:, np.newaxis, :], y[np.newaxis, :, :], p)

class GaussProcess(object):
    '''class to perform kernel ridge regression (Gaussian process regression)

    Warning: this class is memory intensive. It creates an nobs x nobs
    distance matrix and its inverse, where nobs is the number of rows
    (observations). See the sparse version for larger numbers of
    observations.

    Notes
    -----

    Todo:
    * normalize multidimensional x array on demand, either by var or cov
    * add confidence band
    * automatic selection or proposal of smoothing parameters

    Reference
    ---------

    Rasmussen, C.E. and C.K.I. Williams, 2006, Gaussian Processes for Machine
    Learning, the MIT Press, www.GaussianProcess.org/gpal, chapter 2
    '''
    
    def __init__(self, x,y=None, kernel=kernel_rbf,
                 scale=0.5, ridgecoeff = 1e-10, **kwds ):    
        '''        
        Parameters
        ----------
        x : 2d array (N,K)
           data array of explanatory variables; columns represent variables,
           rows represent observations
        y : 2d array (N,1) (optional)
           endogenous variable to be fitted or predicted;
           can alternatively be given as a parameter to the fit method
        kernel : function, default: kernel_rbf
           function that takes two 2d arrays as parameters and returns the
           kernel or distance matrix
        scale : float (optional)
           smoothing parameter for the rbf kernel
        ridgecoeff : float (optional)
           coefficient that multiplies the identity matrix in the
           ridge regression
        
        Notes
        -----
        After initialization, the kernel matrix is calculated. If y is given,
        the linear regression parameters and the fitted or estimated y values,
        yest, are also calculated; yest is then available as an attribute.

        Both scale and the ridge coefficient smooth the fitted curve.
        '''
        
        self.x = x
        self.kernel = kernel
        self.scale = scale
        self.ridgecoeff = ridgecoeff
        self.distxsample = kernel(x, x, scale=scale)
        # regularized inverse of the kernel matrix: (K + ridgecoeff*I)^(-1)
        self.Kinv = np.linalg.inv(self.distxsample +
                             np.eye(*self.distxsample.shape)*ridgecoeff)
        if y is not None:
            self.y = y
            self.yest = self.fit(y)
            

    def fit(self, y):
        '''fit the training explanatory variables to a sample output variable'''
        # ridge regression solution: parest = (K + ridgecoeff*I)^(-1) y
        self.parest = np.dot(self.Kinv, y)
        yhat = np.dot(self.distxsample, self.parest)
        return yhat
        
    def predict(self,x):
        '''predict new y values for a given array of explanatory variables'''
        self.xpredict = x
        distxpredict = self.kernel(x,self.x,scale=self.scale)
        self.ypredict = np.dot(distxpredict,self.parest)
        return self.ypredict

    def plot(self, y, plt=plt ):
        '''some basic plots'''
        #todo return proper graph handles
        plt.figure()
        plt.plot(self.x,self.y,'bo-',self.x,self.yest,'r.-')
        plt.title('sample (training) points')
        plt.figure()
        plt.plot(self.xpredict,y,'bo-',self.xpredict,self.ypredict,'r.-')
        plt.title('all points')
        
        

def example1():
    m,k = 500,4
    upper = 6
    scale=10
    xs1a = np.linspace(1,upper,m)[:,np.newaxis]
    xs1 = xs1a * np.ones((1, k)) + 1 / (1.0 + np.exp(np.random.randn(m, k)))
    xs1 /= np.std(xs1[::k,:], 0)   # normalize scale, could use cov to normalize
    y1true = np.sum(np.sin(xs1)+np.sqrt(xs1),1)[:,np.newaxis]
    y1 = y1true + 0.250 * np.random.randn(m,1)

    stride = 2  # use only some points as training points, e.g. 2 means every 2nd
    gp1 = GaussProcess(xs1[::stride,:],y1[::stride,:], kernel=kernel_euclid,
                       ridgecoeff=1e-10)
    yhatr1 = gp1.predict(xs1)
    plt.figure()
    plt.plot(y1true, y1,'bo',y1true, yhatr1,'r.')
    plt.title('euclid kernel: true y versus noisy y and estimated y')
    plt.figure()
    plt.plot(y1,'bo-',y1true,'go-',yhatr1,'r.-')
    plt.title('euclid kernel: true (green), noisy (blue) and estimated (red) '+
              'observations')

    gp2 = GaussProcess(xs1[::stride,:],y1[::stride,:], kernel=kernel_rbf,
                       scale=scale, ridgecoeff=1e-1)
    yhatr2 = gp2.predict(xs1)
    plt.figure()
    plt.plot(y1true, y1,'bo',y1true, yhatr2,'r.')
    plt.title('rbf kernel: true versus noisy (blue) and estimated (red) observations')
    plt.figure()
    plt.plot(y1,'bo-',y1true,'go-',yhatr2,'r.-')
    plt.title('rbf kernel: true (green), noisy (blue) and estimated (red) '+
              'observations')
    #gp2.plot(y1)


def example2(m=100, scale=0.01, stride=2):
    #m,k = 100,1
    upper = 6
    xs1 = np.linspace(1,upper,m)[:,np.newaxis]
    y1true = np.sum(np.sin(xs1**2),1)[:,np.newaxis]/xs1
    y1 = y1true + 0.05*np.random.randn(m,1)
    
    ridgecoeff = 1e-10
    #stride = 2  # use only some points as training points, e.g. 2 means every 2nd
    gp1 = GaussProcess(xs1[::stride,:], y1[::stride,:], kernel=kernel_euclid,
                       ridgecoeff=ridgecoeff)
    yhatr1 = gp1.predict(xs1)
    plt.figure()
    plt.plot(y1true, y1,'bo',y1true, yhatr1,'r.')
    plt.title('euclid kernel: true versus noisy (blue) and estimated (red) observations')
    plt.figure()
    plt.plot(y1,'bo-',y1true,'go-',yhatr1,'r.-')
    plt.title('euclid kernel: true (green), noisy (blue) and estimated (red) '+
              'observations')

    gp2 = GaussProcess(xs1[::stride,:],y1[::stride,:], kernel=kernel_rbf,
                       scale=scale, ridgecoeff=1e-2)
    yhatr2 = gp2.predict(xs1)
    plt.figure()
    plt.plot(y1true, y1,'bo',y1true, yhatr2,'r.')
    plt.title('rbf kernel: true versus noisy (blue) and estimated (red) observations')
    plt.figure()
    plt.plot(y1,'bo-',y1true,'go-',yhatr2,'r.-')
    plt.title('rbf kernel: true (green), noisy (blue) and estimated (red) '+
              'observations')
    #gp2.plot(y1)

if __name__ == '__main__':
    example2()
    #example2(m=1000, scale=0.01)
    #example2(m=100, scale=0.5)   # oversmoothing
    #example2(m=2000, scale=0.005) # this looks good for rbf, zoom in
    #example2(m=200, scale=0.01,stride=4)  
    example1()
    plt.show()
    #plt_closeall()   # use this to close the open figure windows
-------------- next part --------------


import numpy as np
import matplotlib.pyplot as plt
from kernridgeregress_class import GaussProcess, kernel_euclid


m,k = 50,4
upper = 6
scale=10
xs1 = np.linspace(1,upper,m)[:,np.newaxis]
#xs1 = xs1a*np.ones((1,4)) + 1/(1.0+np.exp(np.random.randn(m,k)))
#xs1 /= np.std(xs1[::k,:],0)   # normalize scale, could use cov to normalize
y1true = np.sum(np.sin(xs1)+np.sqrt(xs1),1)[:,np.newaxis]
y1 = y1true + 0.010 * np.random.randn(m,1)

stride = 3  # use only some points as training points, e.g. 2 means every 2nd
gp1 = GaussProcess(xs1[::stride,:],y1[::stride,:], kernel=kernel_euclid,
                   ridgecoeff=1e-10)
yhatr1 = gp1.predict(xs1)
plt.figure()
plt.plot(y1true, y1,'bo',y1true, yhatr1,'r.')
plt.title('euclid kernel: true y versus noisy y and estimated y')
plt.figure()
plt.plot(y1,'bo-',y1true,'go-',yhatr1,'r.-')
plt.title('euclid kernel: true (green), noisy (blue) and estimated (red) '+
          'observations')

