Notes on non-parametric regression & smoothing

Ordinary least squares regression analysis makes a few strong assumptions about the data:

• A linear relationship of y to the x’s, so that $$y \mid x = f(x) + \epsilon$$ with \(f\) linear in the x’s.
• That the conditional distribution of y is, except for its mean, everywhere the same, and that this distribution is a normal distribution $$y \sim \mathcal{N}(f(x), \sigma^2)$$
• That observations are sampled independently, so that \(y_i\) and \(y_j\) are independent for \(i \neq j\).

There are a number of ways in which this can go wrong, for example:
• The errors may not be independent (common in time series data)
• The conditional variance of the residual error may not be constant in \(x\)
• The conditional distribution of \(y\) may be very non-Gaussian.

There are a few ways out, centering on non-parametric methods.

Kernel Estimation: This is really a non-parametric smoothing method. The y value estimated for a given input x is a weighted combination of the labelled data points, where the weight of each labelled point depends upon its distance from the input x.

$$\hat{y} = \frac{\sum_i w_i y_i}{\sum_i w_i},$$ where \(w_i = K\left(\frac{x - x_i}{h}\right)\). \(K(\cdot)\) is a kernel function, commonly taken to be the Gaussian kernel, and \(h\) is a bandwidth parameter that dictates the extent to which labelled data points exert their influence in the weighted average. This weighted estimator is also called the Nadaraya-Watson estimator. Unlike LOESS, however, there is no fitting of parameters involved: the y value is simply the weighted average of the y values of all the labelled data points.
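A minimal sketch of the Nadaraya-Watson estimator in NumPy, assuming a Gaussian kernel and one-dimensional inputs (the function names and toy data are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x_query, x_train, y_train, h):
    """Kernel-weighted average of the training y's at each query point.

    x_query          : (m,) points at which to estimate y
    x_train, y_train : (n,) labelled data
    h                : bandwidth controlling how far each point's influence extends
    """
    # Pairwise scaled distances between query and training points, shape (m, n)
    u = (x_query[:, None] - x_train[None, :]) / h
    w = gaussian_kernel(u)
    # Weighted average of the training targets for each query point
    return (w @ y_train) / w.sum(axis=1)

# Toy usage: smooth noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.standard_normal(100)
x_grid = np.linspace(0, 2 * np.pi, 200)
y_hat = nadaraya_watson(x_grid, x, y, h=0.3)
```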

LOESS / LOWESS: A non-parametric method (i.e. the amount of data one needs to keep around for the model grows linearly with the amount of training data) where, for every new input x at which the value y is desired, one minimises a weighted version of the quadratic loss used in OLS linear regression, fitting a local model whose prediction at x is the estimate. A weight is attached to each training data point, and the weight of a training point depends upon its distance from the new input x.
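A sketch of the idea in NumPy, assuming Gaussian distance weights and a local linear (degree-one) fit; classical LOWESS uses tricube weights and robustness iterations, which are omitted here:

```python
import numpy as np

def lowess_point(x0, x_train, y_train, h):
    """Locally weighted linear fit at a single query point x0.

    Solves a weighted least-squares problem where training points near x0
    get the most weight, then evaluates the fitted line at x0.
    """
    # Gaussian weights by distance from x0 (tricube weights are the classical choice)
    w = np.exp(-0.5 * ((x_train - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x_train), x_train])
    # Weighted normal equations: (X^T W X) beta = X^T W y
    WX = w[:, None] * X
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * y_train))
    return beta[0] + beta[1] * x0

def lowess(x_query, x_train, y_train, h=0.3):
    """Evaluate the locally weighted fit at each query point."""
    return np.array([lowess_point(x0, x_train, y_train, h) for x0 in x_query])
```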

SVM regression: Transform the data to a higher dimension using the kernel trick and then perform linear regression using a specific type of loss function (epsilon-insensitive) rather than the usual quadratic (squared) loss. The epsilon-insensitive loss encourages sparsity in the support vectors: points whose residuals fall inside the epsilon tube incur no loss and do not become support vectors. Imagine doing polynomial regression in the original space by including polynomial terms of the data; SVM regression achieves a similar effect via the non-linear kernel transform.
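A minimal usage sketch with scikit-learn's SVR, assuming scikit-learn is available (the toy data and hyperparameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 200))[:, None]   # sklearn expects a 2-D design matrix
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(200)

# The RBF kernel plays the role of the implicit high-dimensional transform;
# epsilon sets the width of the "no-penalty" tube around the fitted function.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

y_hat = model.predict(X)
print("number of support vectors:", len(model.support_))   # sparsity from the epsilon tube
```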

Gaussian Process regression: Also a non-parametric method, somewhat similar to LOESS in spirit in that all the data points need to be kept around (as in any non-parametric method), but this is a Bayesian method: it gives not just a point estimate of y for a new input x, but a complete posterior distribution. This is achieved by placing a prior distribution on the values of a Gaussian Process (a GP is simply a collection of random variables, any finite subset of which has a joint Gaussian distribution) via a covariance kernel function that specifies the covariance between two Gaussian random variables in the GP given their x values (typically, only the distance between their x values matters). The posterior distribution of y values for a given set of input x values also turns out to be a Gaussian distribution with an easily computable mean and covariance matrix.
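A sketch of that posterior computation in NumPy, using the standard GP regression formulas \(\mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} y\) and \(\Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*\); the squared-exponential kernel and noise level chosen here are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance: depends only on the distance between inputs."""
    d = xa[:, None] - xb[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.05, **kern_args):
    """Posterior mean and covariance of a GP regression with an RBF prior kernel."""
    K = rbf_kernel(x_train, x_train, **kern_args) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, **kern_args)    # train vs. test covariances
    K_ss = rbf_kernel(x_test, x_test, **kern_args)    # test vs. test covariances
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy usage: posterior mean and pointwise uncertainty on a grid
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)
x_star = np.linspace(0, 2 * np.pi, 100)
mu, cov = gp_posterior(x, y, x_star)
std = np.sqrt(np.maximum(np.diag(cov), 0.0))   # clip tiny negative values from round-off
```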