Ordinary least squares regression analysis makes a few strong assumptions about the data:
There are a number of ways in this things can go wrong, for example:
• The errors may not be independent (common in time series data)
• The conditional variance of the residual error may not be constant in \(x\)
• The conditional distribution of \(y\) may be very non-Gaussian.
There are a few ways out, centering around non-parametric methods.
Kernel Estimation: This is really a non-parametric smoothing method. The y value estimated for a given input x depends upon the weight of each labelled data point which in turn depends upon the distance between the input x and the labelled data points.
LOWESS / LOWESS: A non-parametric method (i.e. the mount of data one needs to keep around for a model specification grows linearly with the amount of training data) where, for every new input x for which the value y is desired, one computes a weighted version of the quadratic loss used in OLS linear regression. A weight is attached to each training data point, and the weight of a training point depends upon its the distance from the new input x.
SVM regression: Transform data to a higher dimension using the kernel trick and then perform linear regression using a specific type of loss function (epsilon - insensitive) rather than the usual quadratic (squared loss) function. The epsilon-insensitive loss function encourages the support vectors to be sparse. Imagine doing polynomial regression in the original space by including polynomial terms of the data. The SVM regression achieves a similar effect via the non-linear kernel transform.
Gaussian Process regression: Also a non-parametric method, somewhat similar to LOESS in spirit in that all the data points need to be kept around (as in any non-parametric method), but this is a Bayesian method in that it allows not just a point estimate of y for a new input x, but a complete posterior distribution. This is achieved by placing a prior distribution on the values of a Guassian Process (A GP is simply a collection of random variables, any subset of which have a joint Gaussian distribution) via a covariance kernel function that only specifies the covariance between two Guassian Random variables in the GP given their x values (typically, only the distance between their x values matters). The posterior distribution of Y values for a given set of input X values also turns out to be a Gaussian distribution with an easily computable mean and covariance matrix.