# Debugging the process of applying a machine learning algorithm to a dataset

Getting one's hands dirty with applying machine learning theory to a real world dataset is really quite different from simply understanding the theory behind machine learning algorithms. Often, the experience of learning a regressor or a classifier from real a world dataset goes something like this: You have a nice big dataset. You have cleaned it up, normalized it and are ready to apply your chosen algorithm on the dataset in your chosen programming language/statistical computing framework. You find that the generalization error of the model you have learnt is really poor. What do you do next? You can chose a different learning algorithm, one that is more complex, one that is less complex. Or, you can try to supply more training data to the same algorithm in the hope that the generalization error performance improves. What you do next largely determines whether or not you are going to succeed in learning well from the dataset.

It turns out that there is a principled approach to figuring out how to navigate the set of available choices (more/less complex algorithm, more training data). Computational learning theory provides insights into what must be done next. Andrew Ng has a really nice section in his Machine Learning class in which he goes over the most problematic parts of applied machine learning research. According to Andrew, a lot of inexperienced practitioners and startups using machine learning are unable to answer this question, and go about making ad-hoc choices, wasting precious time in the process. I completely agree. I suspect that understanding this aspect of machine learning is also what separates the winners from the rest in machine learning competitions.

The plot on the left show what happens as the number of training samples is increased with a fixed model. The error on the training set actually increases, while the generalization error, i.e. the error on an unseen test sample, decreases. The two converge to an asymptote. Naturally, models with different complexity have asymptots at different error values. In the plot above, model 1 converges to a higher error rate than model 2 because model 1 is not complex enough to capture the structure of the dataset. However, if the amount of training data available is less than a certain threshold, then the less complex model 1 wins. This regime is important because often the amount of training data is fixed -- it is just not possible to obtain any more training data.

The plot on the right shows the change in training error and generalization error as the model complexity is increased, and the size of the training set is held constant. The training set error decreases monotonically, but the generalization error first falls and then increases. This occurs because by choosing a progressively complex model, at some point, the model begins to overfit the training data. That is, given the larger number of degrees of freedom in a more complex model, the model begins to adapt to the noise present in the dataset. This hurts generalization error. One way to get around this is to regularize the model -- i.e. somehow constrain the model parameters, which implicitly reduces the degrees of freedom. If the learning problem can be framed as an optimization problem (e.g. find minimum mean square error), then one way to regularize is to alter the objective function by adding on a term involving the model parameters. Regularization is a flexible means to control the model's capacity so as to avoid overfitting or underfitting.

If application of an ML algorithm shows that training and generalization error rates have leveled out, close to their asymptotic value, then adding additional training samples is not going to help! The only way to improve generalization error would be to use a different, more complex model. On the other hand, if all the available training samples have been used, and increasing the model's complexity seems to be increasing the generalization error, then this points to overfitting. In this case, increasing the model complexity is not going to help. generalization error is usually measured by keeping aside part of the training data (say, 30%) and using only the remaining data for training. A more common technique, termed k-fold cross validation, is to use a smaller portion (say, 5%) as test data, but repeat k-times with random portions kept aside as test data.