Overfitting and Regularization

A week ago I used Orange to explain the effects of regularization. This was the second lecture in the Data Mining class, the first one was on linear regression. My introduction to the benefits of regularization used a simple data set with a single input attribute and a continuous class. I drew a data set in Orange, and then used Polynomial Regression widget (from Prototypes add-on) to plot the linear fit. This widget can also expand the data set by adding columns with powers of original attribute x, thereby augmenting the training set with x^p, where x is our original attribute and p an integer going from 2 to K. The polynomial expansion of data sets allows linear regression model to nicely fit the data, and with higher K to overfit it to extreme, especially if the number of data points in the training set is low.


We have already blogged about this experiment a while ago, showing that it is easy to see that linear regression coefficients blow out of proportion with increasing K. This leads to the idea that linear regression should not only minimize the squared error when predicting the value of dependent variable in the training set, but also keep model coefficients low, or better, penalize any high value of coefficients. This procedure is called regularization. Based on the type of penalty (sum of coefficient squared or sum of absolute values), the regularization is referred to L1 or L2, or, ridge and lasso regression.

It is quite easy to play with regularized models in Orange by attaching a Linear Regression widget to Polynomial Regression, in this way substituting the default model used in Polynomial Regression with the one designed in Linear Regression widget. This makes available different kinds of regularization. This workflow can be used to show that the regularized models less overfit the data, and that the overfitting depends on the regularization coefficient which governs the degree of penalty stemming from the value of coefficients of the linear model.


I also use this workflow to show the difference between L1 and L2 regularization. The change of the type of regularization is most pronounced in the table of coefficients (Data Table widget), where with L1 regularization it is clear that this procedure results in many of those being 0. Try this with high value for degree of polynomial expansion, and a data set with about 10 data points. Also, try changing the regularization regularization strength (Linear Regression widget).


While the effects of overfitting and regularization are nicely visible in the plot in Polynomial Regression widget, machine learning models are really about predictions. And the quality of predictions should really be estimated on independent test set. So at this stage of the lecture I needed to introduce the model scoring, that is, a measure that tells me how well my model inferred on the training set performs on the test set. For simplicity, I chose to introduce root mean squared error (RMSE) and then crafted the following workflow.


Here, I draw the data set (Paint Data, about 20 data instances), assigned y as the target variable (Select Columns), split the data to training and test sets of approximately equal sizes (Data Sampler), and pass training and test data and linear model to the Test & Score widget. Then I can use linear regression with no regularization, and expect how RMSE changes with changing the degree of the polynomial. I can alternate between Test on train data and Test on test data (Test & Score widget). In the class I have used the blackboard to record this dependency. For the data from the figure, I got the following table:

Poly K RMSE Train RMSE Test
0 0.147 0.138
1 0.155 0.192
2 0.049 0.063
3 0.049 0.063
4 0.049 0.067
5 0.040 0.408
6 0.040 0.574
7 0.033 2.681
8 0.001 5.734
9 0.000 4.776

That’s it. For the class of computer scientists, one may do all this in scripting, but for any other audience, or for any introductory lesson, explaining of regularization with Orange widgets is a lot of fun.

Model-Based Feature Scoring

Feature scoring and ranking can help in understanding the data in supervised settings. Orange includes a number of standard feature scoring procedures one can access in the Rank widget. Moreover, a number of modeling techniques, like linear or logistic regression, can rank features explicitly through assignment of weights. Trained models like random forests have their own methods for feature scoring. Models inferred by these modeling techniques depend on their parameters, like type and level of regularization for logistic regression. Same holds for feature weight: any change of parameters of the modeling techniques would change the resulting feature scores.

It would thus be great if we could observe these changes and compare feature ranking provided by various machine learning methods. For this purpose, the Rank widget recently got a new input channel called scorer. We can attach any learner that can provide feature scores to the input of Rank, and then observe the ranking in the Rank table.


Say, for the famous voting data set (File widget, Browse documentation data sets), the last two feature score columns were obtained by random forest and logistic regression with L1 regularization (C=0.1). Try changing the regularization parameter and type to see changes in feature scores.


Feature weights for logistic and linear regression correspond to the absolute value of coefficients of their linear models. To observe their untransformed values in the table, these widgets now also output a data table with feature weights. (At the time of the writing of this blog, this feature has been implemented for linear regression; other classifiers and regressors that can estimate feature weights will be updated soon).


A visit from the Tilburg University

Biolab is currently hosting two amazing data scientists from the Tilburg University – dr. Marie Nilsen and dr. Eric Postma, who are preparing a 20-lecture MOOC on data science for non-technical audience. A part of the course will use Orange. The majority of their students is coming from humanities, law, economy and behavioral studies, thus we are discussing options and opportunities for adapting Orange for social scientists. Another great thing is that the course is designed for beginner level data miners, showcasing that anybody can mine the data and learn from it. And then consult with statisticians and data mining expert (of course!).

Biolab team with Marie and Eric, who is standing next to Ivan Cankar - the very serious guy in the middle.
Biolab team with Marie and Eric, who is standing next to Ivan Cankar – the very serious guy in the middle.


To honor this occasion we invite you to check out the Polynomial regression widget, which is specially intended for educational use. There, you can showcase the problem of overfitting through visualization.

First, we set up a workflow.


Then we paint, say, at most 10 points into the Paint Data widget. (Why at most ten? You’ll see later.)



Now we open our Polynomial Regression widget, where we play with polynomial degree. Polynomial Degree 1 gives us a line. With coefficient 2 we get a curve that fits only one point. However, with the coefficient 7 we fit all the points with one curve. Yay!





But hold on! The curve now becomes very steep. Would the lower end of the curve at about (0.9, -2.2) still be a realistic estimate of our data set? Probably not. Even when we look at the Data Table with coefficient values, they seem to skyrocket.



This is a typical danger of overfitting, which is often hard to explain, but with the help of these three widgets becomes as clear as day!
Now go out and share the knowledge!!!

New in Orange: Partial least squares regression

Partial least squares regression is a regression technique which supports multiple response variables. PLS regression is very popular in areas such as bioinformatics, chemometrics etc. where the number of observations is usually less than the number of measured variables and where there exists multicollinearity among the predictor variables. In such situations, standard regression techniques would usually fail. The PLS regression is now available in Orange (see documentation)!

You can use PLS regression model on single-target or multi-target data sets. Simply load the data set multitarget-synthetic.tab and see that it contains three predictor variables and four responses using this code.

data = Orange.data.Table("multitarget-synthetic.tab")
print "Input variables:"
print data.domain.features
print "Response variables:"
print data.domain.class_vars


Input variables:
<FloatVariable 'X1', FloatVariable 'X2', FloatVariable 'X3'>
Response variables:
<FloatVariable 'Y1', FloatVariable 'Y2', FloatVariable 'Y3', FloatVariable 'Y4'>

As you can see, all variables in this data set are continuous. The PLS regression is intended forsuch situations although it can be used for discrete input variables as well (using 0-1 continuation). Currently, discrete response variables are not yet supported.

Let’s try to fit the PLS regression model on our data set.

learner = Orange.multitarget.pls.PLSRegressionLearner()
classifier = learner(data)

The classifier can be now used to predict values of the four responses based onthree predictors. Let’s see how it manages this task on the first data instance.

actual = data[0].get_classes()
predicted = classifier(data[0]) 

print "Actual", "Predicted"
for a, p in zip(actual, predicted):
    print "%6.3f %6.3f" % (a,p)


Actual Predicted
 0.490  0.613
 1.237  0.826
 1.808  1.084
 0.422  0.534

To test the usefulness of PLS as a multi-target method let’s compare it to a single-target method – linear regression. We did this by comparing Root mean squared error (RMSE) of predicted values for a single response variable. We constructed synthetic data sets and performed the RMSE analysis using this script. The results can be seen in the following output:

    Training set sizes      5     10     20     50    100    200    500   1000
Linear (single-target) 0.5769 0.3128 0.2703 0.2529 0.2493 0.2446 0.2436 0.2442
    PLS (multi-target) 0.3663 0.2955 0.2623 0.2517 0.2487 0.2447 0.2441 0.2448

We can see that PLS regression outperforms linear regression when the number of training instances is low. Such situations (low number of instances compared to high number of variables) are quite common when analyzing data sets in bioinformatics. However, with increasing number of training instances, the advantages of PLS regression diminish.

Earth – Multivariate adaptive regression splines

There have recently been some additions to the lineup of Orange learners. One of these is Orange.regression.earth.EarthLearner. It is an Orange interface to the Earth library written by Stephen Milborrow implementing Multivariate adaptive regression splines.

So lets take it out for a spin on a simple toy dataset (data.tab – created using the Paint Data widget in the Orange Canvas):

import Orange
from Orange.regression import earth
import numpy
from matplotlib import pylab as pl

data = Orange.data.Table("data.tab")
earth_predictor = earth.EarthLearner(data)

X, Y = data.to_numpy("A/C")

pl.plot(X, Y, ".r")

linspace = numpy.linspace(min(X), max(X), 20)
predictions = [earth_predictor([s, "?"]) for s in linspace]

pl.plot(linspace, predictions, "-b")

which produces the following plot:

Earth predicitons

We can also print the model representation using

print earth_predictor

which outputs:

Y =
   +1.198 * max(0, X - 0.485)
   -1.803 * max(0, 0.485 - X)
   -1.321 * max(0, X - 0.283)
   -1.609 * max(0, X - 0.640)
   +1.591 * max(0, X - 0.907)

See Orange.regression.earth reference for full documentation.

(Edit: Added link to the dataset file)

Faster classification and regression trees

SimpleTreeLearner is an implementation of classification and regression trees that sacrifices flexibility for speed. A benchmark on 42 different datasets reveals that SimpleTreeLearner is 11 times faster than the original TreeLearner.

The motivation behind developing a new tree induction algorithm from scratch was to speed up the construction of random forests, but you can also use it as a standalone learner. SimpleTreeLearner uses gain ratio for classification and MSE for regression and can handle unknown values.

Comparison with TreeLearner

The graph below shows SimpleTreeLearner construction times on datasets bundled with Orange normalized to TreeLearner. Smaller is better.

SimpleTreeLearner speed

The harmonic mean (average speedup) on all the benchmarks is 11.4.


The user can set four parameters:

Maximal proportion of majority class.
Minimal number of examples in leaves.
Maximal depth of tree.
At every split an attribute will be skipped with probability skipProb. This parameter is especially useful for building random forests.

The code snippet below demonstrates the basic usage of SimpleTreeLearner. It behaves much like any other Orange learner would.

import Orange

data = Orange.data.Table("iris")

# build classifier and classify train data
classifier = Orange.classification.tree.SimpleTreeLearner(data, maxMajority=0.8)
for ex in data:
    print classifier(ex)

# estimate classification accuracy with cross-validation
learner = Orange.classification.tree.SimpleTreeLearner(minExamples=2)
result = Orange.evaluation.testing.cross_validation([learner], data)
print 'CA:', Orange.evaluation.scoring.CA(result)[0]