All I See is Silhouette

The silhouette plot is such a nice method for visually assessing cluster quality and the degree of cluster membership that we simply couldn’t wait to get it into Orange3. And now we have.

For each data instance, the plot shows its silhouette score, which compares the average distance to the instances in its own cluster with the average distance to the instances in the nearest neighboring cluster: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance and b(i) the mean distance to the nearest cluster. A silhouette close to 1 indicates that the data instance lies well within its cluster and far from the neighboring one, while scores close to 0 mark instances on the border between two clusters. The overall quality of a clustering can be assessed by the average silhouette score across all data instances, but here we are more interested in the individual silhouettes and their visualization in the silhouette plot.
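As a tiny worked example of the formula (with hypothetical distances chosen only for illustration): if an instance has a mean distance of a = 0.5 to its own cluster and b = 2.0 to the nearest other cluster, its silhouette is (2.0 − 0.5) / max(0.5, 2.0) = 0.75; with a = 0.9 and b = 1.0 the silhouette drops to 0.1, a borderline instance.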

Using the good old iris data set, we will assess the silhouettes of the individual data instances. In k-Means we set the number of clusters to 3 and send the clustered data to the Silhouette Plot. Good clusters should consist of instances with high silhouette scores, but here we do the opposite: in the Silhouette Plot we select the instances with scores close to 0 and pass them to other widgets for exploration. No surprise, they lie at the periphery of two clusters, which the scatter plot demonstrates perfectly.
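For readers who prefer scripting to widgets, roughly the same exploration can be reproduced in Python. The sketch below uses scikit-learn rather than Orange itself, and the 0.2 cut-off for "borderline" scores is an arbitrary illustrative choice:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.metrics import silhouette_samples

    X = load_iris().data

    # k-means with three clusters, as in the workflow above
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # per-instance silhouette scores
    sil = silhouette_samples(X, clusters)

    # instances with low scores lie on (or across) the border between two clusters;
    # the 0.2 threshold is arbitrary and only serves the illustration
    borderline = np.where(sil < 0.2)[0]
    print(borderline)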

[Figure: silhouette4]

Let’s do something wild now. We will use the silhouette on the class attribute of iris (no clustering here, just the original class values from the data set). Our hypothesis: the data instances with low silhouette values are also those that will be misclassified by some learning algorithm, say, a random forest.
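In a script, switching from cluster labels to the existing class labels is a one-line change; a minimal sketch, again with scikit-learn and the same illustrative 0.2 cut-off:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import silhouette_samples

    iris = load_iris()
    X, y = iris.data, iris.target

    # silhouette scores computed against the original class labels, no clustering involved
    sil = silhouette_samples(X, y)

    # candidate "hard" instances: low silhouette with respect to their own class
    low_silhouette = np.where(sil < 0.2)[0]
    print(low_silhouette)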

[Figure: silhouette1]

We use ten-fold cross validation in Test & Score, send the evaluation results to the Confusion Matrix and select the misclassified instances in that widget. Then, in a Venn Diagram, we check how many of these misclassified instances also appear in the set of low-silhouette instances. The agreement (i.e. the intersection in the Venn diagram) between the two approaches is quite high.
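The same comparison can be sketched in code. The script below (scikit-learn instead of the widgets, with the illustrative 0.2 cut-off from above) cross-validates a random forest and intersects its misclassifications with the low-silhouette set:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import silhouette_samples
    from sklearn.model_selection import cross_val_predict

    iris = load_iris()
    X, y = iris.data, iris.target

    # ten-fold cross-validated predictions of a random forest
    predictions = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=10)
    misclassified = set(np.where(predictions != y)[0])

    # low-silhouette instances with respect to the original class labels
    low_silhouette = set(np.where(silhouette_samples(X, y) < 0.2)[0])

    # the intersection that the Venn Diagram widget visualizes
    print(len(misclassified & low_silhouette), "of", len(misclassified),
          "misclassified instances also have a low silhouette")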

[Figure: silhouette3]

Finally, we can observe these instances in the Scatter Plot. Classifiers indeed have problems with borderline data instances. Our hypothesis was correct.

[Figure: silhouette4]

The silhouette plot is yet another great visualization that can help you with data analysis or with understanding certain machine learning concepts. What did we say? Fruitful and fun!

Overfitting and Regularization

A week ago I used Orange to explain the effects of regularization. This was the second lecture in the Data Mining class; the first one was on linear regression. My introduction to the benefits of regularization used a simple data set with a single input attribute and a continuous class. I drew a data set in Orange and then used the Polynomial Regression widget (from the Prototypes add-on) to plot the linear fit. This widget can also expand the data set by adding columns with powers of the original attribute x, thereby augmenting the training set with x^p, where x is our original attribute and p an integer going from 2 to K. The polynomial expansion allows the linear regression model to fit the data nicely, and with higher K to overfit it to the extreme, especially when the number of data points in the training set is low.
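Outside the widgets, the expansion and the fit can be mimicked in a few lines. The sketch below uses scikit-learn on a small synthetic data set; the sine-shaped data and the fixed random seed are only stand-ins for the hand-drawn set:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 1, 10))           # a single input attribute, few data points
    y = np.sin(3 * x) + rng.normal(0, 0.1, 10)   # a continuous class

    K = 9                                        # degree of the polynomial expansion
    X = PolynomialFeatures(degree=K, include_bias=False).fit_transform(x.reshape(-1, 1))

    # an unregularized fit on the expanded data set
    model = LinearRegression().fit(X, y)
    print(np.round(model.coef_, 1))              # coefficients grow wildly as K increases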

[Figure: poly-overfit]

We have already blogged about this experiment a while ago, showing that it is easy to see that the linear regression coefficients blow out of proportion with increasing K. This leads to the idea that linear regression should not only minimize the squared error when predicting the value of the dependent variable on the training set, but should also keep the model coefficients low, or better, penalize any high coefficient value. This procedure is called regularization. Based on the type of penalty (the sum of squared coefficients or the sum of their absolute values), the regularization is referred to as L2 or L1, also known as ridge and lasso regression, respectively.

It is quite easy to play with regularized models in Orange by attaching a Linear Regression widget to Polynomial Regression, thereby substituting the default model used in Polynomial Regression with the one configured in the Linear Regression widget. This makes different kinds of regularization available. The workflow can be used to show that regularized models overfit the data less, and that the degree of overfitting depends on the regularization strength, which governs the size of the penalty stemming from the values of the coefficients of the linear model.
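In scripting terms, substituting the model amounts to swapping the unregularized LinearRegression for a regularized learner. A sketch with ridge (L2) regression on the same illustrative data as above, where alpha plays the role of the regularization strength set in the widget:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 1, 10))
    y = np.sin(3 * x) + rng.normal(0, 0.1, 10)
    X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x.reshape(-1, 1))

    # larger alpha -> stronger penalty on the coefficients -> less overfitting
    for alpha in (0.0001, 0.01, 1.0):
        model = Ridge(alpha=alpha).fit(X, y)
        print(alpha, np.round(np.abs(model.coef_).max(), 2))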

[Figure: poly-l2]

I also use this workflow to show the difference between L1 and L2 regularization. The change in the type of regularization is most pronounced in the table of coefficients (Data Table widget), where with L1 regularization it is clear that many of the coefficients are pushed to exactly 0. Try this with a high degree of the polynomial expansion and a data set with about 10 data points. Also, try changing the regularization strength (Linear Regression widget).
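The same contrast can be seen in a script: at a comparable regularization strength, lasso (L1) pushes many coefficients to exactly 0, while ridge (L2) only shrinks them. A sketch on the illustrative data from above (lasso may emit a convergence warning on such a tiny, unscaled set, which is harmless here):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 1, 10))
    y = np.sin(3 * x) + rng.normal(0, 0.1, 10)
    X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x.reshape(-1, 1))

    ridge = Ridge(alpha=0.01).fit(X, y)                    # L2: coefficients shrink
    lasso = Lasso(alpha=0.01, max_iter=100000).fit(X, y)   # L1: many coefficients become exactly 0

    print("ridge:", np.round(ridge.coef_, 3))
    print("lasso:", np.round(lasso.coef_, 3))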

[Figure: poly-l1]

While the effects of overfitting and regularization are nicely visible in the plot of the Polynomial Regression widget, machine learning models are really about predictions, and the quality of predictions should be estimated on an independent test set. So at this stage of the lecture I needed to introduce model scoring, that is, a measure that tells me how well a model inferred from the training set performs on the test set. For simplicity, I chose to introduce the root mean squared error (RMSE, the square root of the average squared difference between the predicted and the true values) and then crafted the following workflow.

[Figure: poly-evaluate]

Here, I drew the data set (Paint Data, about 20 data instances), assigned y as the target variable (Select Columns), split the data into training and test sets of approximately equal size (Data Sampler), and passed the training and test data and the linear model to the Test & Score widget. I then used linear regression with no regularization and observed how RMSE changes with the degree of the polynomial, alternating between "Test on train data" and "Test on test data" in the Test & Score widget. In the class I used the blackboard to record this dependency. For the data from the figure, I got the following table:

Poly K    RMSE (train)    RMSE (test)
0         0.147           0.138
1         0.155           0.192
2         0.049           0.063
3         0.049           0.063
4         0.049           0.067
5         0.040           0.408
6         0.040           0.574
7         0.033           2.681
8         0.001           5.734
9         0.000           4.776

That’s it. For a class of computer scientists one could do all of this in scripting, but for any other audience, or for any introductory lesson, explaining regularization with Orange widgets is a lot of fun.
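For instance, a scripting sketch of the whole experiment, with scikit-learn instead of the widgets and a synthetic stand-in for the painted data (so the numbers will not reproduce the table above exactly), might look like this:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    x = rng.uniform(0, 1, 20)                    # about 20 "painted" data instances
    y = np.sin(3 * x) + rng.normal(0, 0.1, 20)

    # split into training and test sets of approximately equal size
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

    def rmse(y_true, y_pred):
        # root mean squared error
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    for k in range(10):                          # degree of the polynomial expansion
        X_train = np.vander(x_train, k + 1, increasing=True)   # columns x^0 ... x^k
        X_test = np.vander(x_test, k + 1, increasing=True)
        model = LinearRegression(fit_intercept=False).fit(X_train, y_train)
        print(k, round(rmse(y_train, model.predict(X_train)), 3),
              round(rmse(y_test, model.predict(X_test)), 3))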

Orange at Google Summer of Code 2016

The Orange team is extremely excited to be a part of this year’s Google Summer of Code! GSoC is a great opportunity for students around the world to spend their summer contributing to open-source software, gaining experience and earning money.

Accepted students will help us develop Orange (or another chosen OSS project) from May to August. Each student is expected to select and define a project of their interest and will be assigned a mentor to guide them through the entire process.

Apply here:

https://summerofcode.withgoogle.com/

Orange’s project proposals (we accept your own ideas as well!):

https://github.com/biolab/orange3/wiki/GSoC-2016

Our GSoC community forum:

https://groups.google.com/forum/#!forum/orange-gsoc

Spread the word! (and don’t forget to apply 😉 )
