# Visualization of Classification Probabilities

This is a guest blog from the Google Summer of Code project.

The Polynomial Classification widget was implemented as part of my Google Summer of Code project, along with other widgets in the educational add-on (see my previous blog). It visualizes probabilities for two-class classification (target vs. rest) using a color gradient and contour lines, and it can do so for any Orange learner.

Here is an example workflow. The data comes from the File widget. With no learner on input, the default is Logistic Regression. The widget outputs the learner's Coefficients, the Classifier (model) and the Learner.

The Polynomial Classification widget works on two continuous features only; all other features are ignored. The screenshot shows a classification plot for the Iris data set.

1. Set the name of the learner. This is the name of the learner on the output.
2. Set the features that logistic regression is performed on.
3. Set the class that is classified separately from the other classes.
4. Set the degree of the polynomial used to transform the input data (1 means attributes are not transformed).
5. Select whether to show contour lines in the chart. The density of contours is regulated by Contour step.

The classification in our case fails to separate Iris-versicolor from the other two classes. This is because logistic regression is a linear classifier, and because there is no linear combination of the chosen two attributes that would make for a good decision boundary. We can change that. Polynomial expansion adds features that are polynomial combinations of the original ones. For example, if the input data contains features [a, b], polynomial expansion of degree two generates the feature space [1, a, b, a², ab, b²]. With this expansion, the classification boundary looks great.
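To make the expansion concrete, here is a minimal sketch in plain Python. It is not the widget's own code; the function name `polynomial_expansion` is made up for this illustration, and the ordering of terms simply follows the example above.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expansion(row, degree):
    """Return all monomials of the values in `row` up to the given degree.

    Terms are ordered by total degree, starting with the constant term 1.
    (Hypothetical helper for illustration, not part of Orange.)
    """
    expanded = []
    for d in range(degree + 1):
        # Each combination picks d features (with repetition) to multiply.
        # For d = 0 the only combination is empty, and its product is 1.
        for combo in combinations_with_replacement(row, d):
            expanded.append(prod(combo))
    return expanded


a, b = 2.0, 3.0
# Terms in order: 1, a, b, a*a, a*b, b*b
print(polynomial_expansion([a, b], 2))
```

A linear classifier trained on the expanded features can draw curved boundaries in the original two-dimensional plot, because a straight line in the expanded space corresponds to a polynomial curve in the space of a and b.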

Polynomial Classification also works well with other learners. Below we have given it a Classification Tree. This time we have painted the input data using Paint Data, a great data generator used while learning about Orange and data science. The decision boundaries for the tree are all square, a well-known limitation for tree-based learners.

Polynomial expansion of high degree may be dangerous. The following example shows overfitting when the degree is five. See the two outliers, a blue one at the top and a red one at the lower right of the plot? The classifier was needlessly able to separate the outliers from the pack, something that will become problematic when the classifier is used on new data.

Overfitting is one of the central problems in machine learning. You are welcome to read our previous blog on this problem and possible solutions.

# Interactive k-Means

This is a guest blog from the Google Summer of Code project.

As a part of my Google Summer of Code project I started developing educational widgets and assembling them in an Educational Add-On for Orange. Educational widgets can be used by students to understand how some key data mining algorithms work and by teachers to demonstrate the working of these algorithms.

Here I describe an educational widget for interactive k-means clustering, an algorithm that splits the data into clusters by finding cluster centroids such that the distance between data points and their corresponding centroid is minimized. The number of clusters in the k-means algorithm is denoted by k and has to be specified manually.

The algorithm starts by randomly positioning the centroids in the data space, and then improves their positions by repeating the following two steps:

1. Assign each point to the closest centroid.
2. Move each centroid to the mean position of the points assigned to it.
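The two steps can be sketched in a few lines of plain Python. This is only an illustration, not the widget's code: the function names are invented here, the starting centroids are fixed rather than random so the run is repeatable, and leaving an empty centroid in place is an assumption of this sketch.

```python
from math import dist


def assign_points(points, centroids):
    """Step 1: return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def move_centroids(points, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    moved = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: a centroid with no points stays where it is.
            moved.append(old)
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def k_means(points, centroids, max_steps=100):
    """Repeat both steps until the centroids stop moving."""
    for _ in range(max_steps):
        labels = assign_points(points, centroids)
        new_centroids = move_centroids(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
start = [(0.0, 0.0), (10.0, 10.0)]  # fixed start instead of a random one
centroids, labels = k_means(points, start)
print(labels)     # cluster index for each point
print(centroids)  # final centroid positions
```

Calling `assign_points` and `move_centroids` by hand, one after the other, mirrors what stepping through the algorithm looks like.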

The widget needs data, which can come from the File widget, and outputs the information on clusters (Annotated Data) and centroids.

The educational widget for k-means finds clusters based on two continuous features only; all other features are ignored. The screenshot shows a plot of the Iris data set and clustering with k=3. That is partially cheating, because we know that the Iris data set has three classes, so we can check whether the clusters correspond well to the original classes.

1. Select two features that are used in k-means
2. Set number of centroids
3. Randomize positions of centroids
4. Show lines between centroids and corresponding points
5. Perform the algorithm step by step. Reassign membership connects points to the nearest centroid; Recompute centroids moves the centroids.
6. Step back in the algorithm
7. Set speed of automatic stepping
8. Perform the whole algorithm as fast preview
9. At any time, we can change the number of centroids with the spinner or by clicking at the desired position in the graph.

If we want to see how the clusters found by k-means correspond to the classes, we can open the Data Table widget, where we see that all Iris-setosas are grouped in one cluster, while just a few Iris-versicolor are placed in the same cluster as Iris-virginica and vice versa.

Interactive k-means works great in combination with Paint Data. There, we can design data sets where k-means fails, and observe why.

We could also design data sets where k-means fails under a specific initialization of centroids. Ah, I did not tell you that you can freely move the centroids and then restart the algorithm. Below we show one such centroid initialization and how it leads to non-optimal clustering.
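The same effect can be reproduced on a tiny hand-made data set. The sketch below is an illustration only, with invented function names and hand-picked starting centroids: four points sit at the corners of a wide rectangle, and the outcome depends entirely on where the two centroids start.

```python
from math import dist


def run_k_means(points, centroids, steps=20):
    """Plain k-means from the given starting centroids (fixed number of steps)."""
    for _ in range(steps):
        # Group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else c
            for group, c in zip(clusters, centroids)
        ]
    return centroids, clusters


def total_squared_distance(centroids, clusters):
    """Sum of squared distances between points and their centroid."""
    return sum(dist(p, c) ** 2 for c, group in zip(centroids, clusters) for p in group)


# Four corners of a wide rectangle: the natural clusters are left and right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = run_k_means(points, [(1.0, 0.5), (3.0, 0.5)])
bad = run_k_means(points, [(2.0, 0.0), (2.0, 1.0)])

print(total_squared_distance(*good))  # left/right split
print(total_squared_distance(*bad))   # top/bottom split, much larger
```

With the second start, each centroid already sits at the mean of the points nearest to it, so neither step changes anything and the algorithm stays stuck on the top/bottom split, with a total squared distance of 16 instead of 1.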

# Color it!

Holiday season is upon us and even the Orange team is in a festive mood. This is why we made a Color widget!

This fascinating artsy widget will allow you to play with your data set in a new and exciting way. No more dull visualizations and default color schemes! Set your own colors the way YOU want them! Care for some magical cyan-to-magenta? Or do you prefer a more festive red-to-green? How about several shades of gray? Color widget is your go-to stop for all things color (did you notice it’s our only widget with a colorful icon?). 🙂

Coloring works with most visualization widgets, such as scatter plot, distributions, box plot, mosaic display and linear projection. Set the colors for discrete values and gradients for continuous values in this widget, and the same palettes will be used in all downstream widgets. As a bonus, the Color widget also allows you to edit the names of variables and values.

Remember – the (blue) sky is the limit.

# Hubbing with the Hub widget

So you have painted two data sets and loaded another one from a file, and now you are testing predictions of logistic regression, classification trees and SVM on them? Tired of having to reconnect the Paint Data widget and the File widget back and forth whenever you switch between them?

Say no more! Look no further! Here is the new Hub widget!

Hub widget is the most versatile widget available so far. It accepts several inputs of any type and outputs them to as many other widgets as you want.

The Hub widget treats all types with the strictest equality.

(It also adheres to all applicable EU policies with respect to gender equality, and does not use cookies.)

The Hub widget works like a charm and is like the amazing cast-to-void-and-back-to-anything idiom in C. This strongful MacGyver of widgets can (almost) convert a classification tree into data, or a preprocessor into experimental results, without ever touching the data. With its amazing capabilities, the Hub widget has the potential to cause even greater havoc in your workflows than the famous Merge Data widget.