Pythagorean Trees and Forests

Classification Trees are great, but what happens when they overgrow even your 27” screen? Can we make the tree fit snugly onto the screen and still tell the whole story? Well, yes we can.

The Pythagorean Tree widget shows you the same information as Classification Tree, but far more concisely. Pythagorean Trees represent nodes with squares whose area is proportional to the number of covered training instances. When the data is split into two subsets, the corresponding new squares form a right triangle on top of the parent square; hence the name Pythagorean Tree. Every square is colored by the prevalent class, with opacity indicating the relative proportion of the majority class in the subset. Details are shown in hover balloons.
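
The geometry follows directly from the Pythagorean theorem. Here is a minimal sketch in plain Python (the child_sides helper is hypothetical, for illustration only) of how the child squares’ sides can be derived so that their areas stay proportional to subset sizes:

```python
from math import sqrt

def child_sides(parent_side, n_left, n_right):
    # Keep each square's *area* proportional to the number of covered
    # instances: with p = n_left / (n_left + n_right), the children get
    # areas p * a^2 and (1 - p) * a^2, so their sides satisfy
    # left^2 + right^2 == parent_side^2 -- Pythagoras.
    p = n_left / (n_left + n_right)
    return parent_side * sqrt(p), parent_side * sqrt(1 - p)

# A node of side 10 covering 100 instances, split into 64 and 36:
left, right = child_sides(10.0, 64, 36)   # -> (8.0, 6.0); 8^2 + 6^2 == 10^2
```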

ClassificationTree
Classification Tree with the titanic.tab data set.


PythagoreanTree
Pythagorean Tree with the titanic.tab data set.


When you hover over a square in Pythagorean Tree, the entire lineage of parent and child nodes is highlighted. Clicking a node outputs the selected subset, just like in Classification Tree.

PythagoreanTree2
Upon hovering over a square in the tree, its lineage (parent and child nodes) is highlighted. Hovering also displays information on the subset represented by the square. The widget outputs the selected subset.


Another amazing addition to Orange’s visualization set is Pythagorean Forest, a visualization of the Random Forest algorithm. Random Forest draws bootstrap samples: from a data set with N instances, it takes N samples with replacement. A tree is then grown for each sample, and aggregating many such trees alleviates the single Classification Tree’s tendency to overfit the data. Pythagorean Forest is a concise visualization of Random Forest, with each Pythagorean Tree plotted side by side.
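
If you want to see the sampling idea in code, here is a minimal sketch of bootstrap aggregation using scikit-learn’s DecisionTreeClassifier; it illustrates the general technique, not Orange’s exact implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

forest = []
for _ in range(10):                     # number of trees in the forest
    # Bootstrap sample: N draws with replacement from N instances
    idx = rng.integers(0, len(X), size=len(X))
    forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# The forest predicts by majority vote over its trees
votes = np.stack([tree.predict(X) for tree in forest])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```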

PythagoreanForest
Different trees are grown side by side. Parameters for the algorithm are set in the Random Forest widget; the whole forest is then sent to Pythagorean Forest for visualization.


This makes Pythagorean Forest a great tool for explaining how Random Forest works, or for further exploring each individual tree in the Pythagorean Tree widget.

schema-pythagora

Pythagorean trees are a new addition to Orange. Their implementation was inspired by the paper Generalized Pythagoras Trees for Visualizing Hierarchies by Fabian Beck, Michael Burch, Tanja Munz, Lorenzo Di Silvestro and Daniel Weiskopf, presented at the 5th International Conference on Information Visualization Theory and Applications in 2014.

All I See is Silhouette

Silhouette plot is such a nice method for visually assessing cluster quality and the degree of cluster membership that we simply couldn’t wait to get it into Orange3. And now we have.

This visualization compares, for each data instance, its average distance to the instances within its own cluster with its average distance to the instances in the nearest other cluster. A silhouette close to 1 indicates that the data instance lies well within its cluster, far from the neighboring clusters. Instances with silhouette scores close to 0 are on the border between two clusters. The overall quality of a clustering can be assessed by the average silhouette score of the data instances, but here we are more interested in the individual silhouettes and their visualization in the silhouette plot.
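
For reference, the standard per-instance silhouette can be written out in a few lines of numpy; this is a minimal sketch of the textbook definition, not Orange’s implementation:

```python
import numpy as np

def silhouette(i, X, labels):
    # s(i) = (b - a) / max(a, b), where a is the mean distance from i to
    # the other members of its own cluster and b is the mean distance to
    # the members of the nearest other cluster.
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[own].mean()
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)
```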

Using the good old iris data set, we are going to assess the silhouettes for each of the data instances. In k-Means we set the number of clusters to 3 and send the data to Silhouette Plot. Good clusters should consist of instances with high silhouette scores, but here we do the opposite: we select instances with scores close to 0 from the silhouette plot and pass them to other widgets for exploration. No surprise, they lie on the periphery between two clusters, as the scatter plot demonstrates perfectly.
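
The same selection can be reproduced in a script. Here is a sketch using scikit-learn; the 0.1 threshold for “close to 0” is our arbitrary choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples

X, _ = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = silhouette_samples(X, clusters)
borderline = scores < 0.1               # instances on the border of two clusters
print(f"{borderline.sum()} borderline instances selected")
```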

silhouette4

Let’s do something wild now. We’ll compute the silhouette on the class attribute of iris (no clustering here, just using the original class values from the data set). Here is our hypothesis: the data instances with low silhouette values are also those that will be misclassified by some learning algorithm. Say, by a random forest.

silhouette1

We use ten-fold cross validation in Test & Score, send the evaluation results to Confusion Matrix, and select the misclassified instances in that widget. Then we use the Venn Diagram to explore how many of these misclassifications fall into the set of low-silhouette instances. The agreement between the two techniques (the intersection in the Venn diagram) is quite high.
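
A scripted equivalent of this workflow might look as follows (again scikit-learn, with the same hypothetical 0.1 cut-off for “low silhouette”):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_samples
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# Ten-fold cross-validated predictions, as in Test & Score
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=10)
misclassified = set(np.flatnonzero(pred != y))

# Silhouettes computed on the class attribute, as above
low_silhouette = set(np.flatnonzero(silhouette_samples(X, y) < 0.1))

# The intersection corresponds to the overlap in the Venn diagram
print(len(misclassified & low_silhouette), "instances appear in both sets")
```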

silhouette3

Finally, we can observe these instances in the Scatter Plot. Classifiers indeed have problems with borderline data instances. Our hypothesis was correct.

silhouette4

Silhouette plot is yet another great visualization that can help you with data analysis or with understanding certain machine learning concepts. What did we say? Fruitful and fun!