Pythagorean Trees and Forests

Classification Trees are great, but what happens when they overgrow even your 27” screen? Can we make the tree fit snugly onto the screen and still tell the whole story? Well, yes we can.

The Pythagorean Tree widget shows you the same information as Classification Tree, but far more concisely. Pythagorean Trees represent nodes with squares whose size is proportional to the number of covered training instances. Once the data is split into two subsets, the corresponding new squares form a right triangle on top of the parent square, hence the name Pythagorean Tree. Every square has the color of the prevalent class, with opacity indicating the relative proportion of the majority class in the subset. Details are shown in hover balloons.
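
To see why the name fits, here is a minimal sketch of the geometry (illustrative code, not Orange's actual drawing routine): if the squares' areas are kept proportional to the subset sizes, the parent and child squares automatically satisfy the Pythagorean identity.

import math

def child_sides(parent_side, n1, n2):
    # keep areas proportional to subset sizes: b*b + c*c == a*a,
    # so the two child squares fit exactly onto a right triangle
    # erected on top of the parent square
    a2 = parent_side ** 2
    b = math.sqrt(a2 * n1 / float(n1 + n2))
    c = math.sqrt(a2 * n2 / float(n1 + n2))
    return b, c

print child_sides(1.0, 3, 1)    # (0.866..., 0.5); 0.75 + 0.25 == 1.0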

[Figure: Classification Tree with the titanic.tab data set.]

[Figure: Pythagorean Tree with the titanic.tab data set.]

When you hover over a square in a Pythagorean Tree, the whole lineage of parent and child squares/nodes is highlighted. Clicking on a square/node outputs the selected subset, just like in Classification Tree.

[Figure: Hovering over a square highlights its lineage (parent and child nodes) and displays information on the subset represented by the square; clicking outputs the selected subset.]

Another amazing addition to Orange’s Visualization set is Pythagorean Forest, a visualization of the Random Forest algorithm. Random Forest draws N samples, with replacement, from a data set of N instances. A tree is then grown for each sample, and aggregating over many such trees alleviates a single Classification Tree’s tendency to overfit the data. Pythagorean Forest is a concise visualization of Random Forest, with each Pythagorean Tree plotted side by side.
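
The bootstrap sampling step is simple enough to sketch in plain Python (illustrative code, not Orange's internals):

import random

def bootstrap(data):
    # draw N instances from a data set of N instances, with
    # replacement: some instances repeat, some are left out
    n = len(data)
    return [data[random.randrange(n)] for _ in range(n)]

# each tree in the forest is grown on its own bootstrap sample
samples = [bootstrap(range(10)) for _ in range(3)]
print samples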

[Figure: Different trees are grown side by side. Parameters for the algorithm are set in the Random Forest widget; the whole forest is then sent to Pythagorean Forest for visualization.]

This makes Pythagorean Forest a great tool for explaining how Random Forest works, and each tree can be further explored in the Pythagorean Tree widget.

[Figure: workflow schema.]

Pythagorean trees are a new addition to Orange. Their implementation was inspired by the paper Generalized Pythagoras Trees for Visualizing Hierarchies by Fabian Beck, Michael Burch, Tanja Munz, Lorenzo Di Silvestro and Daniel Weiskopf, presented at the 5th International Conference on Information Visualization Theory and Applications in 2014.

Random decisions behind your back

When Orange builds a decision tree, candidate attributes are evaluated and the best candidate is chosen. But what if two or more share first place? Most machine learning systems don’t care and simply take the first, which is unfair and, besides, has strange effects: the induced model and, consequently, its accuracy depend on the order of the attributes. Which they shouldn’t.

This is not an isolated problem. Another instance is a classifier that has to choose between two equally probable classes when there is no additional information (such as classification costs) to help make the prediction. Or selecting random reference examples when computing ReliefF. Or returning the mode of a distribution with two or more competing values…

The old solution was to make a random selection in such cases: take a random class (out of the most probable ones, of course), a random attribute, random examples… Although theoretically correct, this comes at a price: the only way to ensure repeatability of experiments is to set the global random generator, which is not good practice in component-based systems.

What Orange does now is more cunning. When choosing between n equally probable classes, for instance, Orange computes something like a hash value from the example being classified. Its remainder at division by n is then used to select the class. The class is thus “random”, but always the same for the same example.

A similar trick is used elsewhere. To choose among equally scored attributes when building a tree, Orange simply divides the number of learning examples at that node by the number of candidate attributes and again uses the remainder.
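
Both tricks fit in a few lines of illustrative Python (the hash below stands in for whatever Orange actually computes):

def pick_class(example, tied_classes):
    # hash the example; the remainder picks among tied classes,
    # so the choice is arbitrary but repeatable for this example
    return tied_classes[hash(tuple(example)) % len(tied_classes)]

def pick_attribute(n_node_examples, tied_attributes):
    # same idea for attributes: the node's example count modulo
    # the number of tied candidates selects the attribute
    return tied_attributes[n_node_examples % len(tied_attributes)]

print pick_class((5.1, 3.5, 1.4, 0.2), ["Iris-setosa", "Iris-versicolor"])
print pick_attribute(42, ["petal length", "petal width", "sepal length"])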

When more random numbers are needed, for instance to select m random reference examples for computing ReliefF, the number of examples is used as the seed for a temporary random generator.
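
A sketch of this, assuming the data behaves like a plain list:

import random

def reference_examples(data, m):
    # a temporary generator seeded with the data size yields a
    # repeatable "random" choice of m reference examples
    rng = random.Random(len(data))
    return rng.sample(data, m)

print reference_examples(range(100), 5)   # same five items on every run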

To conclude: Orange will sometimes make decisions that look random. The reason is that this is fairer than what most machine learning systems do, which is to pick the first (or the last) candidate. But whatever decision is taken, it will be the same if you run the program twice. The message is thus: be aware that this is happening, but don’t worry about it.

Faster classification and regression trees

SimpleTreeLearner is an implementation of classification and regression trees that sacrifices flexibility for speed. A benchmark on 42 different datasets reveals that SimpleTreeLearner is 11 times faster than the original TreeLearner.

The motivation behind developing a new tree induction algorithm from scratch was to speed up the construction of random forests, but you can also use it as a standalone learner. SimpleTreeLearner uses gain ratio for classification and MSE for regression, and can handle unknown values.
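
For the classification criterion, here is a sketch of gain ratio computed from class counts (our own helper functions, not SimpleTreeLearner’s internals):

import math

def entropy(counts):
    # Shannon entropy of a class distribution given as counts
    total = float(sum(counts))
    return -sum(c / total * math.log(c / total, 2) for c in counts if c)

def gain_ratio(parent_counts, branch_counts):
    # information gain of a split, normalized by the entropy of
    # the split itself; branch_counts lists class counts per branch
    total = float(sum(parent_counts))
    gain = entropy(parent_counts) - sum(
        sum(b) / total * entropy(b) for b in branch_counts)
    split_info = entropy([sum(b) for b in branch_counts])
    return gain / split_info if split_info else 0.0

# a node with 8 + 4 instances split into two branches
print gain_ratio([8, 4], [[7, 1], [1, 3]])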

Comparison with TreeLearner

The graph below shows SimpleTreeLearner construction times on datasets bundled with Orange normalized to TreeLearner. Smaller is better.

[Figure: SimpleTreeLearner construction times normalized to TreeLearner; smaller is better.]

The harmonic mean (the average speedup) across all benchmarks is 11.4.

Usage

The user can set four parameters:

maxMajority
Maximal proportion of the majority class.
minExamples
Minimal number of examples in leaves.
maxDepth
Maximal depth of the tree.
skipProb
At every split, each candidate attribute is skipped with probability skipProb. This parameter is especially useful for building random forests; see the sketch at the end of this section.

The code snippet below demonstrates the basic usage of SimpleTreeLearner. It behaves much like any other Orange learner would.

import Orange

data = Orange.data.Table("iris")

# build classifier and classify train data
classifier = Orange.classification.tree.SimpleTreeLearner(data, maxMajority=0.8)
for ex in data:
    print classifier(ex)

# estimate classification accuracy with cross-validation
learner = Orange.classification.tree.SimpleTreeLearner(minExamples=2)
result = Orange.evaluation.testing.cross_validation([learner], data)
print 'CA:', Orange.evaluation.scoring.CA(result)[0]
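
Since skipProb exists mainly to randomize the trees of a forest, here is a sketch that plugs SimpleTreeLearner into Orange’s random forest; it assumes the Orange 2.x RandomForestLearner and its base_learner argument:

import Orange

data = Orange.data.Table("iris")

# each split skips some candidate attributes at random, so the
# trees in the forest differ from one another
base = Orange.classification.tree.SimpleTreeLearner(skipProb=0.5)
forest = Orange.ensemble.forest.RandomForestLearner(base_learner=base, trees=50)

result = Orange.evaluation.testing.cross_validation([forest], data)
print 'CA:', Orange.evaluation.scoring.CA(result)[0]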