It’s Sailing Time (Again)

Every fall I teach an Introduction to Data Mining course. And while the course is really about statistical learning and its applications, I also venture into classification trees, for several reasons. First, I can introduce information gain and, with it, feature scoring and ranking. Second, classification trees are one of the first machine learning approaches, invented independently by computer scientists (Ross Quinlan) and statisticians (Leo Breiman, Jerome Friedman, Charles J. Stone, Richard A. Olshen). And finally, they form the basis of random forests, one of the most accurate machine learning models for small and mid-sized data sets.

Related: Introduction to Data Mining Course in Houston

A lecture on classification trees has to start with the data. Years back, I crafted a data set on sailing. Every data set has to have a story. Here is this one's:

Sara likes weekend sailing, though not under just any conditions. For
the past twenty Wednesdays I have asked her whether she would have any
company and what kind of boat she could rent, and I have checked the
weather forecast. Then, on Saturday, I wrote down whether she actually
went to sea.

The data on Sara's sailing contains three attributes (Outlook, Company, Sailboat) and a class (Sail).

The data set comes with Orange, and you can load it from the Data Sets widget (currently in the Prototypes add-on, but soon to be moved to core Orange). It takes time, usually two lecture hours, to go through probabilities, entropy, and information gain, but in the end, the data analysis workflow we develop with the students looks something like this:
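The widgets hide the arithmetic, so for the lecture it also helps to spell out entropy and information gain in a few lines of plain Python. The sketch below is minimal and self-contained; the function names and the handful of rows are mine, made up for illustration (the real twenty observations ship with Orange), not Orange's scoring API:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels))
                for c in counts.values())

def information_gain(rows, attribute, target="Sail"):
    """Drop in class entropy after splitting the rows on one attribute."""
    prior = entropy([r[target] for r in rows])
    posterior = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        posterior += len(subset) / len(rows) * entropy(subset)
    return prior - posterior

# A few made-up rows in the spirit of the data set.
rows = [
    {"Outlook": "sunny", "Company": "big",    "Sailboat": "small", "Sail": "yes"},
    {"Outlook": "rainy", "Company": "medium", "Sailboat": "small", "Sail": "no"},
    {"Outlook": "sunny", "Company": "no",     "Sailboat": "big",   "Sail": "yes"},
    {"Outlook": "rainy", "Company": "big",    "Sailboat": "big",   "Sail": "yes"},
]
for attribute in ("Outlook", "Company", "Sailboat"):
    print(attribute, round(information_gain(rows, attribute), 3))
```

The attribute with the highest information gain becomes the root of the tree, and the same scoring is then applied recursively within each branch.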

And here is the classification tree:

It turns out that Sara is a social person. When the company is big, she goes sailing no matter what. When the company is smaller, she will not go sailing if the weather is bad. But when it is sunny, sailing is fun, even when she is alone.

Related: Pythagorean Trees and Forests

Classification trees are not very stable classifiers: even small changes in the data can change the tree substantially. This is an important concept, and it eventually motivates ensembles such as random forests. It is also here, during my lecture, that I need to demonstrate the instability. I use Data Sampler and show the classification tree induced from the current sample; pressing the Sample Data button changes the tree every time. The workflow I use is below, but if you really want to see this in action, try it in Orange.
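The same demonstration works in a few lines of scripting. The sketch below is a stand-in using scikit-learn rather than the Orange widgets; the wine data, the 70% sampling rate, and the shallow depth are all illustrative choices. Refitting the tree on repeated samples typically shifts the printed splits from sample to sample:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
rng = np.random.default_rng(0)

for trial in range(3):
    # Draw a fresh 70% sample without replacement, as Data Sampler would.
    idx = rng.choice(len(wine.data), size=int(0.7 * len(wine.data)),
                     replace=False)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(wine.data[idx], wine.target[idx])
    print(f"--- sample {trial + 1} ---")
    print(export_text(tree, feature_names=list(wine.feature_names)))
```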

The Beauty of Random Forest

It is the time of the year when we adore Christmas trees. But these are not the only trees we at the Orange team think about. In fact, through the almost lifelong professional deformation of being a data scientist, when I think about trees I often think about classification and regression trees. And they can be beautiful as well, not only for their elegance in explaining hidden patterns, but aesthetically, when rendered in Orange. Even more beautiful than a single tree is Orange's rendering of a forest, that is, a random forest.

Related: Pythagorean Trees and Forests

Here are six trees in the random forest constructed on the housing data set:

The random forest for the annealing data set includes a set of smaller trees:

A Christmas-lit random forest inferred from the pen digits data set looks somewhat messy as it tries to categorize digits into ten different classes:
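The forests above were built in the Orange GUI, but a scripted stand-in is short. The sketch below uses scikit-learn; the six-tree count matches the housing forest above, while the California housing data merely substitutes for the classic housing set (an assumption on my part: any small regression data set would do just as well here):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Fit a six-tree forest; each estimator is one of the trees that a
# Pythagorean forest rendering would show as a separate fractal.
X, y = fetch_california_housing(return_X_y=True)
forest = RandomForestRegressor(n_estimators=6, random_state=0).fit(X, y)
for i, tree in enumerate(forest.estimators_):
    print(f"tree {i}: {tree.tree_.node_count} nodes, "
          f"depth {tree.tree_.max_depth}")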

The power of beauty! No wonder random forests are among the best machine learning tools. Orange renders them according to the idea of Fabian Beck and colleagues, who proposed Pythagoras trees for the visualization of hierarchies. The actual implementation of Pythagoras trees for classification and regression trees in Orange was created by Pavlin Policar.
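For readers curious about the geometry, here is a toy matplotlib sketch of the Pythagoras-tree construction. It illustrates the idea from Beck and colleagues, not Pavlin Policar's actual implementation: each node is drawn as a square, its two children grow from the top edge, and the split angle, fixed here at 45 degrees for a balanced fractal, would in a real rendering be derived from the relative sizes of the two subtrees:

```python
import numpy as np
import matplotlib.pyplot as plt

def rotation(angle):
    """2x2 counterclockwise rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def draw(ax, origin, side, angle, depth, max_depth, theta=np.pi / 4):
    """Draw one square, then grow two child squares from its top edge."""
    R = rotation(angle)
    local = np.array([[0, 0], [side, 0], [side, side], [0, side]])
    ax.add_patch(plt.Polygon(local @ R.T + origin,
                             facecolor=plt.cm.Greens(0.3 + 0.7 * depth / max_depth),
                             edgecolor="white", linewidth=0.3))
    if depth == max_depth:
        return
    # The apex splits the top edge so that the two child squares and the
    # parent's top edge form a right triangle.
    left_origin = origin + R @ np.array([0.0, side])
    apex = origin + R @ np.array([side * np.cos(theta) ** 2,
                                  side * (1 + np.sin(theta) * np.cos(theta))])
    draw(ax, left_origin, side * np.cos(theta), angle + theta,
         depth + 1, max_depth, theta)
    draw(ax, apex, side * np.sin(theta), angle + theta - np.pi / 2,
         depth + 1, max_depth, theta)

fig, ax = plt.subplots(figsize=(7, 5))
draw(ax, np.array([0.0, 0.0]), 1.0, 0.0, 0, max_depth=8)
ax.set_aspect("equal")
ax.axis("off")
ax.autoscale_view()
plt.show()
```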