Data Mining Course in Houston

We have just completed an Introduction to Data Mining, a graduate course at Baylor College of Medicine in Texas, Houston. The course was given in September and consisted of seven two-hour lectures, each one followed with a homework assignment. The course was attended by about 40 students and some faculty and research staff.


This was a challenging course. The audience was new to data mining, and we decided to teach them with the newest, third version of Orange. We also experimented with two course instructors (Blaz and Janez), who, instead of splitting the course into two parts, taught simultaneously, one on the board and the other one helping the students with hands-on exercises. To check whether this worked fine, we ran a student survey at the end of the course. We used Google Sheets and then examined the results with students in the class. Using Orange, of course.


And the outcome? Looks like the students really enjoyed the course


and the teaching style.


The course took advantage of several new widgets in Orange 3, including those for data preprocessing and polynomial regression. The core development team put a lot of effort during the summer to debug and polish this newest version of Orange. Also thanks to the financial support by AXLE EU FP7 and CARE-MI EU FP7 grants and grants by the Slovene Research agency, we were able to finish everything in time.

A visit from the Tilburg University

Biolab is currently hosting two amazing data scientists from the Tilburg University – dr. Marie Nilsen and dr. Eric Postma, who are preparing a 20-lecture MOOC on data science for non-technical audience. A part of the course will use Orange. The majority of their students is coming from humanities, law, economy and behavioral studies, thus we are discussing options and opportunities for adapting Orange for social scientists. Another great thing is that the course is designed for beginner level data miners, showcasing that anybody can mine the data and learn from it. And then consult with statisticians and data mining expert (of course!).

Biolab team with Marie and Eric, who is standing next to Ivan Cankar - the very serious guy in the middle.
Biolab team with Marie and Eric, who is standing next to Ivan Cankar – the very serious guy in the middle.


To honor this occasion we invite you to check out the Polynomial regression widget, which is specially intended for educational use. There, you can showcase the problem of overfitting through visualization.

First, we set up a workflow.


Then we paint, say, at most 10 points into the Paint Data widget. (Why at most ten? You’ll see later.)



Now we open our Polynomial Regression widget, where we play with polynomial degree. Polynomial Degree 1 gives us a line. With coefficient 2 we get a curve that fits only one point. However, with the coefficient 7 we fit all the points with one curve. Yay!





But hold on! The curve now becomes very steep. Would the lower end of the curve at about (0.9, -2.2) still be a realistic estimate of our data set? Probably not. Even when we look at the Data Table with coefficient values, they seem to skyrocket.



This is a typical danger of overfitting, which is often hard to explain, but with the help of these three widgets becomes as clear as day!
Now go out and share the knowledge!!!

Save your graphs!

If you are often working with Orange, you probably have noticed a small button at the bottom of most visualization widgets. “Save Graph” now enables you to export graphs, charts, and hierarchical trees to your computer and use them in your reports. Because people need to see it to believe it!

“Save Graph” will save visualizations to your computer.


Save Graph function is available in Paint Data, Image Viewer, all visualization widgets, and a few others (list is below).

Widgets with the “Save Graph” option.


You can save visualizations in .png, .dot or .svg format. However – brace yourselves – our team is working on something even better, which will be announced in the following weeks.

Hubbing with the Hub widget

So you have painted two data sets and loaded another one from a file, and now you are testing predictions of logistic regression, classification trees and SVM on it? Tired of having to reconnect the Paint data widget and the File widget back and forth whenever you switch between them?

Say no more! Look no further! Here is the new Hub widget!

Multiple file inputs


Hub widget is the most versatile widget available so far. It accepts several inputs of any type and outputs them to as many other widgets as you want.

The Hub widget treats all types with the strictest equality.

(It also adheres to all applicable EU policies with respect to gender equality, and does not use cookies.)

Diverse widget input

The Hub widget works like charm and is like the amazing cast-to-void-and-back-to-anything idiom in C. This strongful MacGyver of widgets can (almost) convert classification tree into data, or preprocessor into experimental results without ever touching the data. With its amazing capabilities, the Hub widget has the potential to cause an even greater havoc in your workflows than the famous Merge data widget.

Download, install – and start hubbing today !!


Updated Widget Documentation

Happy news for all passionate Orange users! We’ve uploaded documentation for our Orange 3 widget selection.


Right click and select "Help" or press F1.
Right click and select “Help” or press F1.


It’s easy to use. To learn more about a particular wigdet, click on the widget. Either use right click and select “Help” or press F1. A new window will open with a widget description and an example for its use. There are also screenshots included as visual help.


Widget documentation.
Widget documentation.


We are going to be updating documentation as the widgets continue to develop. Documentation for bioinformatics and data fusion add-ons is expected to be up and running in the following week.

Scatter Plot Projection Rank

One of the nicest and surely most useful visualization widgets in Orange is Scatter Plot. The widget displays a 2-D plot, where x and y-axes are two attributes from the data.

2-dimensional scatter plot visualization
2-dimensional scatter plot visualization


Orange 2.7 has a wonderful functionality called VizRank, that is now implemented also in Orange 3. Rank Projections functionality enables you to find interesting attribute pairs by scoring their average classification accuracy. Click ‘Start Evaluation’ to begin ranking.

Rank Projections before ranking is performed.
Rank Projections before ranking is performed.


The functionality will also instantly adapt the visualization to the best scored pair. Select other pairs from the list to compare visualizations.

Rank Projections once the attribute pairs are scored.
Rank Projections once the attribute pairs are scored.


Rank suggested petal length and petal width as the best pair and indeed, the visualization below is much clearer (better separated).

Scatter Plot once the visualization is optimized.
Scatter Plot once the visualization is optimized.


Have fun trying out this and other visualization widgets!

Classifying instances with Orange in Python

Last week we showed you how to create your own data table in Python shell. Now we’re going to take you a step further and show you how to easily classify data with Orange.

First we’re going to create a new data table with 10 fruits as our instances.

import Orange
from import *

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

domain = Domain([color, calories, fiber], class_vars=fruit)

data=Table(domain, [
["green", 4, 1.2, "apple"], 
["orange", 5, 1.1, "orange"],
["yellow", 4, 1.0, "peach"],
["orange", 4, 1.1, "orange"],
["yellow", 4, 1.1,"peach"],
["green", 5, 1.3, "apple"],
["green", 4, 1.3, "apple"],
["orange", 5, 1.0, "orange"],
["yellow", 4.5, 1.3, "peach"],
["green", 5, 1.0, "orange"]])


Now we have to select a model for classification. Among the many learners in Orange library, we decided to use the Tree Learner for this example. Since we’re dealing with fruits, we thought it’s only appropriate. :)

Let’s create a learning algorithm and use it to induce the classifier from the data.

tree_learner = Orange.classification.TreeLearner()
tree = tree_learner(data)

Now we can predict what variety a green fruit with 3.5 calories and 2g of fiber is with the help of our model. To do this, simply call the model and use a list of new data as argument.

print(tree(["green", 3.5, 2]))

Python returns index as a result:


To check the index, we can call class variable values with the corresponding index:


Final result:


You can use your own data set to see how this model works for different data types. Let us know how it goes! :)

Creating a new data table in Orange through Python



One of the first tasks in Orange data analysis is of course loading your data. If you are using Orange through Python, this is as easy as riding a bike:

import Orange
data =“iris”)
print (data)

This will return a neat data table of the famous Iris data set in the console.




What if you want to create your own data table from scratch? Even this is surprisingly simple. First, import the Orange data library.

from import *


Set all the attributes you wish to see in your data table. For discrete attributes call DiscreteVariable and set the name and the possible values, while for a continuous variable call ContinuousVariable and set only the attribute name.

color = DiscreteVariable(“color”, values=[“orange”, “green”, “yellow”])

calories = ContinuousVariable(“calories”)

fiber = ContinuousVariable(“fiber”)]

fruit = DiscreteVariable("fruit”, values=[”orange", “apple”, “peach”])


Then set the domain for your data table. See how we set class variable with class_vars?

domain = Domain([color, calories, fiber], class_vars=fruit)


Time to input your data!

data = Table(domain, [

[“green”, 4, 1.2, “apple”],

["orange", 5, 1.1, "orange"],

["yellow", 4, 1.0, "peach"]])


And now print what you have created!



One final step:, "")


Your data is safely stored to your computer (in the Python folder)! Good job!

Datasets in Orange Bioinformatics Add-On

As you might know, Orange comes with several basic widget sets pre-installed. These allow you to upload and explore the data, visualize them, learn from them and make predictions. However, there are also some exciting add-ons available for installation. One of these is a bioinformatics add-on, which is our specialty.


Bioinformatics widget set allows you to pursue complex analysis of gene expression by providing access to several external libraries. There are four widgets intended specifically for this – dictyExpress, GEO Data Sets, PIPAx and GenExpress. GEO Data Sets are sourced from NCBI, PIPAx and dictyExpress from two Biolab projects, and finally GenExpress from Genialis. A lot of the data is freely accessible, while you will need a user account for the rest.

Once you open the widget, select the experiments you wish to use for your analysis and view it in the Data Table widget. You can compare these experiments in Data Profiles, visualize them in Volcano Plot, select the most relevant genes in Differential Expression widget and much more.


Three widgets with experiment data libraries.


These databases enable you to start your research just by installing the bioinformatics add-on (Orange → Options → Add-ons…). The great thing is you can easily combine bioinformatics widgets with the basic pre-installed ones. What an easy way to immerse yourself in the exciting world of bioinformatics!

Visualizing Misclassifications

In data mining classification is one of the key methods for making predictions and gaining important information from our data. We would, for example, use classification for predicting which patients are likely to have the disease based on a given set of symptoms.

In Orange an easy way to classify your data is to select several classification widgets (e.g. Naive Bayes, Classification Tree and Linear Regression), compare the prediction quality of each learner with Test Learners and Confusion Matrix and then use the best performing classifier on a new data set for classification. Below we use Iris data set for simplicity, but the same procedure works just as well on all kinds of data sets.

Here we have three confusion matrices for Naive Bayes (top), Classification Tree (middle) and Logistic Regression (bottom).


Three misclassification matrices (Naive Bayes, Classification Tree and Logistic Regression)
Three misclassification matrices (Naive Bayes, Classification Tree and Logistic Regression)


We see that Classification Tree did the best with only 9 misclassified instances. To see which instances were assigned a false class, we select ‘Misclassified’ option in the widget, which highlights misclassifications and feeds them to the Scatter Plot widget. In the graph we thus see the entire data set presented with empty dots and the selected misclassifications with full dots.

Visualization of misclassified instances in scatter plot.
Visualization of misclassified instances in scatter plot.


Feel free to switch between learners in Confusion Matrix to see how the visualization changes for each of them.