Ok, we’ve recently stumbled across an interesting article on how to deal with non-normal (non-Gaussian) data.

We have an absolutely paranormal data set of 20 persons with weight, height, paleness, vengefulness, habitation and age attributes (download).


Let’s check the distribution in the Distributions widget.


Our first attribute is “Weight” and we see a little hump on the left. Otherwise the data would be normally distributed. Ok, so perhaps we have a few children in the data set. Let’s check the age distribution.

Whoa, what? Why is the hump now on the right? These distributions look scary. We seem to have a few reaaaaally old people here. What is going on? Perhaps we can figure this out with MDS. This widget projects the data into two dimensions so that the distances between the points correspond to differences between the data instances.


Aha! Now we see that three instances are quite different from all others. Select them and send them to the Data Table for final inspection.


Busted! We have found three ghosts hiding in our data. They are extremely light (the sheet they are wearing must weigh around 2 kg), quite vengeful and old.

Now, joke aside, what would this mean for general non-normally distributed data? One possibility is that your data set is too small. Here we have only 20 instances, so 3 outlying ghosts have a great impact on the distribution. It is difficult to hide 3 ghosts among 17 normal persons.

Secondly, why can’t we use the Outliers widget to hunt for those ghosts? Again, our data set is too small. With just 20 instances, the estimation variance is so large that it can easily cover a few ghosts under its sheet. We don’t have enough “normal” data to define what is normal and thus detect the paranormal.

Haven’t we just written two exactly opposite things? Perhaps.

Happy Halloween everybody! :)


SQL for Orange

We bet you’ve always wanted to use your SQL data in Orange, but you might not be quite sure how to do it. Don’t worry, we’re coming to the rescue.

The key to accessing SQL databases is the installation of the ‘psycopg2’ library in Python.



Go to this website and download the psycopg2 package. Once the .whl file has downloaded, open a command prompt in the download directory, then enter “pip install [file name]” and run it.
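For example, if the downloaded file were named psycopg2-2.6.1-cp34-none-win_amd64.whl (the exact name depends on the psycopg2 version, your Python version and your architecture), you would run:

pip install psycopg2-2.6.1-cp34-none-win_amd64.whl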



If you’re on Mac or Linux, install psycopg2 with pip:
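pip install psycopg2

(Depending on your system, you may need sudo and the PostgreSQL development headers for the build to succeed.)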



Upon opening Orange, you will be able to see a lovely new icon – SQL Table. Then just connect to your server and off you go!


Learners in Python

We’ve already written about classifying instances in Python. However, it’s always nice to have a comprehensive list of classifiers and a step-by-step procedure at hand.



We start by importing the Orange module into Python and loading our data set.

>>> import Orange
>>> data ="titanic")

We are using the ‘titanic’ data set. You can load any data set you want, but it does have to have a categorical class variable (for numeric targets, use regression). Now we want to train our classifier.
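As a quick sanity check that your data set qualifies, you can ask the domain whether the class is categorical (has_discrete_class is a property of the domain in Orange 3):

>>> data.domain.has_discrete_class
True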

>>> learner = Orange.classification.LogisticRegressionLearner()
>>> classifier = learner(data)
>>> classifier(data[0])

Python returns the index of the value, as usual.
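For the first Titanic instance the output looks something like this (the exact formatting varies between Orange versions; some return a plain 0.0):

array([ 0.])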


To check what’s in the class variable we print:

>>> print("Name of the variable:",
>>> print("Class values:", data.domain.class_var.values)
>>> print("Value of our instance:", data.domain.class_var.values[0])

Name of the variable: survived
Class values: no, yes
Value of our instance: no



If you want to get predictions for the entire data set, just give the classifier the entire data set.

>>> classifier(data)

array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

If we want to append predictions to the data table, we first use the classifier on the data, then create a new domain with an additional meta attribute, and finally form a new data table with the predictions appended:

predictions = classifier(data)

new_domain =, data.domain.class_vars, [data.domain.class_var])

table2 =, data.X, data.Y, predictions.reshape(-1, 1))

We use .reshape(-1, 1) to turn the flat vector of predictions into a single column that can be attached as a meta attribute. Then we print out the data.
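For example, to peek at the first few rows of table2 together with the appended predictions:

for row in table2[:3]:
    print(row)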




Want to use another classifier? The procedure is the same; simply use a different learner, for example:
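learner = Orange.classification.TreeLearner()
classifier = learner(data)

Every learner in Orange.classification follows the same learner(data) pattern.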


For most classifiers, you can set a whole range of parameters. Logistic Regression, for example, uses the following:

learner = Orange.classification.LogisticRegressionLearner(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, preprocessors=None)

To check the parameters for the classifier, use:
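For example, Python’s built-in help shows the learner’s signature and documentation:

help(Orange.classification.LogisticRegressionLearner)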




Another thing you can check with classifiers is the predicted class probabilities.

>>> classifier(data[0], Orange.classification.Model.ValueProbs)

(array([ 0.]), array([[ 1.,  0.]]))

The first array is the value for your selected instance (data[0]), while the second array contains probabilities for class values (probability for ‘no’ is 1 and for ‘yes’ 0).



And because we care about you, here’s a quick way to get a full list of classifier names:
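This sketch prints every learner exported by Orange.classification (in a recent Orange 3 that includes LogisticRegressionLearner, NaiveBayesLearner, TreeLearner, RandomForestLearner and MajorityLearner, among others):

import Orange

learner_names = [name for name in dir(Orange.classification)
                 if name.endswith("Learner")]
for name in sorted(learner_names):
    print(name)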
For other learners, you can find all the parameters and descriptions in the documentation.


Data Mining Course in Houston

We have just completed an Introduction to Data Mining, a graduate course at Baylor College of Medicine in Houston, Texas. The course was given in September and consisted of seven two-hour lectures, each followed by a homework assignment. It was attended by about 40 students and some faculty and research staff.


This was a challenging course. The audience was new to data mining, and we decided to teach them with the newest, third version of Orange. We also experimented with two course instructors (Blaz and Janez) who, instead of splitting the course into two parts, taught simultaneously: one at the board, the other helping the students with hands-on exercises. To check whether this worked, we ran a student survey at the end of the course. We collected responses with Google Sheets and then examined the results with the students in class. Using Orange, of course.


And the outcome? Looks like the students really enjoyed the course


and the teaching style.


The course took advantage of several new widgets in Orange 3, including those for data preprocessing and polynomial regression. The core development team put in a lot of effort over the summer to debug and polish this newest version of Orange. Thanks also to the financial support of the AXLE and CARE-MI EU FP7 grants and grants from the Slovenian Research Agency, we were able to finish everything in time.

A visit from Tilburg University

Biolab is currently hosting two amazing data scientists from Tilburg University, Dr. Marie Nilsen and Dr. Eric Postma, who are preparing a 20-lecture MOOC on data science for a non-technical audience. A part of the course will use Orange. The majority of their students come from humanities, law, economics and behavioral studies, so we are discussing options and opportunities for adapting Orange for social scientists. Another great thing is that the course is designed for beginner-level data miners, showcasing that anybody can mine data and learn from it. And then consult with statisticians and data mining experts (of course!).

Biolab team with Marie and Eric, who is standing next to Ivan Cankar – the very serious guy in the middle.


To honor this occasion, we invite you to check out the Polynomial Regression widget, which is intended especially for educational use. There, you can showcase the problem of overfitting through visualization.

First, we set up a workflow.


Then we paint, say, at most 10 points into the Paint Data widget. (Why at most ten? You’ll see later.)



Now we open our Polynomial Regression widget, where we play with the polynomial degree. Degree 1 gives us a line. With degree 2 we get a curve that fits only one point. However, with degree 7 we fit all the points with one curve. Yay!





But hold on! The curve has now become very steep. Would the lower end of the curve, at about (0.9, -2.2), still be a realistic estimate for our data set? Probably not. Even the coefficient values in the Data Table skyrocket.
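If you would like to reproduce the effect outside the widgets, here is a minimal sketch using numpy (the painted points are made up; np.polyfit stands in for the Polynomial Regression widget):

import numpy as np

# a handful of slightly noisy points, similar to what we painted
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 8)
y = x + rng.normal(scale=0.1, size=x.size)

# a modest fit versus a degree-7 fit that interpolates all 8 points
for degree in (1, 2, 7):
    coeffs = np.polyfit(x, y, degree)
    print("degree", degree, "-> largest |coefficient|:", round(abs(coeffs).max(), 1))

The degree-7 fit passes through every point, and its coefficients blow up, just like the ones in the Data Table above.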



This is a typical danger of overfitting, which is often hard to explain, but with the help of these three widgets it becomes as clear as day!
Now go out and share the knowledge!!!

Save your graphs!

If you work with Orange often, you have probably noticed a small button at the bottom of most visualization widgets. “Save Graph” now enables you to export graphs, charts, and hierarchical trees to your computer and use them in your reports. Because people need to see it to believe it!

“Save Graph” will save visualizations to your computer.


The Save Graph function is available in Paint Data, Image Viewer, all visualization widgets, and a few others (the list is below).

Widgets with the “Save Graph” option.


You can save visualizations in .png, .dot or .svg format. However – brace yourselves – our team is working on something even better, which will be announced in the following weeks.

Hubbing with the Hub widget

So you have painted two data sets and loaded another one from a file, and now you are testing predictions of logistic regression, classification trees and SVM on it? Tired of having to reconnect the Paint Data widget and the File widget back and forth whenever you switch between them?

Say no more! Look no further! Here is the new Hub widget!

Multiple file inputs


The Hub widget is the most versatile widget available so far. It accepts several inputs of any type and outputs them to as many other widgets as you want.

The Hub widget treats all types with the strictest equality.

(It also adheres to all applicable EU policies with respect to gender equality, and does not use cookies.)

Diverse widget input

The Hub widget works like a charm and is like the amazing cast-to-void-and-back-to-anything idiom in C. This mighty MacGyver of widgets can (almost) convert a classification tree into data, or a preprocessor into experimental results, without ever touching the data. With its amazing capabilities, the Hub widget has the potential to cause even greater havoc in your workflows than the famous Merge Data widget.

Download, install – and start hubbing today!!


Updated Widget Documentation

Happy news for all passionate Orange users! We’ve uploaded documentation for our Orange 3 widget selection.


Right click and select “Help” or press F1.


It’s easy to use. To learn more about a particular widget, select it, then right-click and choose “Help” or press F1. A new window will open with a widget description and an example of its use. Screenshots are also included as visual help.


Widget documentation.


We will keep updating the documentation as the widgets continue to develop. Documentation for the bioinformatics and data fusion add-ons is expected to be up and running in the following week.

Scatter Plot Projection Rank

One of the nicest and surely most useful visualization widgets in Orange is the Scatter Plot. The widget displays a 2-D plot where the x and y axes are two attributes from the data.

2-dimensional scatter plot visualization


Orange 2.7 had a wonderful functionality called VizRank, which is now also implemented in Orange 3. The Rank Projections functionality enables you to find interesting attribute pairs by scoring their average classification accuracy. Click ‘Start Evaluation’ to begin ranking.

Rank Projections before ranking is performed.


The functionality will also instantly adapt the visualization to the best-scored pair. Select other pairs from the list to compare visualizations.

Rank Projections once the attribute pairs are scored.


Ranking suggested petal length and petal width as the best pair, and indeed, the visualization below is much clearer (the classes are better separated).

Scatter Plot once the visualization is optimized.


Have fun trying out this and other visualization widgets!

Classifying instances with Orange in Python

Last week we showed you how to create your own data table in Python shell. Now we’re going to take you a step further and show you how to easily classify data with Orange.

First we’re going to create a new data table with 10 fruits as our instances.

import Orange
from import *

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

domain = Domain([color, calories, fiber], class_vars=fruit)

data = Table(domain, [
    ["green", 4, 1.2, "apple"],
    ["orange", 5, 1.1, "orange"],
    ["yellow", 4, 1.0, "peach"],
    ["orange", 4, 1.1, "orange"],
    ["yellow", 4, 1.1, "peach"],
    ["green", 5, 1.3, "apple"],
    ["green", 4, 1.3, "apple"],
    ["orange", 5, 1.0, "orange"],
    ["yellow", 4.5, 1.3, "peach"],
    ["green", 5, 1.0, "orange"]])


Now we have to select a model for classification. Among the many learners in the Orange library, we decided to use the Tree Learner for this example. Since we’re dealing with fruits, we thought it only appropriate. :)

Let’s create a learning algorithm and use it to induce the classifier from the data.

tree_learner = Orange.classification.TreeLearner()
tree = tree_learner(data)

Now, with the help of our model, we can predict the variety of a green fruit with 3.5 calories and 2 g of fiber. To do this, simply call the model with a list describing the new instance as the argument.

print(tree(["green", 3.5, 2]))

Python returns the index as a result:
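With the fruit data above, the model should return index 1 (this output is a reconstruction; the exact formatting depends on your Orange version):

[ 1.]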


To check the index, we can call class variable values with the corresponding index:
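print(domain.class_var.values[1])  # the value at index 1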


Final result:
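apple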


You can use your own data set to see how this model works for different data types. Let us know how it goes! :)