10 Tips and Tricks for Using Orange

TIP #1: Follow tutorials and example workflows to get started.

It’s difficult to start using new software. Where does one start, especially as a total novice in data mining? For this exact reason we’ve prepared Getting Started With Orange – YouTube tutorials for complete beginners. Example workflows, on the other hand, can be accessed via Help – Examples.

 

TIP #2: Make use of Orange documentation.

You can access it in three ways:

  1. Press F1 while a widget is selected. This will open the help screen.
  2. Select Widget – Help when the widget is selected. It works the same as above.
  3. Visit the online documentation.

 

TIP #3: Embed your help screen.

Drag and drop the help screen to the side of your Orange canvas and it will become embedded there. You can also make it narrower, allowing for a full-size analysis while exploring the docs.

[Image: help screen embedded in the canvas]

 

TIP #4: Use right-click.

Right-click on the canvas and a widget menu will appear. Start typing the name of the widget you’re looking for and press Enter when it becomes the top suggestion. This will place the widget onto the canvas immediately. You can also navigate the menu with the up and down arrow keys.

 

TIP #5: Turn off channel names.

Sometimes it is annoying to see channel names above widget links. If you’re already comfortable using Orange, you can turn them off in Options – Settings by unchecking ‘Show channel names between widgets’.

 

TIP #6: Hide control pane.

Once you’ve set the parameters, you’ll probably want to focus just on the visualizations. There’s a simple way to do this in Orange. Click on the split between the control pane and the visualization pane – you should see a hand appear instead of the cursor. Click and observe how the control pane gets hidden away neatly. To make it reappear, click the split again.

[Images: control pane shown and hidden]

 

TIP #7: Label your data.

So you’ve plotted your data, but have no idea what you’re seeing. Use annotation! In some widgets the relevant drop-down menu is called Annotation, while in others it is called Label. This will mark your data points with suitable labels, making your MDS plots and Scatter Plots much more informative. Scatter Plot also lets you label only the selected points for better clarity.

 

TIP #8: Find your plot.

Scrolled around and lost the plot? Zoomed in too much? To re-position the plot, click ‘Reset zoom’ and the visualization will jump snugly into the visualization pane. This comes in handy when browsing subsets and trying to see the bigger picture every now and then.

[Image: resetting zoom to re-position the plot]


TIP #9: Reset widget settings.

Orange is geared to remember your last settings, thus assisting you in rapid analysis. However, sometimes you need to start anew. Go to Options – Reset widget settings… and restart Orange. This will return Orange to its original state.

 

TIP #10: Use Educational add-on.

To learn how some algorithms work, use the Orange3-Educational add-on. It contains four widgets that let you look behind the scenes of some famous algorithms. And since they’re interactive, they’re also a lot of fun!

[Image: widgets in the Educational add-on]


Visualization of Classification Probabilities

This is a guest blog from the Google Summer of Code project.

 

The Polynomial Classification widget was implemented as part of my Google Summer of Code project, along with other widgets in the educational add-on (see my previous blog). It visualizes probabilities for two-class classification (target vs. rest) using a color gradient and contour lines, and it can do so for any Orange learner.

Here is an example workflow. The data comes from the File widget. With no learner on the input, the default is Logistic Regression. The widget outputs Coefficients, a Classifier (model), and a Learner.

[Image: example workflow with the Polynomial Classification widget]

The Polynomial Classification widget works on two continuous features only; all other features are ignored. The screenshot shows a plot of classification for the Iris data set.

[Image: Polynomial Classification widget, annotated]

  1. Set the name of the learner. This is the name of the learner on the output.
  2. Set the two features on which the classification is performed.
  3. Set the target class, which is classified against all the other classes.
  4. Set the degree of the polynomial used to transform the input data (1 means the attributes are not transformed).
  5. Select whether to show contour lines in the chart. The density of the contours is regulated by Contour step.

 

The classification in our case fails to separate Iris-versicolor from the other two classes. This is because logistic regression is a linear classifier, and because there is no linear combination of the two chosen attributes that would make for a good decision boundary. We can change that. Polynomial expansion adds features that are polynomial combinations of the original ones. For example, if the input data contains features [a, b], polynomial expansion of degree two generates the feature space [1, a, b, a², ab, b²]. With this expansion, the classification boundary looks great.
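
The expansion itself is easy to reproduce in code; here is a sketch using scikit-learn's PolynomialFeatures, which computes the same degree-two feature space as described above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])           # a single sample with features [a, b]
poly = PolynomialFeatures(degree=2)  # expands to [1, a, b, a², ab, b²]
print(poly.fit_transform(X))         # [[1. 2. 3. 4. 6. 9.]]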

[Image: classification boundary with polynomial expansion of degree two]

 

Polynomial Classification also works well with other learners. Below we have given it a Classification Tree. This time we painted the input data using Paint Data, a great data generator to use while learning about Orange and data science. The decision boundaries for the tree are all axis-aligned rectangles, a well-known limitation of tree-based learners.

[Image: Classification Tree boundaries on data painted with Paint Data]

 

Polynomial expansion with high degrees may be dangerous. The following example shows overfitting when the degree is five. See the two outliers, a blue one at the top and a red one at the lower right of the plot? The classifier went out of its way to separate them from the pack, something that will become problematic when the classifier is used on new data.

[Image: overfitting with polynomial expansion of degree five]

Overfitting is one of the central problems in machine learning. You are welcome to read our previous blog on this problem and possible solutions.

Interactive k-Means

This is a guest blog from the Google Summer of Code project.

 

As a part of my Google Summer of Code project I started developing educational widgets and assembling them into an Educational Add-On for Orange. Educational widgets can be used by students to understand how some key data mining algorithms work, and by teachers to demonstrate the working of these algorithms.

Here I describe an educational widget for interactive k-means clustering, an algorithm that splits the data into clusters by finding cluster centroids such that the distance between data points and their corresponding centroids is minimized. The number of clusters in k-means is denoted by k and has to be specified manually.

The algorithm starts by randomly positioning the centroids in the data space and then improves their positions by repeating the following two steps (a minimal code sketch follows the list):

  1. Assign each point to the closest centroid.
  2. Move each centroid to the mean position of the points assigned to it.
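
Illustratively, one iteration of these two steps in numpy might look like this (a sketch, not the widget's actual code; it assumes no cluster is left empty):

import numpy as np

def kmeans_step(X, centroids):
    # step 1: assign each point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 2: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids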

The widget needs data, which can come from the File widget, and outputs information on clusters (Annotated Data) and centroids:

[Image: workflow schema for the k-means widget]

The educational widget for k-means finds clusters based on two continuous features only; all other features are ignored. The screenshot shows a plot of the Iris data set and clustering with k=3. That is partially cheating, because we know that the Iris data set has three classes, so we can check whether the clusters correspond well to the original classes:

[Image: interactive k-means widget, annotated]

  1. Select the two features used in k-means.
  2. Set the number of centroids.
  3. Randomize the positions of the centroids.
  4. Show lines between the centroids and their corresponding points.
  5. Perform the algorithm step by step: Reassign membership connects each point to the nearest centroid, Recompute centroids moves the centroids.
  6. Step back in the algorithm.
  7. Set the speed of automatic stepping.
  8. Run the whole algorithm as a fast preview.
  9. At any time we can change the number of centroids with the spinner or with a click at the desired position in the graph.

If we want to see the correspondence between the clusters found by k-means and the original classes, we can open a Data Table widget. There we see that all Iris-setosa instances fall into a single cluster, while only a few Iris-versicolor instances are placed in the same cluster as the Iris-virginica instances, and vice versa.

[Image: Data Table with cluster assignments]
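
The same check can be scripted; below is a sketch using scikit-learn's KMeans on the same two features (not the widget's code, and note that the cluster numbering is arbitrary):

import numpy as np
import Orange
from sklearn.cluster import KMeans

iris = Orange.data.Table("iris")
X = iris.X[:, 2:4]  # petal length and petal width
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
for c in range(3):
    # count how many instances of each class end up in cluster c
    counts = np.bincount(iris.Y[clusters == c].astype(int), minlength=3)
    print("cluster", c, dict(zip(iris.domain.class_var.values, counts)))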

Interactive k-means works great in combination with Paint Data. There, we can design data sets where k-means fails, and observe why.

[Image: k-means failing on a painted data set]

We can also design data sets where k-means fails only under a specific initialization of the centroids. Ah, I did not tell you that you can freely move the centroids and then restart the algorithm. Below we show a case of unfortunate centroid initialization and how it leads to non-optimal clustering.

[Image: non-optimal clustering after unfortunate centroid initialization]

Scripting with Time Variable

It’s always fun to play around with data. And since Orange can, as of a few months ago, read temporal data, we decided to parse some data we had and put it into Orange.

TimeVariable is an extension of the continuous variable class, and it works with properly formatted ISO 8601 datetimes (Y-M-D h:m:s). Oftentimes our original data is not in the right format and needs to be edited first so that Orange can read it. Python’s own datetime module is of great help: you pass it a date string together with a format argument that tells it how to interpret the string.

import datetime

date = "13.03.2013 13:13:31"
new_date = str(datetime.datetime.strptime(date, "%d.%m.%Y %H:%M:%S"))
print(new_date)  # 2013-03-13 13:13:31

 

Do this for all your datetime attributes. This will transform them into strings that Orange’s TimeVariable can read. Then create a new data table:

import Orange
from Orange.data import TimeVariable

# create a new TimeVariable object
time_var = TimeVariable("timestamp")
domain = Orange.data.Domain([time_var])

timestamps = ["2013-03-13 13:13:31", "2014-04-14 14:14:41", "2015-05-15 15:15:51"]
# parse each string into a float with time_var.parse(s); wrapping each
# parsed value in its own list gives the 2-D structure that Table expects
time_data = Orange.data.Table(domain, [[time_var.parse(s)] for s in timestamps])

 

Now say you have some original data you want to append your new data to.

# axis=1 appends the new time column to the existing rows
data = Orange.data.Table.concatenate([original_data, time_data], axis=1)
data.save("data.tab")

 

But what if you want to select only a few attributes from the original data? It can be arranged.

from Orange.data import Domain

original_data = Orange.data.Table("original_data.tab")
new_domain = Domain(["attribute_1", "attribute_2"], source=original_data.domain)
new_data = Orange.data.Table(new_domain, original_data)

 

Then concatenate again:

data = Orange.data.Table.concatenate([new_data, time_data], axis=1)
data.save("selected_data.tab")

 

Remember, if your data has string variables, they will always be in meta attributes.

# attribute names are looked up in the source domain; string variables go into metas
domain = Domain(["some_attribute1", "other_attribute2"], metas=["some_string_variable"], source=original_data.domain)

 

Have fun scripting!

Orange at Google Summer of Code 2016

The Orange team is extremely excited to be a part of this year’s Google Summer of Code! GSoC is a great opportunity for students around the world to spend their summer contributing to open-source software, gaining experience and earning money.

Accepted students will help us develop Orange (or another chosen OSS project) from May to August. Each student is expected to select and define a project of their interest and will be assigned a mentor to guide them through the entire process.

 

Apply here:

https://summerofcode.withgoogle.com/

Orange’s project proposals (we accept your own ideas as well!):

https://github.com/biolab/orange3/wiki/GSoC-2016

Our GSoC community forum:

https://groups.google.com/forum/#!forum/orange-gsoc

 

Spread the word! (and don’t forget to apply 😉 )

 


Getting Started Series: Part Two

We’ve recently published two more videos in our Getting Started with Orange series. The series is intended to introduce beginners to Orange and teach them how to use its components.

 

The first video explains how to do hierarchical clustering and how to select interesting subsets directly in Orange, while the second video introduces classification trees and predictive modelling.

The seventh video in the series will address how to score classification and regression models with different evaluation methods. The fruits and vegetables data set can be found here.

 

If you have an idea what you’d like to see in the upcoming videos, let us know!

Mining our own data

Recently we made a short survey that was shown upon Orange download, asking people how they found out about Orange, what their data mining level is and where they work. The main purpose of this is to get a better insight into our user base and to figure out the profile of people interested in trying Orange.

Here we have some preliminary results that we’ve managed to gather in the past three weeks or so. Obviously we will use Orange to help us make sense of the data.

 

We’ve downloaded our data from Typeform and appended some background information such as OS and browser. Let’s see what we’ve got in the Data Table widget.

[Image: Data Table with the survey results]

 

Ok, this is our entire data table. It includes both the people who completed the survey and those who didn’t. First, let’s organize the data properly. We’ll do this with the Select Columns widget.

[Image: Select Columns widget]

 

We removed all the meta attributes, as they are not very relevant for our analysis. Next, we moved the ‘completed’ attribute to the target variable slot, thus making it our class variable.

[Image: rearranged columns with ‘completed’ as the target variable]

 

Now we would like to see some basic distributions from our data.

[Image: Distributions by operating system]

 

Interesting. Most of our users are working on Windows, a few on Mac and very few on Linux.

Let’s investigate further. Now we want to know more about the people who actually completed the survey. Let’s use Select Columns again, this time removing os_type, os_name, agent_name and completed from our data and keeping just the answers. We made “Where do you work?” our class variable, but we could have used any of the three. Another trick is to set it directly in the Distributions widget under ‘Group by’.

[Image: Select Columns with “Where do you work?” as the class variable]

 

Ok, let’s again use Distributions – this is such a simple way to get a good sense of your data.

[Image: Distributions for “How did you find out about Orange?”]

 

Obviously, out of those who found out about Orange in college most are students, but what’s interesting here is just how many there are. We can also see that out of those who found us on the web, most come from the private sector, followed by academia and researchers. Good. How about the other question?

[Image: Distributions for data mining level]

 

Again, the results are not particularly shocking, but it’s great to confirm your hypothesis with real data. Out of beginner-level data miners, most are students, while most intermediate users come from the industry.

A quick look at the Mosaic Display will give us a good overview:

[Image: Mosaic Display of the survey data]

 

Yup, this sums it up quite nicely. We have lots of beginner-level users and not many expert ones (height of the boxes). Also, most people found out about Orange on the web or in college (width of the boxes). The thin line on the left of each box shows the a priori distribution, making it easier to compare the expected and the actual number of instances. For example, there should be at least some people who are students and found out about Orange at a conference. But there aren’t – the contrast between how much red there should be in the box (the line on the left) and how much there actually is (the body of the box) is quite telling. We could even select all the beginner-level users who found out about Orange in college and inspect the data further, but that is enough for now.

Our final workflow:

[Image: final workflow]

Obviously, this is a very simple analysis. But even such simple tasks are never boring with good visualization tools such as Distributions and Mosaic Display. You could also use Venn Diagram to find common features of selected subsets or perhaps Sieve Diagram for probabilities.

 

We are very happy to have these data and we would like to thank everyone who completed the survey. If you wish to help us further, please fill out a longer survey that won’t take more than 3 minutes of your time (we timed it!).

 

Happy Friday everyone!

Learners in Python

We’ve already written about classifying instances in Python. However, it’s always nice to have a comprehensive list of classifiers and a step-by-step procedure at hand.

 

TRAINING THE CLASSIFIER

We start by importing the Orange module into Python and loading our data set.

>>> import Orange
>>> data = Orange.data.Table("titanic")

We are using the ‘titanic.tab’ data. You can load any data set you want, but it has to have a categorical class variable (for numeric targets, use regression). Now we want to train our classifier.

>>> learner = Orange.classification.LogisticRegressionLearner()
>>> classifier = learner(data)
>>> classifier(data[0])

Python returns the index of the value, as usual.

array([ 0.])

To check what’s in the class variable we print:

>>>print("Name of the variable: ", data.domain.class_var.name)
>>>print("Class values: ", data.domain.class_var.values)
>>>print("Value of our instance: ", data.domain.class_var.values[0])

Name of the variable: survived
Class values: no, yes
Value of our instance: no
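
For numeric targets the procedure is analogous with a regression learner; here is a minimal sketch using the housing data set bundled with Orange:

housing = Orange.data.Table("housing")            # numeric class variable
regressor = Orange.regression.LinearRegressionLearner()(housing)
regressor(housing[0])                             # predicted numeric value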

 

PREDICTIONS

If you want to get predictions for the entire data set, just give the classifier the entire data set.

>>> classifier(data)

array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

If we want to append predictions to the data table, we first use the classifier on the data, then create a new domain with an additional meta attribute, and finally form a new data table with the appended predictions:

predictions = classifier(data)

# a separate discrete variable to hold the predictions as a meta attribute
predicted = Orange.data.DiscreteVariable("predicted", values=data.domain.class_var.values)

new_domain = Orange.data.Domain(data.domain.attributes, data.domain.class_vars, metas=[predicted])

table2 = Orange.data.Table(new_domain, data.X, data.Y, predictions.reshape(-1, 1))

We use .reshape(-1, 1) to turn the vector of predictions into a single column, since meta attributes are stored in a 2-D array. Then we print out the data.

print(table2)

 

PARAMETERS

Want to use another classifier? The procedure is the same, simply use:

Orange.classification.<algorithm-name>()
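
For example, to swap in a classification tree (a quick sketch; max_depth is optional and shown only for illustration):

tree_learner = Orange.classification.TreeLearner(max_depth=3)
tree = tree_learner(data)
tree(data[:5])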

For most classifiers, you can set a whole range of parameters. Logistic Regression, for example, uses the following:

learner = Orange.classification.LogisticRegressionLearner(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, preprocessors=None)

To check the parameters for the classifier, use:

print(Orange.classification.SVMLearner())

 

PROBABILITIES

Another thing you can check with classifiers is the predicted probabilities.

classifier(data[0], Orange.classification.Model.ValueProbs)

>>> (array([ 0.]), array([[ 1.,  0.]]))

The first array is the value for your selected instance (data[0]), while the second array contains probabilities for class values (probability for ‘no’ is 1 and for ‘yes’ 0).
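
If you need just the probabilities, pass Orange.classification.Model.Probs instead; this works for a single instance or for the whole table:

probabilities = classifier(data, Orange.classification.Model.Probs)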

 

CLASSIFIERS

And because we care about you, here is a full list of classifier names:

LogisticRegressionLearner()
NaiveBayesLearner()
KNNLearner()
TreeLearner()
MajorityLearner()
RandomForestLearner()
SVMLearner()

 

For other learners, you can find all the parameters and descriptions in the documentation.

 

Data Mining Course in Houston

We have just completed an Introduction to Data Mining, a graduate course at Baylor College of Medicine in Houston, Texas. The course was given in September and consisted of seven two-hour lectures, each followed by a homework assignment. The course was attended by about 40 students and some faculty and research staff.

[Image: the data mining course at Baylor College of Medicine]

This was a challenging course. The audience was new to data mining, and we decided to teach them with the newest, third version of Orange. We also experimented with two course instructors (Blaz and Janez) who, instead of splitting the course into two parts, taught simultaneously, one at the board and the other helping the students with hands-on exercises. To check whether this worked well, we ran a student survey at the end of the course. We used Google Sheets and then examined the results with the students in class. Using Orange, of course.

[Image: survey results in Google Sheets]

And the outcome? Looks like the students really enjoyed the course

[Image: survey responses on recommending the course to a friend]

and the teaching style.

[Image: survey responses on teaching style]

The course took advantage of several new widgets in Orange 3, including those for data preprocessing and polynomial regression. The core development team put in a lot of effort over the summer to debug and polish this newest version of Orange. Thanks also to the financial support of the AXLE EU FP7 and CARE-MI EU FP7 grants and grants from the Slovenian Research Agency, we were able to finish everything in time.

Scatter Plot Projection Rank

One of the nicest and surely most useful visualization widgets in Orange is Scatter Plot. The widget displays a 2-D plot, where the x and y axes are two attributes from the data.

[Image: 2-dimensional scatter plot visualization]

 

Orange 2.7 had a wonderful feature called VizRank, which is now also implemented in Orange 3. The Rank Projections functionality enables you to find interesting attribute pairs by scoring their average classification accuracy. Click ‘Start Evaluation’ to begin ranking.

[Image: Rank Projections before ranking is performed]

 

The functionality will also instantly adapt the visualization to the best-scored pair. Select other pairs from the list to compare visualizations.

[Image: Rank Projections once the attribute pairs are scored]

 

Rank suggested petal length and petal width as the best pair, and indeed, the visualization below is much clearer (better separated).

[Image: Scatter Plot once the visualization is optimized]

 

Have fun trying out this and other visualization widgets!