Upcoming Orange Data Science Course in Ljubljana

From 15th to 19th July 2019, the Orange team will hold an introductory data science course at this year's Doctoral Summer School, organized by the School of Economics and Business, University of Ljubljana. This is the second year we will be a part of this summer school. Like last year, we will cover a wide variety of topics, from exploratory data analysis and clustering techniques to predictive modeling and data projections. Applications are open to PhD students, post-docs, academics, and professionals until the end of June.


What: Practical Introduction to Machine Learning and Data Analytics.

Course description here.

When: 15 – 19 July 2019

Who: Blaž Zupan, Marko Toplak, Ajda Pretnar

Credits: 4 ECTS

Apply here.


Don’t forget to check the other courses as well!

Gene Expression Profiles with Line Plot

Line Plot is one of our recent additions to the visualization widgets. It shows data profiles, meaning it plots values for all features in the data set. Each data instance in a line plot is a line or a ‘profile’.

The widget can show four types of information – individual data profiles (lines), the data range, the mean profile, and error bars. It has the same cool features as other Orange visualizations – it is interactive, meaning you can select a subset of data instances directly in the plot, it supports grouping by a discrete variable, and it highlights an incoming data subset.

Related: Scatter Plot: The Tour

Let us check a simple example. We will use the brown-selected data set, which contains gene expression data for baker's yeast. To observe the gene expression profiles, we will use Line Plot.
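For those who prefer scripting, the same data can also be loaded through Orange's Python API. The brown-selected data set ships with Orange, so this minimal sketch should work as-is:

import Orange

# brown-selected: gene expression profiles of baker's yeast, with a class
# variable denoting the function of each gene
data = Orange.data.Table("brown-selected")
print(len(data), "genes,", len(data.domain.attributes), "features")
print("gene functions:", data.domain.class_var.values)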

Since the data has a class variable, which represents the function of each gene, Line Plot will automatically group the profiles by class. It seems like protease, respiratory, and ribosome genes have quite distinctive profiles! Let us zoom in on the most interesting region of the plot by selecting the zoom tool and dragging across the area of interest.

We see that the spo-mid feature distinguishes really well between protease and the two other gene types, and that values of protease genes are typically high for spo-mid.

Another thing we can do is select a subset from the plot. If we press the 'rectangle' icon on the left, the plot will be reset to its original size. Then we press the 'arrow' icon, which puts us back into selection mode. Now let us display Lines instead of Range and Mean. This will show the individual expression profiles.

If we click and drag across an area of interest, the instances under the thick black line will be selected. We can connect, say, a Box Plot to the Line Plot and observe the distribution of the selected subset. Unsurprisingly, the genes we have selected are mostly proteases.

This is it. Line Plot is really simple to use and can reveal many interesting things, not only to biologists but to any kind of data analyst. Next week we will talk about how to work with time series data in combination with the Line Plot.

Business Case Studies with Orange

Last week Blaž, Robert, and I visited Wärtsilä in the lovely Dolina near Trieste, Italy. Wärtsilä is one of the leading designers of lifecycle power solutions for the global marine and energy markets, and its subsidiary in Trieste is one of the largest engine production plants in the Wärtsilä Group. We were there to hold a one-day workshop on data mining and machine learning, with the aim of identifying relevant business use cases and showing how to address them.

Related: Data Mining for Business and Public Administration

One such important use case is employee attrition. It is vital for any company to retain its most valuable workers, so it must learn how to identify dissatisfied employees and provide incentives for them to stay. It is easy to construct a workflow in Orange that helps us with this.

First, let us load the Attrition – Train data set from the Datasets widget. This is a synthetic data set from IBM Watson with 1470 instances (employees) and 18 features describing them. Our target variable is Attrition, where Yes means the person left the company and No means they stayed.
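The same first step in Orange's scripting interface might look like this (a sketch; the file name is our guess and assumes you have saved the Attrition – Train data set locally):

import Orange

# hypothetical file name; save the "Attrition - Train" data set from the
# Datasets widget as attrition-train.tab first
train = Orange.data.Table("attrition-train.tab")
print(len(train), "employees,", len(train.domain.attributes), "features")
print("target:", train.domain.class_var.name, train.domain.class_var.values)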

Now our goal is to construct a model that successfully predicts the likelihood of a person leaving. Let us connect a couple of classifiers and the data set to Test and Score and see which model performs best.

It seems like Logistic Regression is the winner here, since its AUC score is the highest of the three.
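In code, the comparison could look roughly like this, continuing from the sketch above (the choice of the two competitors to Logistic Regression is ours, for illustration):

# compare a few classifiers with 10-fold cross validation,
# just like Test and Score does
learners = [
    Orange.classification.LogisticRegressionLearner(),
    Orange.classification.TreeLearner(),
    Orange.classification.RandomForestLearner(),
]
results = Orange.evaluation.CrossValidation(train, learners, k=10)
for learner, auc in zip(learners, Orange.evaluation.AUC(results)):
    print(learner.name, auc)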

A great thing about Logistic Regression is that it is interpretable. We can connect the data from Datasets to Logistic Regression and the resulting model from Logistic Regression to Nomogram. Nomogram shows the top ten features, ranked by their contribution to the final probability of a class.

The length of a line corresponds to the relative importance of the attribute. It seems like recently hired employees are more likely to leave (YearsAtCompany goes towards 0). We should also consider promoting those who haven't been promoted in a while (YearsSinceLastPromotion goes towards 15) and cutting down on overtime (OverTime is Yes). Model inspection helps us identify relevant attributes and interpret their values. So useful for HR departments!
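There is no scripting equivalent of the Nomogram's interactive display, but as a rough stand-in we can rank features by the absolute values of the logistic regression coefficients (a sketch; note that the model internally works on preprocessed, continuized features, so the feature names may differ slightly from the original ones):

# fit logistic regression on the whole training set and list the ten
# features with the largest absolute coefficients
model = Orange.classification.LogisticRegressionLearner()(train)
weights = zip(model.domain.attributes, model.coefficients[0])
for attr, coef in sorted(weights, key=lambda w: -abs(w[1]))[:10]:
    print("%s: %+.3f" % (attr.name, coef))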

Finally, we can take new data and predict the likelihood of leaving. Put another Datasets widget on the canvas and load the Attrition – Predict data. This one contains only three instances – say, the data for three employees we forgot to include in our training data.

So who is most likely to leave? We obviously cannot afford to promote everyone, because this costs money. We need to optimize our decisions so that we increase the satisfaction of employees while keeping our costs low. This is where we can use predictive modeling. Connect Logistic Regression to the Predictions widget, then connect the second Datasets widget with the new data to Predictions as well.
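In a script, this step is a single call of the trained model on the new data (again a sketch with a hypothetical file name, and assuming the class values are ordered No, Yes):

# hypothetical file name for the "Attrition - Predict" data set
new_employees = Orange.data.Table("attrition-predict.tab")
probabilities = model(new_employees, model.Probs)
for employee, probs in zip(new_employees, probabilities):
    # probs[1] is the predicted probability of the "Yes" class, i.e. leaving
    print(employee, "-> P(leave) = %.2f" % probs[1])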

Seems like John is most likely to leave. He has been at the company for only a year and he works overtime.

This is something the HR department can work with to design proper policies and keep the best talent. The same workflow can be used for churn prediction, process optimization, and predicting the success of a new product.

Orange at GIS Ostrava

Ostrava is a city in the north-east of the Czech Republic and the capital of the Moravian-Silesian Region. GIS Ostrava is a yearly conference organized by Jiří Horák and his team at the Technical University of Ostrava. The university has a nice campus with a number of new developments. I have learned that this is the largest university campus in central and eastern Europe, as most universities, like mine, are city universities with buildings dispersed around the city.

During the conference, I gave an invited talk on “Data Science for Everyone” and showed how Orange can be used to teach basic data science concepts in a few hours, so that trainees can gain some intuition about what data science is and then, preferably, use the software on their own data. To prove this concept, I gave an example workshop on the following day of the conference. The workshop was also attended by several teachers who are thinking of incorporating Orange into their data science curricula.

Admittedly, there was not much GIS in my presentations, as I – as planned – focused more on data science. But I did include an example of how to project data in Orange onto geographical maps. The example involved the analysis of Human Development Index data and clustering. When projected onto the map, the results of clustering can be unexpected if we select only the features that address quality of life: check out the map below and try to figure out what is wrong.

Here, I would like to thank Igor Ivan and Jiří Horák for the invitation, and their group, specifically Michal Kacmarik, for the hospitality.

The Changing Status Bar

Every week on Friday, when the core team of Orange developers meets, we design new improvements to Orange's graphical interface. This time, it was the status bar. Well, actually, we designed it quite a while ago, since the change required modifying the core widget library, but it is materializing these days and you will see the changes in the next release.

Consider the Neighbors widget. The widget takes input data and reference data items, and outputs the instances from the input data that are most similar to the references. Like, if a dolphin is the reference, we would like to know which three animals are most similar to it. But this is not what I wanted to write about. I would only like to say that we are making a slight change in the user interface. Below is the Neighbors widget in the current release of Orange, and in the upcoming one.

See the difference? We are getting rid of the infobox at the top of the control pane and moving it to the status bar. In the infobox, widgets typically display what is on their input and what is on their output after the data has been processed. Moving this information to the status bar will make widgets more compact and less cluttered. We will change the infoboxes in this way in all of the widgets.

Single-Cell Data Science for Everyone

In the past twenty years, molecular biologists have invented technologies that can collect abundant experimental data. One such technique is single-cell RNA-seq, which, much simplified, can measure the activity of genes in possibly large collections of cells. The interpretation of such data can tell us about the heterogeneity of cells and cell types, or provide information on their development.

Typical analysis toolboxes for single-cell data are available in R and Python and, most notably, include Seurat and scanpy, but they lack the interactive visualizations and simplicity of Orange. Since the fall of 2017, we have been developing a single-cell extension of Orange, which is now (almost) ready. It has even been packaged into its own installer. The first real test of the software was in early 2018, through a one-day workshop at Janelia Research Campus. On March 6, with a much more refined version of the software, we repeated the hands-on workshop at the University of Pavia.

The five-hour workshop covered both the essentials of data analysis and single-cell analytics. The topics included data preprocessing, clustering, and two-dimensional embedding, as well as working with marker genes, differential expression analysis, and the interpretation of clusters through gene ontology analysis.

I want to thank Prof. Dr. Riccardo Bellazzi and his team for the organization, and the Erasmus program for financial support. I have been a frequent guest in Pavia, and I learn something new every time I am there. Besides meeting the new students and colleagues who attended the workshop and hearing about their data analysis challenges, this time I also learned about a dish I had never had before in all my Italian travels. For one of the dinners (thank you, Michela) we had pizzoccheri. Simply great!

The Mystery of Test & Score

Test & Score is surely one of the most used widgets in Orange. Fun fact: it is the fourth most popular, right after Data Table, File, and Scatter Plot. So let us dive into the nuts and bolts of the Test & Score widget.

The widget generally accepts two inputs – Data and Learner. Data is the data set that we will use for modeling, say iris.tab, which comes pre-loaded in the File widget. Learner is any kind of learning algorithm, for example Logistic Regression. You can only use learners that support your type of task: for classification you cannot use Linear Regression, and for regression you cannot use Logistic Regression. Most other learners support both tasks. You can connect more than one learner to Test & Score.

Test & Score will now use each connected Learner and the Data to build and evaluate predictive models. Models can be evaluated in different ways. The most typical procedure is cross validation, which splits the data into k folds and uses k − 1 folds for training and the remaining fold for testing. This procedure is repeated so that each fold is used for testing exactly once. Test & Score then reports the average accuracy of the model.

You can also use Random Sampling, which splits the data into two sets in predefined proportions (e.g. 66% : 34%), builds a model on the first set, and tests it on the second. This is similar to cross validation, except that the sampling is repeated several times, so each data instance can be used more than once for testing.

Leave one out is again very similar to the above two methods, but it takes only a single data instance for testing each time. If you have 100 data instances, then 99 will be used for training and 1 for testing, and the procedure will be repeated 100 times, so that every data instance is used exactly once for testing. As you can imagine, this is a very time-intensive procedure, and it is recommended for smaller data sets only.

Test on train data uses the whole data set for training and then the same data for testing. Because of overfitting, this will usually overestimate the performance! Test on test data requires an additional data input (Test Data) and lets the user control both data sets (training and testing) used for evaluation.
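Each of these procedures also has a scripting counterpart in the Orange.evaluation module. Here is a minimal sketch on the iris data, using the documented call style in which data and learners are passed directly (newer Orange versions may prefer constructing the method first and then calling it on the data):

import numpy as np
import Orange

data = Orange.data.Table("iris")
learners = [Orange.classification.LogisticRegressionLearner()]

cv = Orange.evaluation.CrossValidation(data, learners, k=10)
sampling = Orange.evaluation.ShuffleSplit(data, learners)  # random sampling
loo = Orange.evaluation.LeaveOneOut(data, learners)
on_train = Orange.evaluation.TestOnTrainingData(data, learners)

# test on test data needs a separate test set; here we split iris randomly
shuffled = np.random.permutation(len(data))
train, test = data[shuffled[:100]], data[shuffled[100:]]
on_test = Orange.evaluation.TestOnTestData(train, test, learners)

for name, res in (("cross validation", cv), ("random sampling", sampling),
                  ("leave one out", loo), ("test on train", on_train),
                  ("test on test", on_test)):
    print(name, Orange.evaluation.CA(res)[0])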

Finally, you can also use cross validation by feature. Sometimes you have pre-defined folds for a procedure that you wish to replicate. Then you can use cross validation by feature to ensure that data instances are split into the same folds every time. Just make sure the feature that defines the folds is a categorical variable and is placed among the meta attributes.

Another scenario is when you have several examples from the same object, for example several measurements of the same patient or several images of the same plant. Then you absolutely want to make sure that all data instances for a particular object are in the same fold. Otherwise, your model would likely report severely overoptimistic scores.

How to Abuse p-Values in Correlations

In a parallel universe, not so far from ours, Orange's Correlations widget looks like this.

Quite similar to ours, except that this one shows p-values instead of correlation coefficients. That is actually better, isn't it? I mean, we have all attended Statistics 101, and we know that we can never trust correlation coefficients without checking the p-values to see whether those correlations are real, right? So why on Earth doesn't Orange show them?

First a side note. It was Christmas not long ago. Let’s call a ceasefire on the frequentist vs. Bayesian war. Let us, for Christ’s sake, pretend, pardon, agree that null-hypothesis testing is not wrong per se.

The mantra of null-hypothesis significance testing goes like this:

1. Form hypothesis.
2. Collect data.
3. Test hypothesis.

In contrast, the parallel-universe Correlations widget is (ab)used like this:

1. Collect data.
2. Test all possible hypotheses.
3. Cherry pick those that are confirmed.

This is like the Texas sharpshooter who fires first and then draws targets around the shots. You should never formulate a hypothesis from some data and then use the same data to prove it. Because it usually (surprise!) works.

Illustration by Dirk-Jan Hoek (CC-BY).


Back to the above snapshot. It shows correlations between 100 vegetables based on 100 different measurements (Ca and Mg content, their consumption in Finland, the number of mentions in the Star Trek DS9 series, the likelihood of finding them on Mars, and so forth). In other words, it's all made up: just a 100×100 matrix of random numbers with column labels taken from the simple Wikipedia list of vegetables. Yet the similarity between mung beans and sunchokes surely cannot be dismissed (p < 0.001). Those who like bell peppers should try cilantro, too, because they are basically one and the same thing (p = 0.001). And I honestly can't tell black beans from wasabi (p = 0.001).

Here are the p-values for the top 100 most correlated pairs.

import numpy as np
from scipy import stats

a = np.random.random((100, 100))
# p-values for all 100 * 99 / 2 pairwise correlations, sorted ascending
sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))[:100]
[0.0002774329730584203, 0.0004158786523819104, 0.0005008536192579852,
0.0007211022164265075, 0.0008268675086438253, 0.0010265740674904762,
(...91 values omitted to reduce the nonsense)
0.01844720610938738, 0.018465602922746942, 0.018662079618069056]

The first 100 correlations are all highly significant.

To learn the lesson we may have failed to grasp in the NHST 101 class, consider that there are 100 × 99 / 2 = 4950 pairs. What is the p-value of the pair at the 5th percentile?

pvalues = sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))
npairs = 100 * 99 // 2
print(pvalues[int(npairs * 0.05)])
0.0496868751692227

Roughly 0.05. This is exactly what should have happened, because:

pvalues[int(npairs * 0.10)]
0.10004180592217532
pvalues[int(npairs * 0.15)]
0.15236602574520097
pvalues[int(npairs * 0.30)]
0.3026816170584785

This proves only that p-values for the Pearson correlation coefficient are well calibrated (and that the Mersenne twister used to generate random numbers in numpy works well). In theory, the p-value of a certain statistic (like Pearson's r) is the probability of getting such a value, or an even more extreme one, if the null hypothesis (of no correlation, in our case) is actually true. Hence 5 % of random hypotheses should have a p-value below 0.05, 10 % a value below 0.1, and 23 % a value below 0.23.

Imagine what they can do with the Correlations widget in the parallel universe! They compute correlations between all pairs, print out the top 5 % of them, and start writing a paper without bothering to look at the p-values at all. They know these will be statistically significant even if the data is random.

Which is precisely the reason why our widget must not compute p-values: people would use it for Texas sharpshooting. P-values make sense only in the context of the proper NHST procedure (still pretending, for the sake of the Christmas ceasefire). They cannot be computed from the same data on which the hypotheses were formulated.

If so, why do we have the Correlations widget at all, if its results are unpublishable? We can use it to find highly correlated pairs in a data sample. But we can't just attach p-values to them and publish them. By finding these pairs (with the assistance of the Correlations widget) we merely formulate hypotheses. This is only step 1 of the enshrined NHST procedure. We can't skip the other two: the next step is to collect some new data (the existing data won't do!) and then use it to test the hypotheses (step 3).

Following this procedure doesn't save us from data dredging. There are still plenty of ways to cheat. The most tempting is to select the 100 most correlated pairs (or, actually, any 100 pairs), (re)compute correlations on some new data, and publish the top 5 % of these pairs. The official solution for this is a patchwork of various corrections for multiple hypothesis testing, but… well, they don't work, but we should say no more here. You know, Christmas ceasefire.
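(For the curious, and without breaking the ceasefire: here is a quick sketch of what the simplest such correction, Bonferroni, does to the random matrix from above. With 4950 tests, the threshold for an overall 0.05 level drops to 0.05 / 4950 ≈ 0.00001, and in a typical run no random pair survives it.)

import numpy as np
from scipy import stats

a = np.random.random((100, 100))
pvalues = [stats.pearsonr(a[i], a[j])[1]
           for i in range(100) for j in range(i)]
# Bonferroni: reject only hypotheses with p < alpha / number of tests
print(sum(p < 0.05 / len(pvalues) for p in pvalues))  # typically 0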

Scatter Plots: the Tour

Scatter plots are surely among the best-loved visualizations in Orange. Very often, when we teach, people go back to scatter plots over and over again to look at their data. We took people's love for scatter plots to heart and redesigned the widget a bit to make it even friendlier.

Our favorite feature is still the Informative Projections button. It helps you find interesting visualizations among all the combinations of your data variables. But what does interesting mean? Well, let us look at an example. Which of the two visualizations tells you more about the data?

We’d say it is the right one. Why? Because now we know that the combination of petal length and petal width nicely separates the classes!

Of course, the Informative Projections button only works when you have set a class (target) variable.

In the scatter plot, you can also set the color of the data points (the class variable is selected by default), as well as their size and shape. This means you can add three more layers of information to your plot, but we warn you not to overuse them: such plots quickly become incomprehensible, even though they pack a lot of information.

You might notice that in the current version of Orange, you can no longer select discrete attributes in Scatter Plot. This is entirely intentional. Scatter plots are best at showing the relationship between two numeric variables, as in the two examples above. Categorical variables are much better represented with Box Plot, histograms (in Distributions), or Mosaic Display.


Above, we have presented the same information for the Titanic data set in different visualizations that are particularly suitable for categorical variables.

Scatter plot also enables some cool tricks. Just like in most visualizations in Orange, I can select a part of the data and observe the subset downstream. Or the other way around: if I have a particular subset I wish to observe, I can pass it to the Scatter Plot widget, which will highlight the selected data instances.

This is also true for all other point-based visualizations in Orange, namely t-SNE, MDS, Radviz, Freeviz, and Linear Projection.

You can see there are many great things you can do with Scatter Plot. Finally, we have added a nice touch to the visualization.

Yes, setting the point size by an attribute is now animated! 🙂

Happy holidays, everyone!

Orange is Getting Smarter

In the past few months, Orange has been getting smarter and sleeker.

Since version 3.15.0, Orange remembers which widgets users like to connect to each other, adjusting the sorting in the widget search menu accordingly. Additionally, a new look for the Edit Links window is coming soon.

Orange recently implemented a basic form of opt-in usage tracking, specifically targeting how users add widgets to the canvas.

Word cloud of widget popularity in Orange.


The information is collected anonymously for the users who opted in. We will use this data to improve the widget suggestion system. Furthermore, the data gives us a first insight into how users interact with Orange. Let's see what we've found out from the data recorded in the past few weeks.


There are four different ways of adding a widget to the canvas:

  • clicking it in the sidebar,
  • dragging it from the sidebar,
  • searching for it by right-clicking on canvas,
  • extending the workflow by dragging the communication channel from a widget.


A workflow extend action.


Among Orange users, the most popular way of adding a new widget is dragging the communication line from an existing widget's output – we think this is also the most efficient way of using Orange. However, the patterns vary among different widgets.

How users add widgets to canvas, from 20,775 add widget events.


Users tend to add root nodes such as File via a click or drag from the sidebar, while adding leaf nodes such as Data Table via extension from another widget.

How users add File to canvas.

How users add Data Table to canvas.


The widget popularity contest goes to: Data Table! Rightfully so, one should always check their data with Data Table.

Widget popularity visualization in Box Plot.


52% of tracked sessions included no widget additions at all (the application was just opened and closed). While some people might really like watching the loading screen, most of these sessions are likely explained by the fact that usage is not tracked until the user explicitly opts in.


Each bit of collected data comes at a cost to the privacy of the user. Care was put into minimizing the intrusiveness of data collection methods, while maximizing the usefulness of the collected data.

Initially, widget addition events were planned to include a ‘time since application start’ value, in order to be able to plot a user’s actions as a function of time. While this would be cool, it was ultimately decided that its usefulness is outweighed by the privacy cost to users.


For the keen, data is gathered per canvas session, in the following structure (a small example follows the list):

  • Date
  • Orange version
  • Operating system
  • Widget addition events, each entailing:
    • Widget name
    • Type of addition (Click, Drag, Search or Extend)
    • (Other widget name), if type is Extend
    • (Query), if type is Search or Extend
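For illustration, a single session record might then look something like this hypothetical sketch; the field names are ours, not the exact schema:

# a hypothetical session record; field names are illustrative only
session = {
    "date": "2019-02-14",
    "orange_version": "3.15.0",
    "operating_system": "macOS 10.14",
    "widget_additions": [
        {"widget": "File", "type": "Click"},
        {"widget": "Data Table", "type": "Extend",
         "other_widget": "File", "query": "tab"},
    ],
}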