Business Case Studies with Orange

Previous week Blaz, Robert and I visited Wärtsilä in the lovely Dolina near Trieste, Italy. Wärtsilä is one of the leading designers of lifecycle power solutions for the global marine and energy markets and its subsidiary in Trieste is one of the largest Wärtsilä Group engine production plants. We were there to hold a one-day workshop on data mining and machine learning with the aim to identify relevant use cases in business and show how to address them.

Related: Data Mining for Business and Public Administration

One such important use case is employee attrition. It is vital for any company to retain its most valuable workers, so they must learn how to identify dissatisfied employees and provide incentive from the to stay. It is easy to construct a workflow in Orange that helps us with this.

First, let us load Attrition – Train data set from the Datasets widget. This is a synthetic data set from IBM Watson that has 1470 instances (employees) and 18 features describing them. Our target variable is Attrition, where Yes means the person left the company and No means it stayed.

Now our goal is to construct a predictive model that will successfully predict the likelihood of a person leaving. Let us connect a couple of classifiers and the data set to Test and Score and see which model performs best.

Seems like Logistic Regression is the winner here, since its AUC score it the highest of the three.

A great thing about Logistic Regression is that it is interpretable. We can connect the data from Datasets to Logistic Regression and the resulting model from LR to Nomogram. Nomogram shows the top ten features, ranked by their contribution to the final probability of a class.

The length of a line corresponds to the relative importance of the attribute. Seems like recently hired employees are more likely to leave (YearsAtCompany goes towards 0). We also should consider promoting those that haven’t been promoted in a while (YearsSinceLastPromotion goes towards 15) and cut the overtime (OverTime is Yes). Model inspection helps us identify relevant attributes and interpret their values. So useful for HR departments!

Finally, we can take new data and predict the likelihood for leaving. Put another Datasets on the canvas and load Attrition – Predict data. This one contains only three instances – say the data for three employees we have forgotten to consider in our training data.

So who is more likely to leave? We obviously cannot afford to promote everyone, because this costs money. We need to optimize our decisions so that we both increase the satisfaction of employees while keep our costs low. This is where we can use predictive modeling. Connect Logistic Regression to Predictions widget. Then connect the second Datasets widget with the new data to Predictions as well.

Seems like John is most likely to leave. He has been at the company for only a year and he works overtime.

This is something HR department can work with to design proper policies and keep best talent. The same workflow can be used for churn prediction, process optimization and predicting success of a new product.

Orange at GIS Ostrava

Ostrava is a city in the north-east of the Czech Republic and is the capital of the Moravian-Silesian Region. GIS Ostrava is a yearly conference organized by Jiří Horák and his team at the Technical University of Ostrava. University has a nice campus with a number of new developments. I have learned that this is the largest university campus in central and eastern Europe, as most of the universities, like mine, are city universities with buildings dispersed around the city.

During the conference, I gave an invited talk on “Data Science for Everyone” and showed how Orange can be used to teach basic data science concepts in a few hours so that the trainee can gain some intuition about what data science is and then, preferably, use the software on their own data. To prove this concept, I gave an example workshop during the next day of the conference. The workshop was also attended by several teachers that are thinking of incorporating Orange within their data science curricula.

Admittedly, there was not much GIS in my presentations, as I – as planned – focused more on data science. But I did include an example of how to project the data in Orange to geographical maps. The example involved the analysis of Human Development Index data and clustering. When projected to the map, the results of clustering could be unexpected if we select only the features that address quality of life: check out the map below and try to figure out what is wrong.

Here, I would like to thank Igor Ivan and Jiri Horak for the invitation, and their group and specifically Michal Kacmarik for the hospitality.

The Changing Status Bar

Every week on Friday, when the core team of Orange developers meets, we are designing new improvements of Orange’s graphical interface. This time, it was the status bar. Well, actually, it was the status bar quite a while ago and required the change of the core widget library, but it is materializing these days and you will see the changes in the next release.

Consider the Neighbors widget. The widget considers the input data and reference data items, and outputs instance form input data that are most similar to the references. Like, if the dolphin is a reference, we would like to know which are the three most similar animals. But this is not what want I wanted to write about. I would only like to say that we are making a slight change in the user interface. Below is the Neighbors widget in the current release of Orange, and the upcoming one.

See the difference? We are getting rid of the infobox on the top of the control tab, and moving it to the status bar. In the infobox widgets typically display what is in their input and what is on the output after the data has been processed. Moving this information to the status bar will make widgets more compact and less cluttered. We will similarly change the infoboxes in this way in all of the widgets.

Single-Cell Data Science for Everyone

Molecular biologists have in the past twenty years invented technologies that can collect abundant experimental data. One such technique is single-cell RNA-seq, which, very simplified, can measure the activity of genes in possibly large collections of cells. The interpretation of such data can tell us about the heterogeneity of cells, cell types, or provide information on their development.

Typical analysis toolboxes for single-cell data are available in R and Python and, most notably, include Seurat and scanpy, but they lack interactive visualizations and simplicity of Orange. Since the fall of 2017, we have been developing an extension of Orange, which is now (almost) ready. It has even been packed into its own installer. The first real test of the software was in early 2018 through a one day workshop at Janelia Research Campus. On March 6, and with a much more refined version of the software, we have now repeated the hands-on workshop at the University of Pavia.

The five-hour workshop covered both the essentials of data analysis and single cell analytics. The topics included data preprocessing, clustering, and two-dimensional embedding, as well as working with marker genes, differential expression analysis, and interpretation of clusters through gene ontology analysis.

I want to thank Prof. Dr. Riccardo Bellazzi and his team for the organization, and Erasmus program for financial support. I have been a frequent guest to Pavia, and learn something new every time I am there. Besides meeting new students and colleagues that attended the workshop and hearing about their data analysis challenges, this time I have also learned about a dish I had never had before in all my Italian travels. For one of the dinners (thank you, Michela) we had Pizzoccheri. Simply great!

The Mystery of Test & Score

Test & Score is surely one the most used widgets in Orange. Fun fact: it is the fourth in popularity, right after Data Table, File and Scatter Plot. So let us dive into the nuts and bolts of the Test & Score widget.

The widget generally accepts two inputs – Data and Learner. Data is the data set that we will be using for modeling, say, iris.tab that is already pre-loaded in the File widget. Learner is any kind of learning algorithm, for example, Logistic Regression. You can only use those learners that support your type of task. If you wish to do classification, you cannot use Linear Regression and for regression you cannot use Logistic Regression. Most other learners support both tasks. You can connect more than one learner to Test & Score.

Test & Score will now use each connected Learner and the Data to build a predictive model. Models can be built in different ways. The most typical procedure is cross validation, which splits the data into k folds and uses k – 1 folds for training and the remaining fold for testing. This procedure is repeated, so that each fold has been used for testing exactly once. Test & Score will then report on the average accuracy of the model.

You can also use Random Sampling, which will split the data into two sets with predefined proportions (e.g. 66% : 34%), build a model on the first set and test it on the second. This is similar to CV, except that each data instance can be used more than once for testing.

Leave one out is again very similar to the above two methods, but it only takes one data instance for testing each time. If you have a 100 data instances, then 99 will be used for training and 1 for testing, and the procedure will be repeated a 100 times until every data instance was used once for testing. As you can imagine, this is a very time-intensive procedure and it is recommended for smaller data sets only.

Test on train data uses the whole data set for training and again the same data for testing. Because of overfitting, this will usually overestimate the performance! Test on test data requires an additional data input (Test Data) and allows the user to control both data sets (training and testing) used for evaluation.

Finally, you can also use cross validation by feature. Sometimes, you would have pre-defined folds for a procedure, that you wish to replicate. Then you can use Cross validation by feature to ensure data instances are split into the same folds every time. Just make sure the feature you are using for defining folds is a categorical variable and located in meta attributes.

Another scenario is when you have several examples from the same object, for example several measurements of the same patient or several images of the same plant. Then you absolutely want to make sure that all data instances for a particular object are in the same fold. Otherwise, your model would probably report severely overfitted scores.

How to Abuse p-Values in Correlations

In a parallel universe, not so far from ours, Orange’s Correlation widget looks like this.

Quite similar to ours, except that this one shows p-values instead of correlation coefficients. Which is actually better, isn’t it? I mean, we have all attended Statistics 101, and we know that you can never trust correlation coefficients without looking at p-values to check that these correlations are real, right? So why on Earth doesn’t Orange show them?

First a side note. It was Christmas not long ago. Let’s call a ceasefire on the frequentist vs. Bayesian war. Let us, for Christ’s sake, pretend, pardon, agree that null-hypothesis testing is not wrong per se.

The mantra of null-hypothesis significance testing goes like this:

1. Form hypothesis.
2. Collect data.
3. Test hypothesis.

In contrast, the parallel-universe Correlation widget is (ab)used like this:

1. Collect data.
2. Test all possible hypotheses.
3. Cherry pick those that are confirmed.

This is like the Texas sharpshooter who fires first and then draws targets around the shots. You should never formulate hypothesis based on some data and then use this same data to prove it. Because it usually (surprise!) works.

Illustration by Dirk-Jan Hoek (CC-BY).

 

Back to the above snapshot. It shows correlations between 100 vegetables based on 100 different measurements (Ca and Mg content, their consumption in Finland, number of mentions in Star Trek DS9 series, likelihood of finding it on the Mars, and so forth). In other words, it’s all made it up. Just a 100×100 matrix of random numbers with column labels from the simple Wikipedia list of vegetables. Yet the similarity between mung bean and sunchokes surely cannot be dismissed (p < 0.001). Those who like bell pepper should try cilantro, too, because it’s basically one and the same thing (p = 0.001). And I honestly can’t tell black bean from wasabi (p = 0.001).

Here are the p-values for the top 100 most correlated pairs.

import numpy as np
import scipy as sp
a = np.random.random((100, 100))
sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))[:100]
[0.0002774329730584203, 0.0004158786523819104, 0.0005008536192579852,
0.0007211022164265075, 0.0008268675086438253, 0.0010265740674904762,
(...91 values omitted to reduce the nonsense)
0.01844720610938738, 0.018465602922746942, 0.018662079618069056]

First 100 correlations are highly significant.

To learn a lesson we may have failed to grasp at the NHST 101 class, consider that there are 100 * 99 / 2 pairs. What is the significance of the pair at 5-th percentile?

correlations = sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))
npairs = 100 * 99 / 2
print(correlations[int(pairs * 0.05)]
0.0496868751692227

Roughly 0.05. This is exactly what should have happened, because:

correlations[int(npairs * 0.10)]
0.10004180592217532
correlations[int(npairs * 0.15)]
0.15236602574520097
correlations[int(npairs * 0.30)]
0.3026816170584785

This proves only that p-values for the Pearson correlation coefficient are well calibrated (and that Mersenne twister that is used to generate random numbers in numpy works well). In theory, the p-value for a certain statistics (like Pearson’s r) is the probability of getting such or even more extreme value if the null-hypothesis (of no correlation, in our case) is actually true. 5 % of random hypotheses should have a p-value below 0.05, 10 % a value below 10, and 23 % a value below 23.

Imagine what they can do with the Correlations widget in the parallel universe! They compute correlations between all pairs, print out the first 5 % of them and start writing a paper without bothering to look at p-values at all. They know they should be statistically significant even if the data is random.

Which is precisely the reason why our widget must not compute p-values: because people would use it for Texas sharpshooting. P-values make sense only in the context of the proper NHST procedure (still pretending for the sake of Christmas ceasefire). They cannot be computed using the data on which they were found.

If so, why do we have the Correlation widget at all if it’s results are unpublishable? We can use it to find highly correlated pairs in a data sample. But we can’t just attach p-values to them and publish them. By finding these pairs (with assistance of Correlation widget) we just formulate hypotheses. This is only step 1 of the enshrined NHST procedure. We can’t skip the other two: the next step is to collect some new data (existing data won’t do!) and then use it to test the hypotheses (step 3).

Following this procedure doesn’t save us from data dredging. There are still plenty of ways to cheat. It is the most tempting to select the first 100 most correlated pairs (or, actually, any 100 pairs), (re)compute correlations on some new data and publish the top 5 % of these pairs. The official solution for this is a patchwork of various corrections for multiple hypotheses testing, but… Well, they don’t work, but we should say no more here. You know, Christmas ceasefire.

Scatter Plots: the Tour

Scatter plots are surely one of the best loved visualizations in Orange. Very often, when we teach, people go back to scatter plots over and over again to see their data. We took people’s love for scatter plots to the heart and we redesigned them a bit to make them even more friendly.

Our favorite still remains the Informative Projections button. This button helps you find interesting visualizations from all the combinations of your data variables. But what does interesting mean? Well, let us look at an example. Which of the two visualizations tells you more about the data?

We’d say it is the right one. Why? Because now we know that the combination of petal length and petal width nicely separates the classes!

Of course, Informative Projections button will only work when you have set a class (target) variable.

In scatter plot, you can set also the color of the data points (class is selected by default), the size of the points and the shape. This means you can add three new layers of information to your data, but we warn you not to overuse them. This usually looks very incomprehensible, even though it packs a lot of information.

You might notice, that in the current version of Orange, you can no longer select discrete attributes in Scatter Plot. This is entirely intentional. Scatter plots are best at showing the relationship between two numeric variables, such as in the two examples above. Categorical variables are much better represented with Box Plots, histograms (in Distributions) or in Mosaic Display.

   

Above, we have presented the same information for titanic data set in different visualizations, that are particularly suitable for categorical variables.

Scatter plot also enables so cool tricks. Just like in most visualizations in Orange, I can select a part of the data and observe the subset downstream. Or the other way around. I have a particular subset I wish to observe and I can pass it to Scatter Plot widget, which will highlight selected data instances.

This is also true for all other point-based visualizations in Orange, namely t-SNE, MDS, Radviz, Freeviz, and Linear Projection.

You can see there are many great thing you can do with Scatter Plot. Finally, we have added a nice touch to the visualization.

Yes, setting the size of the attribute is now animated! 🙂

Happy holidays, everyone!

Orange is Getting Smarter

In the past few months, Orange has been getting smarter and sleeker.

Since version 3.15.0, Orange remembers which distinct widgets users like to connect, adjusting the sorting on the widget search menu accordingly. Additionally, there is a new look for the Edit Links window coming soon.

Orange recently implemented a basic form of opt-in usage tracking, specifically targeting how users add widgets to the canvas.

Word cloud of widget popularity in Orange.

 

The information is collected anonymously for the users that opted-in. We will use this data to improve the widget suggestion system. Furthermore, the data provides us the first insight into how users interact with Orange. Let’s see what we’ve found out from the data recorded in the past few weeks.

 

There are four different ways of adding a widget to the canvas,

  • clicking it in the sidebar,
  • dragging it from the sidebar,
  • searching for it by right-clicking on canvas,
  • extending the workflow by dragging the communication channel from a widget.

 

A workflow extend action.

 

Among Orange users, the most popular way of adding a new widget is by dragging the communication line from the output widget – we think this is the most efficient way of using Orange too. However, the patterns vary among different widgets.

How users add widgets to canvas, from 20,775 add widget events.

 

Users tend to add root nodes such as File via a click or drag from the sidebar, while adding leaf nodes such as Data Table via extension from another widget.

How users add File to canvas.

How users add Data Table to canvas.

 

The widget popularity contest goes to: Data Table! Rightfully so, one should always check their data with Data Table.

Widget popularity visualization in Box Plot.

 

52% of sessions tracked consisted of no widgets being added (the application just being opened and closed). While some people might really like watching the loading screen, most of these are likely due to the fact that usage is not tracked until the user explicitly opts in.

 

Each bit of collected data comes at a cost to the privacy of the user. Care was put into minimizing the intrusiveness of data collection methods, while maximizing the usefulness of the collected data.

Initially, widget addition events were planned to include a ‘time since application start’ value, in order to be able to plot a user’s actions as a function of time. While this would be cool, it was ultimately decided that its usefulness is outweighed by the privacy cost to users.

 

For the keen, data is gathered per canvas session, in the following structure:

  • Date
  • Orange version
  • Operating system
  • Widget addition events, each entailing:
    • Widget name
    • Type of addition (Click, Drag, Search or Extend)
    • (Other widget name), if type is Extend
    • (Query), if type is Search or Extend

Data Mining for Anthropologists?

This weekend we were in Lisbon, Portugal, at the Why the World Needs Anthropologists conference, an event that focuses on applied anthropology, design, and how soft skills can greatly benefit the industry. I was there to hold a workshop on Data Ethnography, an approach that tries to combine methods from data science and anthropology into a fruitful interdisciplinary mix!

Data Ethnography workshop at this year’s Why the World Needs Anthropologists conference.

 

Data ethnography is a novel methodological approach that tries to view social phenomena from two different points of view – qualitative and quantitative. The quantitative approach is using data mining and machine learning methods on anthropological data (say from sensors, wearables, social media, online fora, field notes and so on) trying to find interesting patterns and novel information. The qualitative approach uses ethnography to substantiate the analytical findings with context, motivations, values, and other external data to provide a complete account of the studied phenomenon.

At the workshop, I presented a couple of approaches I use in my own research, namely text mining, clustering, visualization of patterns, image analytics, and predictive modeling. Data ethnography can be used, not only in its native field of computational anthropology, but also in museology, digital anthropology, medical anthropology, and folkloristics (the list is probably not exhaustive). There are so many options just waiting for the researchers to dig in!

Related: Text Analysis Workshop at Digital Humanities 2017

However, having data- and tech-savvy anthropologists does not only benefit the research, but opens a platform for discussing the ethics of data science, human relationships with technology, and overcoming model bias. Hopefully, the workshop inspired some of the participants to join me on a journey through the amazing expanses of data science.

To get you inspired, here are two contributions that present some option for computational anthropological research: Data Mining Workspace Sensors: A New Approach to Anthropology and Power of Algorithms for Cultural Heritage Classification: The Case of Slovenian Hayracks.

 

Orange Now Speaks 50 Languages

In the past couple of weeks we have been working hard on introducing a better language support for the Text add-on. Until recently, Orange supported only a limited number of languages, mostly English and some bigger languages, such as Spanish, German, Arabic, Russian… Language support was most evident in the list of stopwords, normalization and POS tagging.

Related: Text Workshops in Ljubljana

Stopwords come from NLTK library, so we can only offer whatever is available there. However, TF-IDF already implicitly considers stopwords, so the functionality is already implemented. For POS tagging, we would rely on Stanford POS tagger, that already has pre-trained models available.

The main issue was with normalization. While English can do without lemmatization and stemming for simple tasks, morphologically rich languages, such as Slovenian, perform poorly on un-normalized tokens. Cases and declensions present a problem for natural language processing, so we wanted to provide a tool for normalization in many different languages. Luckily, we found UDPipe, a Czech initiative that offers trained lemmatization models for 50 languages. UDPipe is actually a preprocessing pipeline and we are already thinking about how to bring all of its functionality to Orange, but let us talk a bit about the recent improvements for normalization.

Let us load a simple corpus from Corpus widget, say grimm-tales-selected.tab that contain 44 tales from the Grimm Brothers. Now, pass them through Preprocess Text and keep just the defaults, namely lowercase transformation, tokenization by words, and removal of stopwords. Here we see that we have came as quite a frequent word and come as a bit less frequent. But semantically, they are the same word from the verb to come. Shouldn’t we consider them as one word?

Results without applying normalization.

 

We can. This is what normalization does – it transforms all words into their lemmas or basic grammatical form. Came and come will become come, sons and son will become son, pretty and prettier will become pretty. This will result in less tokens that capture the text better, semantically speaking.

Results of UDPipe normalization.

 

We can see that came became come with 435 counts. Went became go. Said became say. And so on. As we said, this doesn’t work only on verbs, but on all word forms.

One thing to note here. UDPipe has an internal tokenizer, that works with sentences instead of tokens. You can enable it by selecting UDPipe tokenizer option. What is the difference? A quicker version would be to tokenize all the words and just look up their lemma. But sometimes this can be wrong. Consider the sentence:

I am wearing a tie to work.

Now the word tie is obviously a piece of clothing, which is indicated by the word wearing before it. But tie alone can also be the verb to tie. So the UDPipe tokenizer will consider the entire sentence and correctly lemmatize this word, while lemmatization on regular tokens might not. While UDPipe works better, it is also slower, so you might want to work with regular tokenization to speed up the analysis.

In Preprocess Text, you turn on the Normalization button on the right, then select UDPipe Lemmatizer and select the language you wish to use. Finally, if you wish to go with the better albeit slower UDPipe tokenizer, tick the UDPipe tokenizer box.

 

Finally, UDPipe does not remove punctuation, so you might end up with words like rose. and away., with the full stop at the end. This you can fix with using regular tokenization and also by select the Regex option in Filtering, which will remove pure punctuation.

Final workflow, where we compared the results of no normalization and UDPipe normalization in a word cloud.

 

This is it. UDPipe contains lemmatization models for 50 languages and only when you click on a particular language in the Language option, will the resource be loaded, so your computer won’t be flooded with models for languages you won’t ever use. The installation of UDPipe could also be a little tricky, but after some initial obstacles, we have managed to prepare packages for both pip (OSX and Linux) and conda (Windows).

We hope you enjoy the new possibilities of a freshly multilingual Orange!