Business Case Studies with Orange

Last week Blaž, Robert and I visited Wärtsilä in lovely Dolina near Trieste, Italy. Wärtsilä is one of the leading designers of lifecycle power solutions for the global marine and energy markets, and its subsidiary in Trieste is one of the largest engine production plants in the Wärtsilä Group. We were there to hold a one-day workshop on data mining and machine learning, with the aim of identifying relevant business use cases and showing how to address them.

Related: Data Mining for Business and Public Administration

One such important use case is employee attrition. It is vital for any company to retain its most valuable workers, so companies must learn how to identify dissatisfied employees and provide incentives for them to stay. It is easy to construct a workflow in Orange that helps us with this.

First, let us load the Attrition – Train data set from the Datasets widget. This is a synthetic data set from IBM Watson with 1470 instances (employees) and 18 features describing them. Our target variable is Attrition, where Yes means the person left the company and No means they stayed.

Now our goal is to construct a predictive model that will successfully predict the likelihood of a person leaving. Let us connect a couple of classifiers and the data set to Test and Score and see which model performs best.

Seems like Logistic Regression is the winner here, since its AUC score is the highest of the three.
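For readers who prefer scripting to the canvas, a rough analogue of this Test and Score comparison is sketched below. The file name attrition-train.tab is a placeholder for a local copy of the data, the Tree and Naive Bayes learners stand in for whichever other classifiers you connect, and exact call signatures vary slightly between Orange versions.

```python
# Sketch of the Test and Score step in Orange's scripting API.
# "attrition-train.tab" is a placeholder for a local copy of the
# Attrition - Train data set that the Datasets widget fetches in the GUI.
import Orange

data = Orange.data.Table("attrition-train.tab")

learners = [
    Orange.classification.LogisticRegressionLearner(),
    Orange.classification.TreeLearner(),
    Orange.classification.NaiveBayesLearner(),
]

# 10-fold cross-validation, as in the Test and Score widget
results = Orange.evaluation.CrossValidation(data, learners, k=10)

for learner, auc in zip(learners, Orange.evaluation.AUC(results)):
    print(f"{learner.name}: AUC = {auc:.3f}")
```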

A great thing about Logistic Regression is that it is interpretable. We can connect the data from Datasets to Logistic Regression and the resulting model from Logistic Regression to Nomogram. The Nomogram shows the top ten features, ranked by their contribution to the final probability of the class.

The length of a line corresponds to the relative importance of the attribute. Seems like recently hired employees are more likely to leave (YearsAtCompany goes towards 0). We should also consider promoting those who haven’t been promoted in a while (YearsSinceLastPromotion goes towards 15) and cutting the overtime (OverTime is Yes). Model inspection helps us identify relevant attributes and interpret their values. So useful for HR departments!
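The Nomogram itself is a GUI widget, but if you are scripting, a rough stand-in is to rank features by the magnitude of the logistic regression coefficients. The sketch below assumes the same placeholder file name and that the classifier exposes a coefficients array, as current Orange versions do; note that the coefficients live in the model's preprocessed (continuized) domain, so categorical features appear as indicator columns.

```python
# Rough scripting stand-in for the Nomogram: rank features by the size of
# the logistic regression coefficients. "attrition-train.tab" is a
# placeholder for a local copy of the training data.
import Orange

data = Orange.data.Table("attrition-train.tab")
model = Orange.classification.LogisticRegressionLearner()(data)

# the model's domain is continuized, so categorical features such as
# OverTime show up as indicator columns (e.g. OverTime=Yes)
names = [attr.name for attr in model.domain.attributes]
coefs = model.coefficients[0]  # a 1 x n_features array for a binary target

for name, coef in sorted(zip(names, coefs), key=lambda nc: -abs(nc[1]))[:10]:
    print(f"{name:30s} {coef:+.3f}")
```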

Finally, we can take new data and predict the likelihood of leaving. Put another Datasets widget on the canvas and load the Attrition – Predict data set. This one contains only three instances – say, the data for three employees we forgot to include in our training data.

So who is most likely to leave? We obviously cannot afford to promote everyone, because that costs money. We need to optimize our decisions so that we increase employee satisfaction while keeping costs low. This is where predictive modeling comes in. Connect Logistic Regression to the Predictions widget, then connect the second Datasets widget with the new data to Predictions as well.
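In scripting, the same Predictions step might look like the sketch below. Both file names are placeholders for the two data sets loaded through the Datasets widget, and the probability-output call may differ slightly between Orange versions.

```python
# Sketch of the Predictions step: train on the first data set, then ask the
# model for attrition probabilities on the new employees. Both file names
# are placeholders for the data sets loaded in the GUI.
import Orange

train = Orange.data.Table("attrition-train.tab")
new = Orange.data.Table("attrition-predict.tab")

model = Orange.classification.LogisticRegressionLearner()(train)

# ask for class probabilities rather than hard labels
probs = model(new, model.Probs)
yes = list(train.domain.class_var.values).index("Yes")

for i, p in enumerate(probs, start=1):
    print(f"employee {i}: P(Attrition=Yes) = {p[yes]:.2f}")
```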

Seems like John is most likely to leave. He has been at the company for only a year and he works overtime.

This is something the HR department can work with to design proper policies and retain top talent. The same workflow can be used for churn prediction, process optimization, and predicting the success of a new product.

Data Mining for Business and Public Administration

We’ve been having a blast with recent Orange workshops. While Blaž was getting tanned in India, Anže and I went to charming Liverpool to hold a session for business school professors on how to teach business with Orange.

Related: Orange in Kolkata, India

Obviously, when we say teach business, we mean how to do data mining for business: say, predicting churn or employee attrition, segmenting customers, finding which items to recommend in an online store, and tracking brand sentiment with text analysis.

For this purpose, we have made some updates to our Associate add-on and added a new data set to the Datasets widget, which can be used for customer segmentation and for discovering which groups of items are frequently bought together. Like this:

We load the Online Retail data set.

Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute distances between items (rows). Alternatively, we could simply ask the Distances widget to compute distances between columns instead of rows. We then send the transposed data table to Distances and compute the cosine distance between items (cosine distance only tells us which items are purchased together, disregarding the quantities purchased).

Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration. 🙂
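If you want to reproduce this part of the workflow in a script, a sketch along the lines below could stand in for the Transpose, Distances and Hierarchical Clustering widgets. The file name online-retail.tab is a placeholder, the distances are computed between columns (the shortcut mentioned above) rather than by transposing, scipy's hierarchical clustering stands in for the widget, and the number of clusters is an arbitrary choice.

```python
# Sketch of the item-clustering workflow: cosine distances between items
# (columns of the transaction table), followed by hierarchical clustering.
# "online-retail.tab" is a placeholder for the Online Retail data set.
import numpy as np
import Orange
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

data = Orange.data.Table("online-retail.tab")

# cosine distance between columns (axis=0): which items co-occur in baskets,
# regardless of the quantities purchased
dist = Orange.distance.Cosine(data, axis=0)

# average-linkage clustering on the distance matrix; scipy stands in for
# the Hierarchical Clustering widget here
Z = linkage(squareform(np.asarray(dist), checks=False), method="average")
labels = fcluster(Z, t=10, criterion="maxclust")  # 10 clusters, arbitrarily

items = [attr.name for attr in data.domain.attributes]
for cluster in sorted(set(labels)):
    members = [item for item, lab in zip(items, labels) if lab == cluster]
    print(f"cluster {cluster}: {', '.join(members[:5])}")
```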

The second workshop was our standard Introduction to Data Mining for the Ministry of Public Affairs.

Related: Analyzing Surveys

This group, much like the one from India, was a pack of curious individuals who asked many interesting questions and were not shy about challenging us. How does a Tree know which attribute to split by? Is a Tree better than Naive Bayes? Or is Logistic Regression perhaps better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!

Workshops are always fun when you have a curious group of individuals who demand answers! 🙂