Data Mining Course at Higher School of Economics, Moscow

Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.

It was a real pleasure working at HSE. The students were proactive by asking questions and really challenged us to do our best.

One of the things we did was compute minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. Naive Bayes classifier, for example, returned probabilities of a patient being sick (column Naive Bayes 1). For each threshold in probabilites, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifiers returns. Each mistake is associated with a cost. Now she wants to find out, how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.

First, import all the libraries we will need:

import matplotlib.pyplot as plt
import numpy as np

from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation

Then load heart disease data (and print a sample).

heart = Table("heart_disease")
print(heart[:5])

Now, train classifiers and select probabilities of Naive Bayes for a patient being sick.

scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])

#take probabilites of class 1 (sick) of NaiveBayesLearner
p1 = scores.probabilities[0][:, 1]

#take actual class values
y = scores.actual

#cost of false positive (patient classified as sick when healthy)
fp_cost = 500

#cost of false negative (patient classified as healthy when sick)
fn_cost = 800

Set counts, where we declare 0 patients being sick (threshold >1).

fp = 0
#start with threshold above 1 (no one is sick)
fn = np.sum(y)

For each threshold, compute the cost associated with each type of mistake.

ps = []
costs = []

#compute costs of classifying i patients as sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        fp += 1
    else:
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)

In the end, we get a list of probability thresholds and associated costs. Now let us find the minimum cost and its probability of a patient being sick.

costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])

This means the threshold that minimizes our cost for a given classifier is 0.620655. Sara would send all the patients with a probability of being sick higher or equal than 0.620655  for further tests.

At the end, we can also plot the cost to patients sent curve.

fig, ax = plt.subplots()
plt.plot(ps, costs)
ax.set_xlabel('Patients sent')
ax.set_ylabel('Cost')

You can download the IPython Notebook here: Minimum Cost.

Understanding Voting Patterns at AKOS Workshop

Two days ago we held another Introduction to Data Mining workshop at our faculty. This time the target audience was a group of public sector professionals and our challenge was finding the right data set to explain key data mining concepts. Iris is fun, but not everyone is a biologist, right? Fortunately, we found this really nice data set with ballot counts from the Slovenian National Assembly (thanks to Parlameter).

Related: Intro to Data Mining for Life Scientists

Workshop for the Agency for Communication Networks and Services (AKOS).

 

The data contains ballot counts, statistics, and description for 84 members of the parliament (MPs). First, we inspected the data in a Data Table. Each MP is described with 14 meta features and has 18 ballot counts recorded.

Out data has 84 instances, 18 features (ballot counts) and 14 meta features (MP description).

 

We have some numerical features, which means we can also inspect the data in Scatter Plot. We will plot MPs’ attendance vs. the number of their initiatives. Quite interesting! There is a big group of MPs who regularly attend the sessions, but rarely propose changes. Could this be the coalition?

Scatter plot of MPs’ session attendance (in percentage) and the number of initiatives. Already an interesting pattern emerges.

 

The next question that springs to our mind is – can we discover interesting voting patterns from our data? Let us see. We first explored the data in Hierarchical Clustering. Looks like there are some nice clusters in our data! The blue cluster is the coalition, red the SDS party and green the rest (both from the opposition).

Related: Hierarchical Clustering: A Simple Explanation

Hierarchical Clustering visualizes a hierarchy of clusters. But it is hard to observe similarity of pairs of data instances. How similar are Luka Mesec and Branko Grims? It is hard to tell…

 

But it is hard to inspect so many data instances in a dendrogram. For example, we have no idea how similar are the voting records of Eva Irgl and Alenka Bratušek. Surely, there must be a better way to explore similarities and perhaps verify that voting patterns exist at even a party-level… Let us try MDS. MDS transforms multidimensional data into a 2D projection so that similar data instances lie close to each other.

MDS can plot a multidimensional data in 2D so that similar data points lie close to each other. But sometimes this optimization is hard. This is why we have grey lines connecting the dots – the dots connected are similar at the selected cut-off level (Show similar pairs slider).

 

Ah, this is nice! We even colored data points by the party. MDS beautifully shows the coalition (blue dots) and the opposition (all other colors). Even parties are clustered together. But there are some outliers. Let us inspect Matej Tonin, who is quite far away from his orange group. Seems like he was missing at the last two sessions and did not vote. Hence his voting is treated differently.

Data Table is a handy tool for instant data inspection. It is always great to check, what is on the output of each widget.

 

It is always great to inspect discovered groups and outliers. This way an expert can interpret the clusters and also explain, what outliers mean. Sometimes it is simply a matter of data (missing values), but sometimes we could find shifting alliances. Perhaps an outlier could be an MP about to switch to another party.

The final workflow.

 

You can have fun with these data, too. Let us know if you discover something interesting!

 

Top 100 Changemakers in Central and Eastern Europe

Recently Orange and one of its inventors, Blaž Zupan, have been recognized as one of the top 100 changemakers in the region. A 2016 New Europe 100 is an annual list of innovators and entrepreneurs in Central and Eastern Europe highlighting novel approaches to pressing problems.

Orange has been recognized for making data more approachable, which has been our goal from the get-go. The tool is continually being developed with the end user in mind – someone who wants to analyze his/her data quickly, visually, interactively, and efficiently. We’re always thinking hard how to expose valuable information in the data, how to improve the user experience, which defaults are the most appropriate for the method, and, finally, how to intuitively teach people about data mining.

This nomination is a great validation of our efforts and it only makes us work harder. Because every research should be fruitful and fun!

adv_data_mining-02