Data Mining Course at Higher School of Economics, Moscow

Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.

It was a real pleasure working at HSE. The students were proactive by asking questions and really challenged us to do our best.

One of the things we did was compute minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. Naive Bayes classifier, for example, returned probabilities of a patient being sick (column Naive Bayes 1). For each threshold in probabilites, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifiers returns. Each mistake is associated with a cost. Now she wants to find out, how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.

First, import all the libraries we will need:

import matplotlib.pyplot as plt
import numpy as np

from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation

Then load heart disease data (and print a sample).

heart = Table("heart_disease")
print(heart[:5])

Now, train classifiers and select probabilities of Naive Bayes for a patient being sick.

scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])

#take probabilites of class 1 (sick) of NaiveBayesLearner
p1 = scores.probabilities[0][:, 1]

#take actual class values
y = scores.actual

#cost of false positive (patient classified as sick when healthy)
fp_cost = 500

#cost of false negative (patient classified as healthy when sick)
fn_cost = 800

Set counts, where we declare 0 patients being sick (threshold >1).

fp = 0
#start with threshold above 1 (no one is sick)
fn = np.sum(y)

For each threshold, compute the cost associated with each type of mistake.

ps = []
costs = []

#compute costs of classifying i patients as sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        fp += 1
    else:
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)

In the end, we get a list of probability thresholds and associated costs. Now let us find the minimum cost and its probability of a patient being sick.

costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])

This means the threshold that minimizes our cost for a given classifier is 0.620655. Sara would send all the patients with a probability of being sick higher or equal than 0.620655  for further tests.

At the end, we can also plot the cost to patients sent curve.

fig, ax = plt.subplots()
plt.plot(ps, costs)
ax.set_xlabel('Patients sent')
ax.set_ylabel('Cost')

You can download the IPython Notebook here: Minimum Cost.

Installing Add-ons Works Again

Dear Orange users,

Some of you might have an issue installing add-ons with the following issue popping up:

xmlrpc.client.Fault: <Fault -32601: 'server error; requested method not found'>

This is the result of the migration to a new infrastructure at PyPi, which provides the installation of add-ons. Our team has rallied to adjust the add-on installer so it works with the new and improved service.

In order to make the add-on installer work (again), please download the latest version of Orange (3.13.0).

We apologize for any inconvenience and wish you a fruitful data analysis in the future.

Yours truly,

Orange team

Unfreezing Orange

Have you ever tried Orange with data big enough that some widgets ran for more than a second? Then you have seen it: Orange froze. While the widget was processing, the interface would not respond to any inputs, and there was no way to stop that widget.

Not all the widgets freeze, though! Some widgets, like Test & Score, k-Means, or Image Embedding, do not block. While they are working, we are free to build other parts of the workflow, and these widgets also show their progress. Some, like Image Embedding, which work with lots of images, even allow interruptions.

Why does Orange freeze? Most widgets process users’ actions directly: after an event (click, pressed key, new input data) some code starts running: until it finishes, the interface can not respond to any new events. This is a reasonable approach for short tasks, such as making a selection in a Scatter Plot. But with longer tasks, such as building a Support Vector Model on big data, Orange gets unresponsive.

To make Orange responsive while it is processing, we need to start the task in a new thread. As programmers we have to consider the following:
1. Starting the task. We have to make sure that other (older) tasks are not running.
2. Showing results when the task has finished.
3. Periodic communication between the task and the interface for status reports (progress bars) and task stopping.

Starting the task and showing the results are straightforward and well documented in a tutorial for writing widgets. Periodic communication with stopping is harder: it is completely task-dependent and can be either trivial, hard, or even impossible. Periodic communication is, in principle, unessential for responsiveness, but if we do not implement it, we will be unable to stop the running task and progress bars would not work either.

Taking care of periodic communication was the hardest part of making the Neural Network widget responsive. It would have been easy, had we implemented neural networks ourselves. But we use the scikit-learn implementation, which does not expose an option to make additional function calls while fitting the network (we need to run code that communicates with the interface). We had to resort to a trick: we modified fitting so that a change to an attribute called n_iters_ called a function (see pull request). Not the cleanest solution, but it seems to work.

For now, only a few widgets work so that the interface remains responsive. We are still searching for the best way to make existing widgets behave nicely, but responsiveness is now one of our priorities.

Orange with Spectroscopy Add-on Workshop

We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. Spectroscopy add-on is intended for the analysis of spectral data and it is just as fun as our other add-ons (if not more!).

We will prove it with a simple classification workflow. First, install Spectroscopy add-on from Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!

Use Datasets widget and load Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of a cell. A quick glance in a Data Table will give us an idea how the data looks like. Seems like a very standard spectral data set.

Collagen data set from Datasets widget.

 

Now we want to determine, whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing. Spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying Cut (keep) filter and retaining only the wave numbers between 1500 and 1800. When we connect it to Test & Score, we need to keep in mind to connect the Preprocessor output of Preprocess Spectra.

Preprocessor that keeps a part of the spectra cut between 1500 and 1800. No data is shown here, since we are using only the preprocessing procedure as the input for Test & Score.

 

Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?

10-fold cross-validation on spectral data. Our AUC and CA scores are quite impressive.

 

Confusion Matrix gives us a detailed picture. Our model fails almost exclusively on DNA cell type. Interesting.

Confusion Matrix shows DNA is most often misclassified. By selecting the misclassified instances in the matrix, we can inspect why Logistic Regression couldn’t model these spectra

 

We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?

Misclassified DNA spectra colored by the prediction made by Logistic Regression.

 

This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, learner (LR) as learner and preprocessor as preprocessor directly to Test & Score to avoid overfitting. Play around with Spectroscopy add-on and let us know what you think! 🙂

Single cell analytics workshop at HHMI | Janelia

HHMI | Janelia is one of the prettiest researcher campuses I have ever visited. Located in Ashburn, VA, about 20 minutes from Washington Dulles airport, it is conveniently located yet, in a way, secluded from the buzz of the capital. We adored the guest house with a view of the lake, tasty Janelia-style breakfast (hash-browns with two eggs and sausage, plus a bagel with cream cheese) in the on-campus pub, beautifully-designed interiors to foster collaborations and interactions, and late-evening discussions in the in-house pub.

All these thanks to the invitation of Andrew Lemire, a manager of a shared high-throughput genomics resource, and Dr. Vilas Menon, a mathematician specializing in quantitative genomics. With Andy and Vilas, we have been collaborating in the past few months on trying to devise a simple and intuitive tool for analysis of single-cell gene expression data. Single cell high-throughput technology is one of the latest approaches that allow us to see what is happening within a single cell, and it does that by simultaneously scanning through potentially thousands of cells. That generates loads of data, and apparently, we have been trying to fit Orange for single-cell data analysis task.

Namely, in the past half a year, we have been perfecting an add-on for Orange with components for single-cell analysis. This endeavor became so vital that we have even designed a new installation of Orange, called scOrange. With everything still in prototype stage, we had enough courage to present the tool at Janelia, first through a seminar, and the next day within a five-hour lecture that I gave together with Martin Strazar, a PhD student and bioinformatics expert from my lab. Many labs are embarking on single cell technology at Janelia, and by the crowd that gathered at both events, it looks like that everyone was there.

Orange, or rather, scOrange, worked as expected, and hands-on workshop was smooth, despite testing the software on some rather large data sets. Our Orange add-on for single-cell analytics is still in early stage of development, but already has some advanced features like biomarker discovery and tools for characterization of cell clusters that may help in revealing hidden relations between genes and phenotypes. Thanks to Andy and Vilas, Janelia proved an excellent proving ground for scOrange, and we are looking forward to our next hands-on single-cell analytics workshop in Houston.

Related: Hands-On Data Mining Course in Houston

How to enable SQL widget in Orange

A lot of you have been interested in enabling SQL widget in Orange, especially regarding the installation of a psycopg backend that makes the widget actually work. This post will be slightly more technical, but I will try to keep it to a minimum. Scroll to the bottom for installation instructions.

Related: SQL for Orange

Why won’t Orange recognize psycopg?

The main issue for some people was that despite having installed the psycopg module in their console, the SQL widget still didn’t work. This is because Orange uses a separate virtual environment and most of you installed psycopg in the default (system) Python environment. For psycopg to be recognized in Orange, it needs to be installed in the same virtual environment, which is normally located in C:\Users\<usr>\Anaconda3\envs\orange3 (on Windows). For the installation to work, you’d have to run it with the proper pip, namely:

C:\Users\<usr>\Anaconda3\envs\orange3\Scripts\pip.exe install psycopg2

Installation instructions

But there is a much easier way to do it. Head over to psycopg’s pip website and download the latest wheel for your platform. Py version has to be cp34 or higher (latest Orange from Anaconda comes with Python 3.6, so look for cp36).

For OSX, you would for example need: psycopg2-2.7.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl

For 64-bit Windows: psycopg2-2.7.4-cp36-cp36m-win_amd64.whl

And for Linux: psycopg2-2.7.4-cp36-cp36m-manylinux1_x86_64.whl

Then open the add-on dialog in Orange (Options –> Add-ons) and drag and drop the downloaded wheel into the add-on list. At the bottom, you will see psycopg2 with the tick next to it.

Click OK to run the installation. Then re-start Orange and connect to your database with SQL widget. If you have any questions, drop them in the comment section!

Image Analytics Workshop at AIUCD 2018

This week, Primož and I flew to the south of Italy to hold a workshop on Image Analytics through Data Mining at AIUCD 2018 conference. The workshop was intended to familiarize digital humanities researchers with options that visual programming environments offer for image analysis.

In about 5 hours we discussed image embedding, clustering, finding closest neighbors and classification of images. While it is often a challenge to explain complex concepts in such a short time, it is much easier when working with Orange.

Related: Image Analytics: Clustering

One of the workflows we learned at the workshop was the one for finding the most similar image in a set of images. This is better explained with an example.

We had 15 paintings from different authors. Two of them were painted by Claude Monet, a famous French impressionist painter. Our task was, given a reference image of Monet, to find his other painting in a collection.

A collection of images. It includes two Monet paintings.

First, we loaded our data set with Import Images. Then we sent our images to Image Embedding. We selected Painters embedder since it was specifically trained to recognize authors of paintings.

We used Painters embedder here.

Once we have described our paintings with vectors (embeddings), we can compare them by similarity. To find the second Monet in a data set, we will have to compute the similarity of paintings and find the one most similar one to our reference painting.

Related: Video on image clustering

Let us connect Image Embedding to Neighbors from Prototypes add-on. Neighbors widget is specifically intended to find a number of closest neighbors given a reference data point.

We will need to adjust the widget a bit. First, we will need cosine distance, since we will be comparing images by the content, not the magnitude of features. Next, we will tick off Exclude reference, in order to receive the reference image on the output. We do this just for visualization purposes. Finally, we set the number of neighbors to 2. Again, this is just for a nicer visualization, since we know there are only two Monet’s paintings in the data set.

Neighbors was set to provide a nice visualization. Hence we ticked off Exclude references and set Neighbors to 2.

Then we need to give Neighbors a reference image, for which we want to retrieve the neighbors. We do this by adding Data Table to Image Embedding, selecting one of Monet’s paintings in the spreadsheet and then connecting the Data Table to Neighbors. The widget will automatically consider the second input as a reference.

Monet.jpg is our reference painting. We select it in Data Table.

Now, all we need to do is to visualize the output. Connect Image Viewer to Neighbors and open it.

Voila! The widget has indeed found the second Monet’s painting. So useful when you have thousands of images in your archive!

Visualizing multiple variables: FreeViz

Scatter plots are great! But sometimes, we need to plot more than two variables to truly understand the data. How can we achieve this, knowing humans can only grasp up to three dimensions? With an optimization of linear projection, of course!

Orange recently re-introduced FreeViz, an interactive visualization for plotting multiple variables on a 2-D plane.

Let’s load zoo.tab data with File widget and connect FreeViz to it. Zoo data has 16 features describing animals of different types – mammals, amphibians, insects and so on. We would like to use FreeViz to show us informative features and create a visualization that separates well between animal types.

FreeViz with initial, un-optimized plot.

We start with un-optimized projection, where data points are scattered around features axes. Once we click Optimize, we can observe optimization process in real-time and at the end see the optimized projection.

FreeViz with optimized projection.

This projection is much more informative. Mammals are nicely grouped together within a pink cluster that is characterized by hair, milk, and toothed features. Conversely, birds are charaterized by eggs, feathers and airborne, while fish are aquatic. Results are as expected, which means optimization indeed found informative features for each class value.

FreeViz with Show class density option.

Since we are working with categorical class values, we can tick Show class density to color the plot by majority class values. We can also move anchors around to see how data points change in relation to a selected anchor.

Finally, as in most Orange visualizations, we can select a subset of data points and explore them further. For example, let us observe which amphibians are characterized by being aquatic in a Data Table. A newt, a toad and two types of frogs, one venomous and one not.

Data exploration is always much easier with clever visualizations!

Stack Everything!

We all know that sometimes many is better than few. Therefore we are happy to introduce the Stack widget. It is available in Prototypes add-on for now.

Stacking enables you to combine several trained models into one meta model and use it in Test&Score just like any other model. This comes in handy with complex problems, where one classifier might fail, but many could come up with something that works. Let’s see an example.

We start with something as complex as this. We used Paint Data to create a complex data set, where classes somewhat overlap. This is naturally an artificial example, but you can try the same on your own, real life data.

We used 4 classes and painted a complex, 2-dimensional data set.

 

Then we add several kNN models with different parameters, say 5, 10 and 15 neighbors. We connect them to Test&Score and use cross validation to evaluate their performance. Not bad, but can we do even better?

Scores without staking, using only 3 different kNN classifiers.

 

Let us try stacking. We will connect all three classifiers to the Stacking widget and use Logistic Regression as an aggregate, a method that aggregates the three models into a single meta model. Then we connect connect the stacked model into Test&Score and see whether our scores improved.

Scores with stacking. Stack reports on improved performance.

 

And indeed they have. It might not be anything dramatic, but in real life, say medical context, even small improvements count. Now go and try the procedure on your own data. In Orange, this requires only a couple of minutes.

Final workflow with channel names. Notice that Logistic Regression is used as Aggregate, not a Learner.