Girls Go Data Mining

This week we held our first Girls Go Data Mining workshop. The workshop brought together curious women and intuitively introduced them to essential data mining and machine learning concepts. Of course, we used Orange to explore visualizations, build predictive models, perform clustering and dive into text analysis. The workshop was supported by NumFocus through their small development grant initiative and we hope to repeat it next year with even more ladies attending!

Related: Text Analysis for Social Scientists

In two days, we covered many topics. On day one, we got to know Orange and the concept of visual programming, where the user construct analytical workflow by stacking visual components. Then we got to know several useful visualizations, such as box plot, scatter plot, distributions, and mosaic display, which give us an initial overview of the data and the potentially interesting patterns. Finally, we got our hands dirty with predictive modeling. We learnt about decision trees, logistic regression, and naive Bayes classifiers, and observed the models in tree viewer and nomogram. It is great having interpretable models and we had great fun exploring what is in the model!

On the second day, we tried to uncover groups in our data with clustering. First, we tried hierarchical clustering and explored the discovered clusters with box plot. Then we also tried k-means and learnt, why this method is better than hierarchical clustering. In the final part, we talked about the methods for text mining, how to do preprocessing, construct a bag of words and perform the machine learning on corpora. We used both clustering and classification and tried to find interesting information about Grimm tales.

One of our workflows, where we explored the data in many different ways, including inspecting misclassifications in a scatter plot!

 

One thing that always comes up as really useful in our workshops is Orange’s ability to output different types of data. For example, in Hierarchical Clustering, we can select the similarity cutoff at the top and output clusters. Our data table will have an additional column Cluster, with cluster labels for each data instance.

 

Hierarchial Clustering outputs data with an additional Cluster column.

 

We can explore clusters by connecting a Box Plot to Hierarchical Clustering, selecting Cluster in Subgroups and using Order by relevance option. This sorts the variables in Box Plot by how well they separate between clusters or, in other words, what is typical of each cluster.

We have selected Cluster in Subgroups section and ticked ‘Order by relevance’ to sort the variables. Variables at the top are the most interesting ones. Looks like giving milk is an exclusive property of cluster C1.

 

We used zoo.tab and made the cutoff at three clusters. It looks like the first cluster gives milk. Could these be a cluster of mammals?

We said giving milk is a property of cluster C1. By selecting type as our variable, we can see that C1 is a cluster of mammals.

 

Indeed it is!

Another option is to select a specific cluster in the dendrogram. Then, we have to rewire the connection between Hierarchical Clustering and Box Plot by setting it to Data. Data option will output the entire data set, with an extra column showing whether the data instance was selected or not. In our case, there would be a Yes if the instance is in the selected cluster and No if it is not.

To rewire the connection, double-click on it and drag a line from Data to Data.

 

We have selected one cluster in the dendrogram, rewired the connection to transmit Data (instead of Selected Data) and observed the results in a Data Table. We see an additional Selected column, which shows whether a data instance was selected in the visualization or not.

 

Then we can use Box Plot to observe what is particular for our selected cluster.

In this Box Plot we have used Selected in the Subgroups section and kept ‘Order by relevance’ on. The suggested distinctive feature of our selected cluster is having feathers.

 

It looks like animals from our selected cluster have feathers. Probably, this is a cluster of birds. We can check this with the same procedure as above.

In summary, most Orange visualizations have two outputs – Selected Data and Data. Selected Data will output a subset of data instances selected in the visualization (or selected clusters in the case of hierarchical clustering), while Data will output the entire data table with a column defining whether a data instance was selected or not. This is very useful if we want to inspect what is typical of an interesting group in our data, inspect clusters or even manually define groups.

Overall, this was another interesting workshop and we hope to continue our fruitful partnership with NumFocus and keep offering free educational events for beginners and experts alike!

From Surveys to Orange

Today we have finished a series of workshops for the Ministry of Public Affairs. This was a year-long cooperation and we had many students asking many different questions. There was however one that we talked about a lot. If I have a survey, how do I get it into Orange?

Related: Analyzing Surveys

We are using EnKlik Anketa service, which is a great Slovenian product offering a wide array of options for the creation of surveys. We have created one such simple survey to use as a test. I am now inside EnKlik Anketa online service and I can see my survey has been successfully filled out.

Now I have to create a public link to my survey in order to access the data in Orange. I have to click on an icon in the top right part and select ‘Public link’.

A new window opens, where I select ‘Add new public link’. This will generate a public connection to my survey results. But be careful, the type of the connection needs to be Data, not Analysis! Orange can’t read already analyzed data, it needs raw data from Data pane.

Now, all I have to do is open Orange, place EnKlik Anketa widget from the Prototypes add-on onto the canvas, enter the public link into the ‘Public link URL’ fields and press Enter. If your data has loaded successfully, the widget will display available variables and information in the Info pane.

From here on you can continue your analysis just like you would with any other data source!

Spectroscopy Workshop at BioSpec and How to Merge Data

Last week Marko and I visited the land of the midnight sun – Norway! We held a two-day workshop on spectroscopy data analysis in Orange at the Norwegian University of Life Sciences. The students from BioSpec lab were yet again incredible and we really dug deep into Orange.

Related: Orange with Spectroscopy Add-on

A class full of dedicated scientists.

 

One thing we did was see how to join data from two different sources. It would often happen that you have measurements in one file and the labels in the other. Or in our case, we wanted to add images to our zoo.tab data. First, find the zoo.tab in the File widget under Browse documentation datasets. Observe the data in the Data Table.

Original zoo data set.

 

This data contains 101 animal described with 16 different features (hair, aquatic, eggs, etc.), a name and a type. Now we will manually create the second table in Excel. The first column will contain the names of the animals as they appear in the original file. The second column will contain links to images of animals. Open your favorite browser and find a couple of images corresponding to selected animals. Then add links to images below the image column. Just like that:

Extra data that we want to add to the original data.

 

Remember, you need a three-row header to define the column that contains images. Under the image column add string in the second and type=image in the third row. This will tell Orange where to look for images. Now, we can check our animals in Image Viewer.

A quick glance at an Image Viewer will tell us whether our images got loaded correctly.

 

Finally, it is time to bring in the images to the existing zoo data set. Connect the original File to Merge Data. Then add the second file with animal images to Merge Data. The default merging method will take the first data input as original data and the second data as extra data. The column to match by is defined in the widget. In our case, it is the name column. This means Orange will look at the first name column and find matching instances in the second name column.

 

A quick look at the merged data shows us an additional image column that we appended to the original file.

Merged data with a new column.

 

This is the final workflow. Merge Data now contains a single data table on the output and you can continue your analysis from there.

Find out more about spectroscopy for Orange on our YouTube channel or contribute to the project on Github.

Python Script: Managing Data on the Fly

Python Script is this mysterious widget most people don’t know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer. And it’s time we unveil some of its functionalities with a simple example.

Example: Batch Transform the Data

There might be a time when you need to apply a function to all your attributes. Say you wish to log-transform their values, as it is common in gene expression data. In theory, you could do this with Feature Constructor, where you would log-transform every attribute individually. Sounds laborious? It’s because it is. Why else we have computers if not to reduce manual labor for certain tasks? Let’s do it the fast way – with Python Script.

First, open File widget and load geo-gds360.tab from Browse documentation data sets. This data set has 9485 features, so imagine having to transform each feature individually.

Instead, we will connect Python Script to File and use a simple script to apply the same transformation to all attributes.

import numpy as np
from Orange.data import Table

new_X = np.log(in_data.X)
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)

This is really simple. Use in_data.X, which accesses all features in the data set, to transform the data with np.log (or any other numpy function). Set out_data to new_X and, voila, the transformed data is on the output. In a few lines we have instantly handled all 9485 features.

You can inspect the data before and after transformation in a Data Table widget.

Original data.
Log-transformed data.

 

This is it. Now we can do our standard analysis on the transformed data. Even better! We can save our script and use it in Python Script widget any time we want.

For your convenience I have already added the Log Attributes Script, so you can download and use it instantly!

Have a more interesting example with Python Script? We’d love to hear about it!

Data Mining Course at Higher School of Economics, Moscow

Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.

It was a real pleasure working at HSE. The students were proactive by asking questions and really challenged us to do our best.

One of the things we did was compute minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. Naive Bayes classifier, for example, returned probabilities of a patient being sick (column Naive Bayes 1). For each threshold in probabilites, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifiers returns. Each mistake is associated with a cost. Now she wants to find out, how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.

First, import all the libraries we will need:

import matplotlib.pyplot as plt
import numpy as np

from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation

Then load heart disease data (and print a sample).

heart = Table("heart_disease")
print(heart[:5])

Now, train classifiers and select probabilities of Naive Bayes for a patient being sick.

scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])

#take probabilites of class 1 (sick) of NaiveBayesLearner
p1 = scores.probabilities[0][:, 1]

#take actual class values
y = scores.actual

#cost of false positive (patient classified as sick when healthy)
fp_cost = 500

#cost of false negative (patient classified as healthy when sick)
fn_cost = 800

Set counts, where we declare 0 patients being sick (threshold >1).

fp = 0
#start with threshold above 1 (no one is sick)
fn = np.sum(y)

For each threshold, compute the cost associated with each type of mistake.

ps = []
costs = []

#compute costs of classifying i patients as sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        fp += 1
    else:
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)

In the end, we get a list of probability thresholds and associated costs. Now let us find the minimum cost and its probability of a patient being sick.

costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])

This means the threshold that minimizes our cost for a given classifier is 0.620655. Sara would send all the patients with a probability of being sick higher or equal than 0.620655  for further tests.

At the end, we can also plot the cost to patients sent curve.

fig, ax = plt.subplots()
plt.plot(ps, costs)
ax.set_xlabel('Patients sent')
ax.set_ylabel('Cost')

You can download the IPython Notebook here: Minimum Cost.

Installing Add-ons Works Again

Dear Orange users,

Some of you might have an issue installing add-ons with the following issue popping up:

xmlrpc.client.Fault: <Fault -32601: 'server error; requested method not found'>

This is the result of the migration to a new infrastructure at PyPi, which provides the installation of add-ons. Our team has rallied to adjust the add-on installer so it works with the new and improved service.

In order to make the add-on installer work (again), please download the latest version of Orange (3.13.0).

We apologize for any inconvenience and wish you a fruitful data analysis in the future.

Yours truly,

Orange team

Unfreezing Orange

Have you ever tried Orange with data big enough that some widgets ran for more than a second? Then you have seen it: Orange froze. While the widget was processing, the interface would not respond to any inputs, and there was no way to stop that widget.

Not all the widgets freeze, though! Some widgets, like Test & Score, k-Means, or Image Embedding, do not block. While they are working, we are free to build other parts of the workflow, and these widgets also show their progress. Some, like Image Embedding, which work with lots of images, even allow interruptions.

Why does Orange freeze? Most widgets process users’ actions directly: after an event (click, pressed key, new input data) some code starts running: until it finishes, the interface can not respond to any new events. This is a reasonable approach for short tasks, such as making a selection in a Scatter Plot. But with longer tasks, such as building a Support Vector Model on big data, Orange gets unresponsive.

To make Orange responsive while it is processing, we need to start the task in a new thread. As programmers we have to consider the following:
1. Starting the task. We have to make sure that other (older) tasks are not running.
2. Showing results when the task has finished.
3. Periodic communication between the task and the interface for status reports (progress bars) and task stopping.

Starting the task and showing the results are straightforward and well documented in a tutorial for writing widgets. Periodic communication with stopping is harder: it is completely task-dependent and can be either trivial, hard, or even impossible. Periodic communication is, in principle, unessential for responsiveness, but if we do not implement it, we will be unable to stop the running task and progress bars would not work either.

Taking care of periodic communication was the hardest part of making the Neural Network widget responsive. It would have been easy, had we implemented neural networks ourselves. But we use the scikit-learn implementation, which does not expose an option to make additional function calls while fitting the network (we need to run code that communicates with the interface). We had to resort to a trick: we modified fitting so that a change to an attribute called n_iters_ called a function (see pull request). Not the cleanest solution, but it seems to work.

For now, only a few widgets work so that the interface remains responsive. We are still searching for the best way to make existing widgets behave nicely, but responsiveness is now one of our priorities.

Orange with Spectroscopy Add-on Workshop

We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. Spectroscopy add-on is intended for the analysis of spectral data and it is just as fun as our other add-ons (if not more!).

We will prove it with a simple classification workflow. First, install Spectroscopy add-on from Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!

Use Datasets widget and load Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of a cell. A quick glance in a Data Table will give us an idea how the data looks like. Seems like a very standard spectral data set.

Collagen data set from Datasets widget.

 

Now we want to determine, whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing. Spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying Cut (keep) filter and retaining only the wave numbers between 1500 and 1800. When we connect it to Test & Score, we need to keep in mind to connect the Preprocessor output of Preprocess Spectra.

Preprocessor that keeps a part of the spectra cut between 1500 and 1800. No data is shown here, since we are using only the preprocessing procedure as the input for Test & Score.

 

Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?

10-fold cross-validation on spectral data. Our AUC and CA scores are quite impressive.

 

Confusion Matrix gives us a detailed picture. Our model fails almost exclusively on DNA cell type. Interesting.

Confusion Matrix shows DNA is most often misclassified. By selecting the misclassified instances in the matrix, we can inspect why Logistic Regression couldn’t model these spectra

 

We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?

Misclassified DNA spectra colored by the prediction made by Logistic Regression.

 

This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, learner (LR) as learner and preprocessor as preprocessor directly to Test & Score to avoid overfitting. Play around with Spectroscopy add-on and let us know what you think! 🙂

Single cell analytics workshop at HHMI | Janelia

HHMI | Janelia is one of the prettiest researcher campuses I have ever visited. Located in Ashburn, VA, about 20 minutes from Washington Dulles airport, it is conveniently located yet, in a way, secluded from the buzz of the capital. We adored the guest house with a view of the lake, tasty Janelia-style breakfast (hash-browns with two eggs and sausage, plus a bagel with cream cheese) in the on-campus pub, beautifully-designed interiors to foster collaborations and interactions, and late-evening discussions in the in-house pub.

All these thanks to the invitation of Andrew Lemire, a manager of a shared high-throughput genomics resource, and Dr. Vilas Menon, a mathematician specializing in quantitative genomics. With Andy and Vilas, we have been collaborating in the past few months on trying to devise a simple and intuitive tool for analysis of single-cell gene expression data. Single cell high-throughput technology is one of the latest approaches that allow us to see what is happening within a single cell, and it does that by simultaneously scanning through potentially thousands of cells. That generates loads of data, and apparently, we have been trying to fit Orange for single-cell data analysis task.

Namely, in the past half a year, we have been perfecting an add-on for Orange with components for single-cell analysis. This endeavor became so vital that we have even designed a new installation of Orange, called scOrange. With everything still in prototype stage, we had enough courage to present the tool at Janelia, first through a seminar, and the next day within a five-hour lecture that I gave together with Martin Strazar, a PhD student and bioinformatics expert from my lab. Many labs are embarking on single cell technology at Janelia, and by the crowd that gathered at both events, it looks like that everyone was there.

Orange, or rather, scOrange, worked as expected, and hands-on workshop was smooth, despite testing the software on some rather large data sets. Our Orange add-on for single-cell analytics is still in early stage of development, but already has some advanced features like biomarker discovery and tools for characterization of cell clusters that may help in revealing hidden relations between genes and phenotypes. Thanks to Andy and Vilas, Janelia proved an excellent proving ground for scOrange, and we are looking forward to our next hands-on single-cell analytics workshop in Houston.

Related: Hands-On Data Mining Course in Houston