Girls Go Data Mining

This week we held our first Girls Go Data Mining workshop. The workshop brought together curious women and intuitively introduced them to essential data mining and machine learning concepts. Of course, we used Orange to explore visualizations, build predictive models, perform clustering and dive into text analysis. The workshop was supported by NumFocus through their small development grant initiative and we hope to repeat it next year with even more ladies attending!

Related: Text Analysis for Social Scientists

In two days, we covered many topics. On day one, we got to know Orange and the concept of visual programming, where the user constructs an analytical workflow by stacking visual components. Then we got to know several useful visualizations, such as box plot, scatter plot, distributions, and mosaic display, which give us an initial overview of the data and reveal potentially interesting patterns. Finally, we got our hands dirty with predictive modeling. We learnt about decision trees, logistic regression, and naive Bayes classifiers, and observed the models in tree viewer and nomogram. It is great to have interpretable models and we had great fun exploring what is in each model!

On the second day, we tried to uncover groups in our data with clustering. First, we tried hierarchical clustering and explored the discovered clusters with a box plot. Then we tried k-means and learnt why this method is often preferred to hierarchical clustering. In the final part, we talked about methods for text mining: how to do preprocessing, construct a bag of words and perform machine learning on corpora. We used both clustering and classification and tried to find interesting information about Grimm tales.
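As a tiny illustration of the bag-of-words idea, here is a minimal sketch using scikit-learn as a stand-in (in the workshop this was done with Orange widgets; the two snippets below are hypothetical stand-ins for Grimm tales):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the frog king and the golden ball",
          "the golden goose"]

# rows are documents, columns are words, cells are word counts
bow = CountVectorizer().fit_transform(corpus)
print(bow.toarray())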

One of our workflows, where we explored the data in many different ways, including inspecting misclassifications in a scatter plot!

 

One thing that always comes up as really useful in our workshops is Orange’s ability to output different types of data. For example, in Hierarchical Clustering, we can select the similarity cutoff at the top and output clusters. Our data table will have an additional column Cluster, with cluster labels for each data instance.
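For scripting fans, roughly the same result can be reproduced in code. This is a minimal sketch that uses scipy for the clustering itself, so it is not what the widget runs internally, and the cluster numbering may differ from the dendrogram's:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from Orange.data import Table, Domain, DiscreteVariable

data = Table("zoo")

# cluster on the feature matrix and cut the dendrogram at three clusters
Z = linkage(pdist(data.X), method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # values 1..3

# append the labels as an extra (meta) column named Cluster
cluster = DiscreteVariable("Cluster", values=["C1", "C2", "C3"])
domain = Domain(data.domain.attributes, data.domain.class_vars,
                metas=data.domain.metas + (cluster,))
metas = np.hstack([data.metas, (labels - 1).reshape(-1, 1)])
out_data = Table.from_numpy(domain, data.X, data.Y, metas)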

 

Hierarchical Clustering outputs data with an additional Cluster column.

 

We can explore clusters by connecting a Box Plot to Hierarchical Clustering, selecting Cluster in Subgroups and using the Order by relevance option. This sorts the variables in Box Plot by how well they separate the clusters or, in other words, by what is typical of each cluster.

We have selected Cluster in Subgroups section and ticked ‘Order by relevance’ to sort the variables. Variables at the top are the most interesting ones. Looks like giving milk is an exclusive property of cluster C1.

 

We used zoo.tab and placed the cutoff at three clusters. It looks like the first cluster gives milk. Could this be a cluster of mammals?

We said giving milk is a property of cluster C1. By selecting type as our variable, we can see that C1 is a cluster of mammals.

 

Indeed it is!

Another option is to select a specific cluster in the dendrogram. Then, we have to rewire the connection between Hierarchical Clustering and Box Plot by setting it to Data. The Data option outputs the entire data set, with an extra column showing whether the data instance was selected or not. In our case, there would be a Yes if the instance is in the selected cluster and a No if it is not.

To rewire the connection, double-click on it and drag a line from Data to Data.

 

We have selected one cluster in the dendrogram, rewired the connection to transmit Data (instead of Selected Data) and observed the results in a Data Table. We see an additional Selected column, which shows whether a data instance was selected in the visualization or not.

 

Then we can use Box Plot to observe what is particular for our selected cluster.

In this Box Plot we have used Selected in the Subgroups section and kept ‘Order by relevance’ on. The suggested distinctive feature of our selected cluster is having feathers.

 

It looks like animals from our selected cluster have feathers. Probably, this is a cluster of birds. We can check this with the same procedure as above.

In summary, most Orange visualizations have two outputs – Selected Data and Data. Selected Data will output a subset of data instances selected in the visualization (or selected clusters in the case of hierarchical clustering), while Data will output the entire data table with a column defining whether a data instance was selected or not. This is very useful if we want to inspect what is typical of an interesting group in our data, inspect clusters or even manually define groups.

Overall, this was another interesting workshop and we hope to continue our fruitful partnership with NumFocus and keep offering free educational events for beginners and experts alike!

From Surveys to Orange

Today we have finished a series of workshops for the Ministry of Public Affairs. This was a year-long cooperation and we had many students asking many different questions. There was, however, one question that came up again and again: if I have a survey, how do I get it into Orange?

Related: Analyzing Surveys

We are using the EnKlik Anketa service, a great Slovenian product offering a wide array of options for creating surveys. We have created one such simple survey to use as a test. I am now inside the EnKlik Anketa online service and I can see that my survey has been successfully filled out.

Now I have to create a public link to my survey in order to access the data in Orange. I click the icon in the top right corner and select ‘Public link’.

A new window opens, where I select ‘Add new public link’. This will generate a public connection to my survey results. But be careful: the type of the connection needs to be Data, not Analysis! Orange can’t read already analyzed data; it needs the raw data from the Data pane.

Now, all I have to do is open Orange, place the EnKlik Anketa widget from the Prototypes add-on onto the canvas, enter the public link into the ‘Public link URL’ field and press Enter. If your data has loaded successfully, the widget will display the available variables and information in the Info pane.

From here on you can continue your analysis just like you would with any other data source!

Spectroscopy Workshop at BioSpec and How to Merge Data

Last week Marko and I visited the land of the midnight sun – Norway! We held a two-day workshop on spectroscopy data analysis in Orange at the Norwegian University of Life Sciences. The students from BioSpec lab were yet again incredible and we really dug deep into Orange.

Related: Orange with Spectroscopy Add-on

A class full of dedicated scientists.

 

One thing we did was see how to join data from two different sources. It often happens that you have measurements in one file and labels in another. Or, in our case, we wanted to add images to our zoo.tab data. First, find zoo.tab in the File widget under Browse documentation datasets. Observe the data in a Data Table.

Original zoo data set.

 

This data contains 101 animals described with 16 different features (hair, aquatic, eggs, etc.), a name and a type. Now we will manually create a second table in Excel. The first column will contain the names of the animals as they appear in the original file. The second column will contain links to images of the animals. Open your favorite browser and find a couple of images corresponding to the selected animals. Then add the links under the image column. Just like that:

Extra data that we want to add to the original data.

 

Remember, you need a three-row header to define the column that contains images. Under the image column, add string in the second row and type=image in the third. This tells Orange where to look for images. Now we can check our animals in Image Viewer.
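For illustration, the extra table could look like this (two columns; the URLs are hypothetical placeholders, not from the original file):

name        image
string      string
            type=image
aardvark    https://example.com/aardvark.jpg
bear        https://example.com/bear.jpg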

A quick glance at an Image Viewer will tell us whether our images got loaded correctly.

 

Finally, it is time to bring in the images to the existing zoo data set. Connect the original File to Merge Data. Then add the second file with animal images to Merge Data. The default merging method will take the first data input as original data and the second data as extra data. The column to match by is defined in the widget. In our case, it is the name column. This means Orange will look at the first name column and find matching instances in the second name column.
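Conceptually, Merge Data behaves like a join on the shared column. A rough pandas analogy (hypothetical file names; this is not the widget's implementation):

import pandas as pd

zoo = pd.read_csv("zoo.csv")        # original data with a "name" column
images = pd.read_csv("images.csv")  # extra data: "name" and "image" columns

# keep all original rows and attach the image link wherever names match
merged = zoo.merge(images, on="name", how="left")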

 

A quick look at the merged data shows us an additional image column that we appended to the original file.

Merged data with a new column.

 

This is the final workflow. Merge Data now contains a single data table on the output and you can continue your analysis from there.

Find out more about spectroscopy for Orange on our YouTube channel or contribute to the project on Github.

Python Script: Managing Data on the Fly

Python Script is this mysterious widget most people don’t know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer. And it’s time we unveil some of its functionalities with a simple example.

Example: Batch Transform the Data

There might be a time when you need to apply a function to all your attributes. Say you wish to log-transform their values, as is common with gene expression data. In theory, you could do this with Feature Constructor, where you would log-transform every attribute individually. Sounds laborious? That's because it is. Why else do we have computers, if not to reduce manual labor for certain tasks? Let's do it the fast way – with Python Script.

First, open File widget and load geo-gds360.tab from Browse documentation data sets. This data set has 9485 features, so imagine having to transform each feature individually.

Instead, we will connect Python Script to File and use a simple script to apply the same transformation to all attributes.

import numpy as np
from Orange.data import Table

# in_data is the table on the widget's input; np.log is applied
# element-wise to the whole feature matrix at once
new_X = np.log(in_data.X)
# rebuild a table with the same domain, the transformed X,
# and the original class values and metas
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)

This is really simple. in_data.X accesses all features in the data set, and np.log (or any other numpy function) transforms them in one go. Wrap the result into a new Table, assign it to out_data and, voila, the transformed data is on the output. In a few lines we have instantly handled all 9485 features.
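One caveat worth adding (our note, not part of the original script): np.log returns -inf for zeros and nan for negative values. If your expression matrix can contain zeros, a common workaround is a shifted log:

# log2(x + 1) avoids -inf on zero entries
new_X = np.log2(in_data.X + 1)
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)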

You can inspect the data before and after transformation in a Data Table widget.

Original data.
Log-transformed data.

 

This is it. Now we can do our standard analysis on the transformed data. Even better! We can save our script and use it in Python Script widget any time we want.

For your convenience I have already added the Log Attributes Script, so you can download and use it instantly!

Have a more interesting example with Python Script? We’d love to hear about it!

Data Mining Course at Higher School of Economics, Moscow

Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.

It was a real pleasure working at HSE. The students were proactive by asking questions and really challenged us to do our best.

One of the things we did was compute the minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. The naive Bayes classifier, for example, returned the probability of a patient being sick (column Naive Bayes 1). For each probability threshold, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifier returns. Each mistake is associated with a cost. Now she wants to find out how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.

First, import all the libraries we will need:

import matplotlib.pyplot as plt
import numpy as np

from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation

Then load heart disease data (and print a sample).

heart = Table("heart_disease")
print(heart[:5])

Now, train classifiers and select probabilities of Naive Bayes for a patient being sick.

scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])

#take probabilities of class 1 (sick) of NaiveBayesLearner
p1 = scores.probabilities[0][:, 1]

#take actual class values
y = scores.actual

#cost of false positive (patient classified as sick when healthy)
fp_cost = 500

#cost of false negative (patient classified as healthy when sick)
fn_cost = 800

Initialize the counts for a threshold above 1, where we declare no patients sick: there are no false positives yet, and every sick patient counts as a false negative.

#with a threshold above 1 no one is declared sick:
#no false positives, every sick patient is a false negative
fp = 0
fn = np.sum(y)

For each threshold, compute the cost associated with each type of mistake.

ps = []
costs = []

#lower the threshold one instance at a time (most probable first)
#and track the cost of declaring the instances seen so far as sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        fp += 1
    else:
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)

In the end, we get a list of probability thresholds and the associated costs. Now let us find the minimum cost and the probability threshold at which it is achieved.

costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])

This means the threshold that minimizes our cost for a given classifier is 0.620655. Sara would send all patients with a probability of being sick greater than or equal to 0.620655 for further tests.

Finally, we can also plot how the cost changes with the probability threshold (a lower threshold means more patients sent for tests).

fig, ax = plt.subplots()
ax.plot(ps, costs)
ax.set_xlabel('Probability threshold')
ax.set_ylabel('Cost')
plt.show()

You can download the IPython Notebook here: Minimum Cost.

Installing Add-ons Works Again

Dear Orange users,

Some of you might have had trouble installing add-ons, with the following error popping up:

xmlrpc.client.Fault: <Fault -32601: 'server error; requested method not found'>

This is the result of the migration to a new infrastructure at PyPI, which hosts our add-ons. Our team has rallied to adjust the add-on installer so it works with the new and improved service.

In order to make the add-on installer work (again), please download the latest version of Orange (3.13.0).

We apologize for any inconvenience and wish you a fruitful data analysis in the future.

Yours truly,

Orange team

Orange with Spectroscopy Add-on Workshop

We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. The Spectroscopy add-on is intended for the analysis of spectral data and it is just as fun as our other add-ons (if not more!).

We will prove it with a simple classification workflow. First, install the Spectroscopy add-on from the Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!

Use the Datasets widget and load the Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of the cell. A quick glance at a Data Table will give us an idea of what the data looks like. It seems like a very standard spectral data set.

Collagen data set from Datasets widget.

 

Now we want to determine whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing: spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying the Cut (keep) filter and retaining only the wavenumbers between 1500 and 1800. When we connect it to Test & Score, we need to remember to connect the Preprocessor output of Preprocess Spectra.

Preprocessor that keeps a part of the spectra cut between 1500 and 1800. No data is shown here, since we are using only the preprocessing procedure as the input for Test & Score.
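For scripting fans, roughly the same workflow can be sketched in pure Python. This is a hedged sketch only: it assumes the orangecontrib spectroscopy package is installed, that the collagen data set is available locally, and that Cut accepts lowlim/highlim arguments.

from Orange.data import Table
from Orange.classification import LogisticRegressionLearner
from Orange.evaluation import CrossValidation, AUC
from orangecontrib.spectroscopy.preprocess import Cut

collagen = Table("collagen")

# passing Cut as a learner preprocessor mirrors connecting the
# Preprocessor output to Test & Score: it is applied inside each fold
learner = LogisticRegressionLearner(
    preprocessors=[Cut(lowlim=1500, highlim=1800)])

results = CrossValidation(collagen, [learner], k=10)
print("AUC:", AUC(results))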

 

Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?

10-fold cross-validation on spectral data. Our AUC and CA scores are quite impressive.

 

Confusion Matrix gives us a more detailed picture. Our model fails almost exclusively on the DNA cell type. Interesting.

Confusion Matrix shows DNA is most often misclassified. By selecting the misclassified instances in the matrix, we can inspect why Logistic Regression couldn’t model these spectra.

 

We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?

Misclassified DNA spectra colored by the prediction made by Logistic Regression.

 

This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, learner (LR) as learner and preprocessor as preprocessor directly to Test & Score to avoid overfitting. Play around with Spectroscopy add-on and let us know what you think! 🙂

How to enable SQL widget in Orange

A lot of you have been interested in enabling the SQL widget in Orange, especially regarding the installation of a psycopg backend that makes the widget actually work. This post will be slightly more technical, but I will try to keep it to a minimum. Scroll to the bottom for installation instructions.

Related: SQL for Orange

Why won’t Orange recognize psycopg?

The main issue for some people was that despite having installed the psycopg module from the console, the SQL widget still didn’t work. This is because Orange uses a separate virtual environment, and most of you installed psycopg in the default (system) Python environment. For psycopg to be recognized by Orange, it needs to be installed in the same virtual environment, which on Windows is normally located in C:\Users\<usr>\Anaconda3\envs\orange3. For the installation to work, you have to run it with the proper pip, namely:

C:\Users\<usr>\Anaconda3\envs\orange3\Scripts\pip.exe install psycopg2
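To check that the module landed in the right environment, you can import it with that environment's Python (same path as above):

C:\Users\<usr>\Anaconda3\envs\orange3\python.exe -c "import psycopg2; print(psycopg2.__version__)"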

Installation instructions

But there is a much easier way to do it. Head over to psycopg2’s PyPI page and download the latest wheel for your platform. The Python version has to be cp34 or higher (the latest Orange from Anaconda comes with Python 3.6, so look for cp36).

For OSX, you would for example need: psycopg2-2.7.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl

For 64-bit Windows: psycopg2-2.7.4-cp36-cp36m-win_amd64.whl

And for Linux: psycopg2-2.7.4-cp36-cp36m-manylinux1_x86_64.whl

Then open the add-on dialog in Orange (Options –> Add-ons) and drag and drop the downloaded wheel onto the add-on list. At the bottom, you will see psycopg2 with a tick next to it.

Click OK to run the installation. Then re-start Orange and connect to your database with SQL widget. If you have any questions, drop them in the comment section!

Image Analytics Workshop at AIUCD 2018

This week, Primož and I flew to the south of Italy to hold a workshop on Image Analytics through Data Mining at AIUCD 2018 conference. The workshop was intended to familiarize digital humanities researchers with options that visual programming environments offer for image analysis.

In about 5 hours we discussed image embedding, clustering, finding closest neighbors and classification of images. While it is often a challenge to explain complex concepts in such a short time, it is much easier when working with Orange.

Related: Image Analytics: Clustering

One of the workflows we learned at the workshop was the one for finding the most similar image in a set of images. This is better explained with an example.

We had 15 paintings from different authors. Two of them were painted by Claude Monet, a famous French impressionist painter. Our task was, given a reference image of Monet, to find his other painting in a collection.

A collection of images. It includes two Monet paintings.

First, we loaded our data set with Import Images. Then we sent our images to Image Embedding. We selected the Painters embedder, since it was specifically trained to recognize the authors of paintings.

We used Painters embedder here.

Once we have described our paintings with vectors (embeddings), we can compare them by similarity. To find the second Monet in the data set, we will have to compute the similarity of paintings and find the one most similar to our reference painting.

Related: Video on image clustering

Let us connect Image Embedding to Neighbors from the Prototypes add-on. The Neighbors widget is specifically intended to find a number of closest neighbors given a reference data point.

We will need to adjust the widget a bit. First, we need cosine distance, since we are comparing images by their content, not by the magnitude of their features. Next, we untick Exclude reference, so that the reference image appears in the output. We do this just for visualization purposes. Finally, we set the number of neighbors to 2. Again, this is just for a nicer visualization, since we know there are only two Monet paintings in the data set.
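For the curious, cosine distance itself is a one-liner. A minimal numpy sketch (hypothetical helper functions, not the widget's internals):

import numpy as np

def cosine_distance(a, b):
    # 0 for identical directions, up to 2 for opposite ones
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def closest(embeddings, ref_index, n=2):
    # distances from the reference embedding to every image, reference included
    d = [cosine_distance(embeddings[ref_index], e) for e in embeddings]
    return np.argsort(d)[:n]  # the reference itself comes first, at distance 0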

Neighbors was set up to provide a nice visualization. Hence we unticked Exclude reference and set Neighbors to 2.

Then we need to give Neighbors a reference image, for which we want to retrieve the neighbors. We do this by adding Data Table to Image Embedding, selecting one of Monet’s paintings in the spreadsheet and then connecting the Data Table to Neighbors. The widget will automatically consider the second input as a reference.

Monet.jpg is our reference painting. We select it in Data Table.

Now, all we need to do is to visualize the output. Connect Image Viewer to Neighbors and open it.

Voila! The widget has indeed found the second Monet painting. So useful when you have thousands of images in your archive!