Single cell analytics workshop at HHMI | Janelia

HHMI | Janelia is one of the prettiest research campuses I have ever visited. Located in Ashburn, VA, about 20 minutes from Washington Dulles airport, it is conveniently placed yet, in a way, secluded from the buzz of the capital. We adored the guest house with a view of the lake, the tasty Janelia-style breakfast (hash browns with two eggs and sausage, plus a bagel with cream cheese) in the on-campus pub, the beautifully designed interiors that foster collaboration and interaction, and the late-evening discussions in the in-house pub.

All this was thanks to an invitation from Andrew Lemire, manager of a shared high-throughput genomics resource, and Dr. Vilas Menon, a mathematician specializing in quantitative genomics. For the past few months we have been collaborating with Andy and Vilas on devising a simple and intuitive tool for the analysis of single-cell gene expression data. Single-cell high-throughput technology is one of the latest approaches that let us see what is happening within a single cell, and it does so by simultaneously scanning through potentially thousands of cells. That generates loads of data, and we have been adapting Orange to the task of single-cell data analysis.

Namely, over the past half a year we have been perfecting an add-on for Orange with components for single-cell analysis. This endeavor became so substantial that we even designed a new installation of Orange, called scOrange. With everything still at the prototype stage, we had enough courage to present the tool at Janelia, first through a seminar, and the next day within a five-hour lecture that I gave together with Martin Strazar, a PhD student and bioinformatics expert from my lab. Many labs at Janelia are embarking on single-cell technology, and judging by the crowd that gathered at both events, it looked like everyone was there.

Orange, or rather scOrange, worked as expected, and the hands-on workshop went smoothly, despite testing the software on some rather large data sets. Our Orange add-on for single-cell analytics is still at an early stage of development, but it already has some advanced features, like biomarker discovery and tools for the characterization of cell clusters, that may help reveal hidden relations between genes and phenotypes. Thanks to Andy and Vilas, Janelia was an excellent proving ground for scOrange, and we are looking forward to our next hands-on single-cell analytics workshop in Houston.

Related: Hands-On Data Mining Course in Houston

How to enable SQL widget in Orange

A lot of you have been interested in enabling the SQL widget in Orange, especially regarding the installation of the psycopg backend that makes the widget actually work. This post will be slightly more technical, but I will try to keep the technicalities to a minimum. Scroll to the bottom for installation instructions.

Related: SQL for Orange

Why won’t Orange recognize psycopg?

The main issue for some people was that the SQL widget still didn't work despite the psycopg module having been installed from the console. This is because Orange uses a separate virtual environment, and most of you installed psycopg into the default (system) Python environment. For psycopg to be recognized by Orange, it needs to be installed into Orange's own virtual environment, which on Windows is normally located in C:\Users\<usr>\Anaconda3\envs\orange3. For the installation to work, you have to run it with the proper pip, namely:

C:\Users\<usr>\Anaconda3\envs\orange3\Scripts\pip.exe install psycopg2
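If you are not sure which environment Orange runs in, or whether the backend is now visible, a quick check from within Orange (for example, in the Python Script widget) will tell you; this is just a diagnostic sketch:

import sys
print(sys.prefix)            # path of the environment Orange is running from

import psycopg2              # raises ImportError if the backend is missing
print(psycopg2.__version__)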

Installation instructions

But there is a much easier way to do it. Head over to psycopg2's page on PyPI and download the latest wheel for your platform. The Python version tag has to be cp34 or higher (the latest Orange from Anaconda comes with Python 3.6, so look for cp36).

For OSX, you would for example need: psycopg2-2.7.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl

For 64-bit Windows: psycopg2-2.7.4-cp36-cp36m-win_amd64.whl

And for Linux: psycopg2-2.7.4-cp36-cp36m-manylinux1_x86_64.whl

Then open the add-on dialog in Orange (Options → Add-ons) and drag and drop the downloaded wheel onto the add-on list. At the bottom, you will see psycopg2 with a tick next to it.

Click OK to run the installation. Then restart Orange and connect to your database with the SQL widget. If you have any questions, drop them in the comment section!

Image Analytics Workshop at AIUCD 2018

This week, Primož and I flew to the south of Italy to hold a workshop on Image Analytics through Data Mining at AIUCD 2018 conference. The workshop was intended to familiarize digital humanities researchers with options that visual programming environments offer for image analysis.

In about 5 hours we discussed image embedding, clustering, finding closest neighbors and classification of images. While it is often a challenge to explain complex concepts in such a short time, it is much easier when working with Orange.

Related: Image Analytics: Clustering

One of the workflows we built at the workshop was one for finding the most similar image in a set of images. This is best explained with an example.

We had 15 paintings from different authors. Two of them were painted by Claude Monet, a famous French impressionist painter. Our task was, given a reference image of Monet, to find his other painting in a collection.

A collection of images. It includes two Monet paintings.

First, we loaded our data set with Import Images. Then we sent our images to Image Embedding. We selected Painters embedder since it was specifically trained to recognize authors of paintings.

We used Painters embedder here.

Once we have described our paintings with vectors (embeddings), we can compare them by similarity. To find the second Monet in the data set, we have to compute the similarity of paintings and find the one most similar to our reference painting.
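This step is easy to sketch in plain Python: given the embedding vectors, the most similar painting is the one at the smallest cosine distance from the reference. A minimal sketch, assuming the embeddings were exported to a NumPy file (the file name and the reference row are hypothetical):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

embeddings = np.load("painting_embeddings.npy")   # one row per painting
reference = embeddings[3:4]                       # say row 3 is our Monet reference

dist = cosine_distances(embeddings, reference).ravel()
order = np.argsort(dist)         # order[0] is the reference itself (distance 0)
print("most similar painting:", order[1])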

Related: Video on image clustering

Let us connect Image Embedding to Neighbors from the Prototypes add-on. The Neighbors widget is specifically intended to find a number of closest neighbors given a reference data point.

We will need to adjust the widget a bit. First, we will need cosine distance, since we will be comparing images by their content, not by the magnitude of their features. Next, we will untick Exclude reference, so that we receive the reference image on the output as well. We do this just for visualization purposes. Finally, we set the number of neighbors to 2. Again, this is just for a nicer visualization, since we know there are only two Monet paintings in the data set.

Neighbors was set to provide a nice visualization. Hence we unticked Exclude references and set Neighbors to 2.

Then we need to give Neighbors a reference image, for which we want to retrieve the neighbors. We do this by adding Data Table to Image Embedding, selecting one of Monet’s paintings in the spreadsheet and then connecting the Data Table to Neighbors. The widget will automatically consider the second input as a reference.

Monet.jpg is our reference painting. We select it in Data Table.

Now, all we need to do is to visualize the output. Connect Image Viewer to Neighbors and open it.

Voila! The widget has indeed found the second Monet painting. So useful when you have thousands of images in your archive!

Visualizing multiple variables: FreeViz

Scatter plots are great! But sometimes, we need to plot more than two variables to truly understand the data. How can we achieve this, knowing humans can only grasp up to three dimensions? With an optimized linear projection, of course!

Orange recently re-introduced FreeViz, an interactive visualization for plotting multiple variables on a 2-D plane.

Let’s load the zoo.tab data with the File widget and connect FreeViz to it. The zoo data has 16 features describing animals of different types – mammals, amphibians, insects and so on. We would like FreeViz to show us the informative features and create a visualization that separates the animal types well.

FreeViz with initial, un-optimized plot.

We start with an un-optimized projection, where data points are scattered around the feature axes. Once we click Optimize, we can observe the optimization process in real time and, at the end, see the optimized projection.
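Under the hood, FreeViz assigns each feature an anchor on the 2-D plane and maps every data point to the anchor-weighted sum of its feature values; the optimization then moves the anchors so that points of the same class group together. A toy NumPy sketch of just the projection step, with random anchors standing in for the optimized ones:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 16))           # 100 animals, 16 features, as in zoo.tab
anchors = rng.normal(size=(16, 2))  # one 2-D anchor per feature; FreeViz optimizes these

projected = X @ anchors             # each point: anchor-weighted sum of its feature values
print(projected.shape)              # (100, 2), ready for a scatter plot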

FreeViz with optimized projection.

This projection is much more informative. Mammals are nicely grouped together within a pink cluster that is characterized by the hair, milk, and toothed features. Conversely, birds are characterized by eggs, feathers and airborne, while fish are aquatic. The results are as expected, which means the optimization indeed found informative features for each class value.

FreeViz with Show class density option.

Since we are working with categorical class values, we can tick Show class density to color the plot by majority class values. We can also move anchors around to see how data points change in relation to a selected anchor.

Finally, as in most Orange visualizations, we can select a subset of data points and explore them further. For example, let us observe which amphibians are characterized by being aquatic in a Data Table. A newt, a toad and two types of frogs, one venomous and one not.

Data exploration is always much easier with clever visualizations!

Stack Everything!

We all know that sometimes many is better than few. Therefore we are happy to introduce the Stack widget. For now, it is available in the Prototypes add-on.

Stacking enables you to combine several trained models into one meta model and use it in Test&Score just like any other model. This comes in handy with complex problems, where a single classifier might fail, but several together can come up with something that works. Let’s see an example.

We start with something as complex as this. We used Paint Data to create a complex data set, where classes somewhat overlap. This is naturally an artificial example, but you can try the same on your own, real-life data.

We used 4 classes and painted a complex, 2-dimensional data set.

Then we add several kNN models with different parameters, say 5, 10 and 15 neighbors. We connect them to Test&Score and use cross-validation to evaluate their performance. Not bad, but can we do even better?

Scores without stacking, using only 3 different kNN classifiers.

Let us try stacking. We will connect all three classifiers to the Stack widget and use Logistic Regression as the aggregate, the method that combines the three models into a single meta model. Then we connect the stacked model to Test&Score and see whether our scores have improved.

Scores with stacking. Stack reports on improved performance.

And indeed they have. It might not be anything dramatic, but in real life, say in a medical context, even small improvements count. Now go and try the procedure on your own data. In Orange, this requires only a couple of minutes.

Final workflow with channel names. Notice that Logistic Regression is used as Aggregate, not a Learner.
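If you prefer scripting to widgets, the same experiment takes a few lines with scikit-learn, which provides stacking with a logistic regression aggregate out of the box. A sketch on synthetic data, standing in for the painted data set:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# a noisy two-feature, four-class problem with overlapping classes
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           flip_y=0.2, random_state=0)

knns = [("knn%d" % k, KNeighborsClassifier(n_neighbors=k)) for k in (5, 10, 15)]
stack = StackingClassifier(estimators=knns,
                           final_estimator=LogisticRegression(max_iter=1000))

for name, model in knns + [("stack", stack)]:
    print(name, cross_val_score(model, X, y, cv=10).mean())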

Speeding Up Network Visualization

The Orange3 Network add-on contains a convenient Network Explorer widget for network visualization. Orange uses an iterative force-directed method (a variation of the Fruchterman-Reingold algorithm) to lay out the nodes on the 2-D plane.

The goal of force-directed methods is to draw connected nodes close to each other, as if the edges that connect the nodes were acting as springs. We also don’t want all the nodes crowded in a single point; we would rather have them spaced evenly. This is achieved by simulating a repulsive force, which decreases with the distance between nodes.

There are two types of forces acting on each node:

  • the attractive force towards connected adjacent nodes,
  • the repulsive force that is directed away from all other nodes.

We could say that such network visualization as a whole is rather repulsive. Take, for example, the lastfm.net network that comes with Orange’s network add-on: it has around 1,000 nodes and 4,000 edges. In every iteration, we have to consider 4,000 attractive forces and about 1,000,000 repulsive forces, one for each of the 1,000 × 1,000 node pairs. It takes about 100 iterations to get a decent network layout. That’s a lot of repulsion, and you’ll have to wait a while before you get the final layout.

Fortunately, we found a simple hack to speed things up. When computing the repulsive force acting on a node, we only consider a 10% sample of the other nodes to obtain an estimate. We multiply the result by 10 and hope it’s not off by too much. By choosing a different sample in every iteration, we also avoid consistently favoring some set of nodes.
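Here is a sketch of the sampled repulsion step in Python, an illustration of the idea rather than Orange’s actual implementation:

import numpy as np

def sampled_repulsion(pos, k, frac=0.1, rng=None):
    """Estimate repulsive displacements from a sample of the other nodes."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(pos)
    m = max(1, int(frac * n))
    disp = np.zeros_like(pos)
    for i in range(n):
        idx = rng.choice(n, size=m, replace=False)  # fresh sample every iteration
        idx = idx[idx != i]                         # a node does not repel itself
        delta = pos[i] - pos[idx]
        dist = np.linalg.norm(delta, axis=1) + 1e-9
        # Fruchterman-Reingold repulsion k^2/d along delta, scaled by 1/frac
        disp[i] = (delta * (k * k / dist**2)[:, None]).sum(axis=0) / frac
    return disp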

The left layout is obtained without sampling while the right one uses a 10% sampling. The results are pretty similar, but the sampling method is 10 times faster!

Now that the computation is fast enough, it is time to speed up the drawing as well. But that’s a task for 2018.

How to Properly Test Models

On Monday we finished the second part of the workshop for the Statistical Office of the Republic of Slovenia. The crowd was tough – these guys knew their numbers and asked many challenging questions. And we loved it!

One thing we discussed was how to properly test your model. Ok, we know never to test a model on the same data it was built with, but even training and testing on separate data is sometimes not enough. Say I’ve tested Naive Bayes, Logistic Regression and Tree. Sure, I can select the one that gives the best performance, but by picking the best-scoring model we could be (over)fitting our model selection, too.

To account for this, we would normally split the data to 3 parts:

  1. training data for building a model
  2. validation data for testing which parameters and which model to use
  3. test data for estimating the accuracy of the model

Let us try this in Orange. Load the heart-disease.tab data set from Browse documentation data sets in the File widget. We have 303 patients diagnosed with blood vessel narrowing (1) or diagnosed as healthy (0).

Now, we will split the data into two parts, 85% of data for training and 15% for testing. We will send the first 85% onwards to build a model.

We sampled by a fixed proportion of data and went with 85%, which is 258 out of 303 patients.

We will use Naive Bayes, Logistic Regression and Tree, but you can try other models, too. This is also the place and time to try different parameters. Now we send the models to Test & Score. We used cross-validation and discovered that Logistic Regression scores the highest AUC. Say this is the model and the parameters we want to go with.

Now it is time to bring in our test data (the remaining 15%) for testing. Connect Data Sampler to Test & Score once again and set the connection Remaining Data – Test Data.

Test & Score will warn us that we have test data present, but unused. Select the Test on test data option and observe the results. These are now the proper scores for our models.
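The whole procedure can also be scripted. A sketch with Orange’s scripting API (using the call style of Orange 3 at the time of writing):

import numpy as np
import Orange

data = Orange.data.Table("heart_disease")   # 303 patients

# 85% for training, 15% held out for testing, like Data Sampler's fixed proportion
rng = np.random.default_rng(0)
idx = rng.permutation(len(data))
cut = int(0.85 * len(data))
train, test = data[idx[:cut]], data[idx[cut:]]

learners = [Orange.classification.NaiveBayesLearner(),
            Orange.classification.LogisticRegressionLearner(),
            Orange.classification.TreeLearner()]

# choose the model by cross-validating on the training part only ...
cv = Orange.evaluation.CrossValidation(train, learners, k=10)
print("cross-validation AUC:", Orange.evaluation.AUC(cv))

# ... then report the final score on the untouched test part
res = Orange.evaluation.TestOnTestData(train, test, learners)
print("test AUC:", Orange.evaluation.AUC(res))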

Seems like LogReg still performs well. Such a procedure is normally most useful when testing many models with different parameters (say, 100 or more), which you would not normally do in Orange. But it’s good to know how to do the scoring properly. Now we’re off to report on the results in Nature… 😉

Data Mining for Business and Public Administration

We’ve been having a blast with recent Orange workshops. While Blaž was getting tanned in India, Anže and I went to charming Liverpool to hold a session for business school professors on how to teach business with Orange.

Related: Orange in Kolkata, India

Obviously, when we say teach business, we mean how to do data mining for business: say, predict churn or employee attrition, segment customers, find which items to recommend in an online store, and track brand sentiment with text analysis.

For this purpose, we have made some updates to our Associate add-on and added a new data set to the Data Sets widget that can be used for customer segmentation and for discovering which item groups are frequently bought together. Like this:

We load the Online Retail data set.

Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute distances between items (now rows). We could also simply ask the Distances widget to compute distances between columns instead of rows. Then we send the transposed data table to Distances and compute the cosine distance between items (cosine distance only tells us which items are purchased together, disregarding the amounts purchased).
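In a script, the same trick is one transpose away. A small sketch with a toy binary basket matrix (transactions in rows, items in columns):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# toy basket matrix: 4 transactions (rows) x 3 items (columns)
baskets = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1],
                    [1, 0, 1]])

# distances between items are distances between rows of the transposed matrix
item_dist = cosine_distances(baskets.T)
print(np.round(item_dist, 2))   # small distance -> items frequently bought together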

Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration. 🙂

The second workshop was our standard Introduction to Data Mining for the Ministry of Public Affairs.

Related: Analyzing Surveys

This group, similar to the one from India, was a pack of curious individuals who asked many interesting questions and were not shy to challenge us. How does a Tree know which attribute to split by? Is Tree better than Naive Bayes? Or is perhaps Logistic Regression better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!

Workshops are always fun, when you have a curious set of individuals who demand answers! 🙂

Orange in Kolkata, India

We have just completed a hands-on course on data science at one of the most famous Indian educational institutions, the Indian Statistical Institute. The one-week course was held at the invitation of the Institute’s director, Prof. Dr. Sanghamitra Bandyopadhyay, and financially supported by funding from India’s Global Initiative of Academic Networks.

The Indian Statistical Institute lies in the heart of old Kolkata. A peaceful oasis of a picturesque campus, with mango orchards and water-lily lakes, it was founded by Prof. Prasanta Chandra Mahalanobis, one of the giants of statistics. Today, the Institute researches statistics and computational approaches to data analysis and runs a grad school, where a rather small number of students are hand-picked from tens of thousands of applicants.

The course was hands-on. The number of participants was limited to forty, a limit posed by the number of computers in the Institute’s largest computer lab. Half of the students came from the Institute’s grad school, and the other half from other universities around Kolkata or even other schools around India, including a few participants from another famous institution, the Indian Institutes of Technology. While the lectures included some writing on the whiteboard to explain machine learning, the majority of the course was about exploring example data sets, building workflows for data analysis, and using Orange on practical cases.

The course was not one of the lightest for the lecturer (Blaž Zupan): about five full hours each day for five days in a row, extremely motivated students with questions filling all of the coffee breaks, the need to dive deeper into some of the methods after questions in the classroom, and much improvisation to adapt our standard data science course to possibly the brightest pack of data science students we have seen so far. We covered almost the full spectrum of data science topics: from data visualization to supervised learning (classification and regression, regularization), model exploration and estimation of quality, plus computation of distances, unsupervised learning, outlier detection, data projection, and methods for parameter estimation. We applied these to data from health care, business (which proposal on Kickstarter will succeed?), and images. Again, just like in our other data science courses, the use of Orange’s educational widgets, such as Paint Data, Interactive k-Means, and Polynomial Regression, helped us build an intuitive understanding of the machine learning techniques.

The course was beautifully organized by Prof. Dr. Saurabh Das with the help of Prof. Dr. Shubhra Sankar Ray and we would like to thank them for their devotion and excellent organization skills. And of course, many thanks to participating students: for an educator, it is always a great pleasure to lecture and work with highly motivated and curious colleagues that made our trip to Kolkata fruitful and fun.

Neural Network is Back!

We know you’ve missed it. We’ve been getting many requests to bring back Neural Network widget, but we also had many reservations about it.

Neural networks are powerful and great, but doing them right is not straightforward. And doing them right in the context of a GUI-based visual programming tool like Orange is a twisted double helix of a roller coaster.

Do we make each layer a widget and then stack them? Do we use parallel processing or try to do something server-side? Theano or Keras? Tensorflow perhaps?

We were so determined to do things properly, that after the n-th iteration we still had no clue what to actually do.

Then one day a silly novice programmer (a.k.a. me) had enough and just threw scikit-learn’s Multi-layer Perceptron model into a widget and called it a day. There you go. A Neural Network widget just like it was in Orange2 – a wrapper for a scikit-learn class that works out of the box. Nothing fancy, nothing powerful, but it does its job. It models things and it predicts things.
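In essence, the widget wraps scikit-learn’s MLPClassifier. A minimal sketch of the underlying call, with iris as a stand-in data set:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
print(cross_val_score(mlp, X, y, cv=5).mean())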

Just like that:

Have fun with the new widget!