Datasets in Orange Bioinformatics Add-On

As you might know, Orange comes with several basic widget sets pre-installed. These allow you to load and explore data, visualize it, learn from it and make predictions. However, there are also some exciting add-ons available for installation. One of these is the bioinformatics add-on, which is our specialty.


The Bioinformatics widget set allows you to pursue complex analyses of gene expression by providing access to several external libraries. There are four widgets intended specifically for this – dictyExpress, GEO Data Sets, PIPAx and GenExpress. GEO Data Sets are sourced from NCBI, PIPAx and dictyExpress from two Biolab projects, and GenExpress from Genialis. Much of the data is freely accessible, while you will need a user account for the rest.

Once you open one of these widgets, select the experiments you wish to use for your analysis and view them in the Data Table widget. You can compare the experiments in Data Profiles, visualize them in Volcano Plot, select the most relevant genes in the Differential Expression widget, and much more.
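If you prefer scripting to widgets, the same GEO data has historically been reachable from Python through the add-on as well. The sketch below assumes the orange-bio module layout (orangecontrib.bio.geo) and its GDS.getdata() call; names and signatures may differ between add-on versions, and the accession is only a placeholder.

    from orangecontrib.bio import geo

    # download a GEO Data Set by accession (GDS10 is only an example;
    # replace it with an accession relevant to your research)
    gds = geo.GDS("GDS10")
    data = gds.getdata(transpose=True)  # samples as rows, genes as attributes
    print(len(data), "samples,", len(data.domain.attributes), "genes")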

 

Three widgets with experiment data libraries.

 

These databases enable you to start your research just by installing the bioinformatics add-on (Orange → Options → Add-ons…). The great thing is you can easily combine bioinformatics widgets with the basic pre-installed ones. What an easy way to immerse yourself in the exciting world of bioinformatics!

Visualizing Misclassifications

In data mining, classification is one of the key methods for making predictions and gaining important information from our data. We would, for example, use classification to predict which patients are likely to have a disease based on a given set of symptoms.

In Orange, an easy way to classify your data is to select several classification widgets (e.g. Naive Bayes, Classification Tree and Logistic Regression), compare the prediction quality of each learner with Test Learners and Confusion Matrix, and then use the best-performing classifier on a new data set. Below we use the Iris data set for simplicity, but the same procedure works just as well on all kinds of data sets.
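The same comparison can be scripted. Below is a minimal sketch using Orange 3's scripting API; note that in more recent Orange versions CrossValidation is instantiated first and then called with the data and learners.

    import Orange

    data = Orange.data.Table("iris")
    learners = [
        Orange.classification.NaiveBayesLearner(),
        Orange.classification.TreeLearner(),
        Orange.classification.LogisticRegressionLearner(),
    ]

    # 10-fold cross-validation, the scripting counterpart of Test Learners
    results = Orange.evaluation.CrossValidation(data, learners, k=10)

    for learner, ca in zip(learners, Orange.evaluation.CA(results)):
        print("{}: CA = {:.3f}".format(learner.name, ca))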

Here we have three confusion matrices for Naive Bayes (top), Classification Tree (middle) and Logistic Regression (bottom).

 

Three confusion matrices (Naive Bayes, Classification Tree and Logistic Regression)

 

We see that Classification Tree did the best, with only 9 misclassified instances. To see which instances were assigned a false class, we select the 'Misclassified' option in the widget, which highlights the misclassifications and feeds them to the Scatter Plot widget. In the graph we thus see the entire data set drawn with hollow dots and the selected misclassifications with filled dots.
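In a script, the 'Misclassified' selection boils down to comparing actual and predicted labels. A sketch, assuming the cross-validation results from the snippet above:

    # index of Classification Tree in the learners list above
    tree = 1
    wrong = results.actual != results.predicted[tree]
    print(int(wrong.sum()), "misclassified instances")

    # map the mask back to rows of the original table and inspect them
    for row in data[results.row_indices[wrong]]:
        print(row)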

Visualization of misclassified instances in the scatter plot.

 

Feel free to switch between learners in Confusion Matrix to see how the visualization changes for each of them.

 

Exploratory data analysis with Hierarchical Clustering

Today we will write about cluster analysis with the Hierarchical Clustering widget. We use the well-known Iris data set, which contains 150 Iris flowers, each belonging to one of three species (setosa, versicolor and virginica). To an untrained eye the three species look very much alike, so how could we best tell them apart? The data set contains measurements of sepal and petal dimensions (width and length), and we assume that these give rise to interesting clustering. But is this so?

Hierarchical Clustering workflow

 

To find clusters, we feed the data from the File widget to Distances and then into Hierarchical Clustering. The last widget in our workflow visualizes the hierarchical clustering dendrogram. In the dendrogram, let us annotate the branches with the corresponding Iris species (Annotation = Iris). We see that not all the clusters are composed of a single species – there are some mixed clusters containing both virginicas and versicolors.
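The same pipeline can be sketched in code: Orange computes the pairwise distances, and SciPy draws the dendrogram (the widget uses its own renderer). Average linkage is chosen here purely as an example.

    import numpy as np
    import matplotlib.pyplot as plt
    import Orange
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    data = Orange.data.Table("iris")
    distances = Orange.distance.Euclidean(data)  # 150 x 150 distance matrix

    # SciPy expects the condensed (upper-triangle) form of the matrix
    condensed = squareform(np.asarray(distances), checks=False)
    Z = linkage(condensed, method="average")

    dendrogram(Z, labels=[str(row.get_class()) for row in data])
    plt.show()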

Selected clusters in Hierarchical Clustering widget

 

To see these clusters, we select them in the Hierarchical Clustering widget by clicking on a branch. The selected data is then placed on the widget's output. Let us inspect it by adding Scatter Plot and PCA widgets. If we connect a Data Table directly to Hierarchical Clustering, we see the selected instances and the clusters they belong to. But if we first add the PCA widget, which decomposes the data into principal components, and then connect it to Scatter Plot, we see the selected instances in the adjusted scatter plot (with principal components on the x and y axes).
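The PCA step can also be sketched in scripting form; the snippet below projects Iris onto its first two principal components, which is exactly what the adjusted scatter plot displays.

    import matplotlib.pyplot as plt
    import Orange

    data = Orange.data.Table("iris")
    pca = Orange.projection.PCA(n_components=2)
    model = pca(data)          # fit the projection on the data
    transformed = model(data)  # a table with PC1 and PC2 as attributes

    plt.scatter(transformed.X[:, 0], transformed.X[:, 1], c=data.Y)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.show()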


 

Select other clusters in the Hierarchical Clustering widget to see how the scatter plot visualization changes. Combining widgets for unsupervised learning with visualizations in this way makes for an interesting exploratory data analysis.

 

Learn with Paint Data

The Paint Data widget might initially look like a kids' game, but in combination with other Orange widgets it becomes a simple and useful tool for conveying statistical concepts, such as k-means, hierarchical clustering and prediction models (like SVM, logistic regression, etc.).

The widget enables you to draw your data on a 2-D plane. You can name the x and y axes, select the number of classes (which are represented by different colors) and then position the points on a graph.

Several painting tools allow you to shape your data set to your specific needs: Brush paints several data instances at once, while Put places a single data instance. Select a data subset and view it in the Data Table widget, or zoom in to see the position of your points up close. Jitter and Magnet are opposite tools, which either spread the instances apart or pull them closer together.
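Under the hood, Paint Data simply emits a two-attribute table with a class column. Here is a sketch of building equivalent 'painted' data in a script; the blob positions and class names are made up for illustration.

    import numpy as np
    from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

    rng = np.random.RandomState(0)
    # two brushed blobs, one per class color
    X = np.vstack([rng.normal([0.3, 0.3], 0.05, size=(50, 2)),
                   rng.normal([0.7, 0.7], 0.05, size=(50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    domain = Domain([ContinuousVariable("x"), ContinuousVariable("y")],
                    DiscreteVariable("Class", values=("C1", "C2")))
    painted = Table.from_numpy(domain, X, y)
    print(painted[:3])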

 

The painted data is represented as a data table with two attributes, whose values are the points' coordinates. Such a data set is great for demonstrating k-means and hierarchical clustering methods, just as we do below. In the screenshot we see that k-means, with our particular settings, recognizes the clusters much better than hierarchical clustering. It returns a ranked list of scores, where the best (highest) score indicates the most likely number of clusters. Hierarchical clustering, however, does not even group the right classes together.
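For a scripted version of the comparison, here is a sketch with scikit-learn's implementations and silhouette scoring; the widget's own scoring may differ in detail.

    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # a stand-in for a painted data set: three blobs on a 2-D plane
    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

    for k in range(2, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        hc = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        print("k={}  k-means: {:.3f}  hierarchical: {:.3f}".format(
            k, silhouette_score(X, km), silhouette_score(X, hc)))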

Paint Data widget for comparing the precision of k-means and hierarchical clustering methods.

Another way to use Paint Data is to observe the performance of classification methods: we can alter the painted data to demonstrate the improvement or deterioration of prediction models. By painting the data points we can try to construct a data set that is difficult for one classifier but easy for another. Say, why does linear SVM fail on the data set below?
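For a concrete answer, here is a sketch with an XOR-like pattern, the classic shape you could paint to stump a linear SVM: no single line separates the two classes, while an RBF kernel copes easily.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # opposite quadrants share a class

    for kernel in ("linear", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
        print("{} SVM: mean accuracy = {:.3f}".format(kernel, scores.mean()))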

Use Paint Data to compare the prediction quality of several classifiers.

Happy painting!

Support vectors output in SVM widget

Did you know that the widget for the support vector machines (SVM) classifier can output its support vectors? And that you can visualize these in any other Orange widget? Together with the rest of Orange's widgets, this can provide some extra insight into how this popular classification algorithm works and what it actually does.

Ideally, that is, in the case of linear separability, support vector machines (SVM) find a hyperplane with the largest margin to any data instance. This margin touches a small number of data instances, which are called support vectors.

In Orange 3.0 you can set the SVM classification widget to also output the support vectors and visualize them. We loaded the Iris data set in the File widget and classified the data instances with the SVM classifier. Then we connected both widgets to Scatter Plot and selected Support Vectors in the SVM widget's output channel. This lets us see the support vectors in the Scatter Plot widget – they are represented by the bold dots in the graph.
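Here is a scripting sketch of the same idea, using scikit-learn's SVC directly (Orange 3's SVM learner wraps it); we train on two of the four Iris attributes so the picture stays 2-D, and circle the support vectors.

    import matplotlib.pyplot as plt
    import Orange
    from sklearn.svm import SVC

    data = Orange.data.Table("iris")
    X = data.X[:, 2:4]                 # petal length and petal width
    clf = SVC(kernel="linear", C=1.0).fit(X, data.Y)

    plt.scatter(X[:, 0], X[:, 1], c=data.Y, alpha=0.4)
    sv = clf.support_vectors_          # what the widget sends downstream
    plt.scatter(sv[:, 0], sv[:, 1], facecolors="none", edgecolors="black", s=90)
    plt.xlabel("petal length"); plt.ylabel("petal width")
    plt.show()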

Now feel free to try it with your own data set!

 

Support vectors output of the SVM widget with the Iris data set.