As you might know, Orange comes with several basic widget sets pre-installed. These allow you to upload and explore the data, visualize them, learn from them and make predictions. However, there are also some exciting add-ons available for installation. One of these is a bioinformatics add-on, which is our specialty.
The bioinformatics widget set lets you pursue complex analyses of gene expression by providing access to several external libraries. Four widgets are intended specifically for this: dictyExpress, GEO Data Sets, PIPAx and GenExpress. GEO Data Sets is sourced from NCBI, PIPAx and dictyExpress from two Biolab projects, and GenExpress from Genialis. A lot of the data is freely accessible, while you will need a user account for the rest.
Once you open one of these widgets, select the experiments you wish to use for your analysis and view them in the Data Table widget. You can compare these experiments in Data Profiles, visualize them in Volcano Plot, select the most relevant genes in the Differential Expression widget and much more.
These databases enable you to start your research just by installing the bioinformatics add-on (Orange → Options → Add-ons…). The great thing is you can easily combine bioinformatics widgets with the basic pre-installed ones. What an easy way to immerse yourself in the exciting world of bioinformatics!
In data mining, classification is one of the key methods for making predictions and gaining important information from our data. We would, for example, use classification to predict which patients are likely to have a disease based on a given set of symptoms.
In Orange, an easy way to classify your data is to select several classification widgets (e.g. Naive Bayes, Classification Tree and Logistic Regression), compare the prediction quality of each learner with Test Learners and Confusion Matrix, and then use the best-performing classifier on a new data set. Below we use the Iris data set for simplicity, but the same procedure works just as well on all kinds of data sets.
We see that Classification Tree did best, with only 9 misclassified instances. To see which instances were assigned a false class, we select the ‘Misclassified’ option in the Confusion Matrix widget, which highlights the misclassifications and feeds them to the Scatter Plot widget. In the graph we thus see the entire data set shown as empty dots and the selected misclassifications as full dots.
Feel free to switch between learners in Confusion Matrix to see how the visualization changes for each of them.
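What Confusion Matrix counts, and what its ‘Misclassified’ selection passes on, can be sketched in a few lines of plain Python. This is only an illustration and not Orange's own code: the function names are made up for this example, and the six labels are hypothetical, not the results from the workflow described above.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def misclassified(actual, predicted):
    """Return the indices of instances whose predicted label is wrong."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]


# Hypothetical labels for six flowers (assumed for illustration only).
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"]

matrix = confusion_matrix(actual, predicted)
wrong = misclassified(actual, predicted)

# Print one row per actual class, one column per predicted class.
labels = sorted(set(actual) | set(predicted))
for a in labels:
    print(a, [matrix[(a, p)] for p in labels])
print("misclassified indices:", wrong)
```

The list of misclassified indices plays the role of the selection that the widget sends on to Scatter Plot.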
Today we will write about cluster analysis with the Hierarchical Clustering widget. We use the well-known Iris data set, which contains 150 Iris flowers, each belonging to one of three species (setosa, versicolor and virginica). To an untrained eye the three species are very alike, so how could we best tell them apart? The data set contains measurements of sepal and petal dimensions (width and length), and we assume that these give rise to interesting clustering. But is this so?
To find clusters, we feed the data from the File widget to Distances and then into Hierarchical Clustering. The last widget in our workflow visualizes the hierarchical clustering as a dendrogram. In the dendrogram, let us annotate the branches with the corresponding Iris species (Annotation = Iris). We see that not all the clusters are composed of a single actual class: there are some mixed clusters with both virginicas and versicolors.
To see these clusters, we select them in the Hierarchical Clustering widget by clicking on a branch. The selected data will be fed to the output of this widget. Let us inspect the data we have selected by adding Scatter Plot and PCA widgets. If we connect a Data Table directly to Hierarchical Clustering, we see the selected instances and the clusters they belong to. But if we first add the PCA widget, which decomposes the data into principal components, and then connect it to Scatter Plot, we will see the selected instances in the adjusted scatter plot (where principal components are used for the x- and y-axes).
Select other clusters in the Hierarchical Clustering widget to see how the scatter plot visualization changes. This allows for interesting exploratory data analysis through a combination of widgets for unsupervised learning and visualization.
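For readers curious about what happens between Distances and the dendrogram, here is a minimal sketch of agglomerative clustering in plain Python. It assumes Euclidean distances and average linkage, which are just one of several possible settings, and it is not Orange's implementation. The nine (petal length, petal width) pairs are invented for illustration and are not rows taken from the Iris file.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


# Made-up (petal length, petal width) measurements, three per species.
points = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
          (4.5, 1.5), (4.7, 1.4), (4.9, 1.5),
          (5.1, 1.8), (6.0, 2.5), (5.9, 2.1)]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

clusters = agglomerate(points, n_clusters=3)
for members in clusters:
    print(members, sorted({species[i] for i in members}))
```

Cutting the hierarchy at three clusters is the counterpart of selecting branches in the dendrogram. On these toy values one cluster ends up holding both versicolor and virginica flowers, which is the kind of mixed cluster described above.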
The Paint Data widget might initially look like a kids’ game, but in combination with other Orange widgets it becomes a very simple and useful tool for conveying statistical concepts, such as k-means, hierarchical clustering and prediction models (like SVM, logistic regression, etc.).
The widget enables you to draw your data on a 2-D plane. You can name the x and y axes, select the number of classes (which are represented by different colors) and then position the points on a graph.
The data will be represented in a data table with two attributes, whose values correspond to the coordinates of each painted point. Such a data set is great for demonstrating k-means and hierarchical clustering methods, as we do below. In the screenshot we see that k-means, with our particular settings, recognizes clusters much better than hierarchical clustering. It returns a ranking of scores, where the best score (the one with the highest value) indicates the most likely number of clusters. Hierarchical clustering, however, doesn’t even group the right classes together.
Another way to use Paint Data is to observe the performance of classification methods, where we can alter the graph to demonstrate improvement or deterioration of prediction models. By painting the data points we can try to construct a data set that would be difficult for one classifier but easy for another. Say, why does linear SVM fail on the data set below?
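To make the k-means part concrete, here is a minimal sketch of the basic algorithm (Lloyd's iterations) on a handful of hand-picked 2-D points standing in for painted data. It is an illustration under simplifying assumptions, not the widget's implementation: the starting centroids are passed in explicitly, whereas real implementations usually pick them randomly or with a smarter heuristic, and the scoring of different numbers of clusters is left out entirely.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    centroids = [tuple(map(float, c)) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical "painted" points forming two well-separated groups.
painted = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(painted, centroids=[painted[0], painted[3]])
print(labels)
print(centroids)
```

With groups this far apart the result does not depend much on the starting centroids. Painting overlapping or elongated groups is a good way to see where k-means starts to struggle.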
Did you know that the widget for the support vector machine (SVM) classifier can output support vectors? And that you can visualize these in any other Orange widget? In the context of all other data sets, this could provide some extra insight into how this popular classification algorithm works and what it actually does.
Ideally, that is, in the case of linear separability, a support vector machine (SVM) finds a hyperplane with the largest margin to any data instance. This margin touches a small number of data instances, which are called support vectors.
In Orange 3.0 you can set the SVM classification widget to also output the support vectors and visualize them. We loaded the Iris data set in the File widget and classified the data instances with the SVM classifier. Then we connected both widgets to Scatter Plot and selected Support Vectors in the SVM output channel. This allows us to see the support vectors in the Scatter Plot widget, where they are represented by bold dots in the graph.
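The notion of a margin that touches only a few instances is easiest to see in one dimension, where the maximum-margin boundary can be worked out by hand: it lies halfway between the two closest instances of opposite classes, and those two instances are the support vectors. The sketch below covers only this toy case, with made-up numbers and a made-up function name. It is not how the SVM widget or any SVM library solves the general problem, which requires numerical optimization.

```python
def max_margin_1d(xs, ys):
    """Maximum-margin threshold for separable 1-D data with labels -1 and +1.

    Assumes every negative instance lies below every positive instance.
    Returns the threshold, the margin width on each side, and the indices
    of the support vectors.
    """
    highest_negative = max(x for x, y in zip(xs, ys) if y == -1)
    lowest_positive = min(x for x, y in zip(xs, ys) if y == +1)
    if highest_negative >= lowest_positive:
        raise ValueError("data are not linearly separable in this orientation")
    threshold = (highest_negative + lowest_positive) / 2
    margin = (lowest_positive - highest_negative) / 2
    # Only the instances lying exactly on the margin are support vectors.
    support = [i for i, x in enumerate(xs)
               if x in (highest_negative, lowest_positive)]
    return threshold, margin, support


# Hypothetical one-dimensional measurements with two classes.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 9.0]
ys = [-1, -1, -1, +1, +1, +1]

threshold, margin, support = max_margin_1d(xs, ys)
print(threshold, margin, support)
```

Moving any instance other than the two support vectors, without crossing the margin, leaves the boundary unchanged. This is why the support vectors alone tell you so much about the classifier.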