Explorative data analysis with Hierarchical Clustering

Today we will write about cluster analysis with the Hierarchical Clustering widget. We use the well-known Iris data set, which contains 150 Iris flowers, each belonging to one of three species (setosa, versicolor and virginica). To an untrained eye the three species are very alike, so how could we best tell them apart? The data set contains measurements of sepal and petal dimensions (width and length) and we assume that these give rise to interesting clustering. But is this so?

Hierarchical Clustering workflow

 

To find clusters, we feed the data from the File widget to Distances and then into Hierarchical Clustering. The last widget in our workflow visualizes the hierarchical clustering dendrogram. In the dendrogram, let us annotate the branches with the corresponding Iris species (Annotation = Iris). We see that not all the clusters are composed of a single actual class – there are some mixed clusters with both virginicas and versicolors.
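The Distances and Hierarchical Clustering steps can also be tried outside Orange. Here is a minimal sketch in plain Python, assuming Euclidean distances and average linkage (the post does not say which settings the widgets used). The measurements are made-up, Iris-like values, and `agglomerate` is a hypothetical helper, not an Orange function.

```python
from itertools import combinations
from math import dist

def distance_matrix(points):
    """Pairwise Euclidean distances (the job of the Distances step)."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def average_linkage(a, b, d):
    """Mean distance over all pairs with one point in cluster a and one in b."""
    return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    d = distance_matrix(points)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], d),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Invented (petal length, petal width) values, NOT read from the Iris file
flowers = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
           (4.7, 1.4), (4.5, 1.5),
           (5.9, 2.1), (5.6, 2.4)]
species = ["setosa"] * 3 + ["versicolor"] * 2 + ["virginica"] * 2

for cluster in agglomerate(flowers, 2):
    # Annotate each cluster with the species it contains
    print(cluster, sorted({species[i] for i in cluster}))
```

Cutting this toy dendrogram at two clusters leaves one cluster that mixes versicolor and virginica, the same kind of mixed cluster we see in the widget. Asking for three clusters separates them.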

Selected clusters in Hierarchical Clustering widget

 

To see these clusters, we select them in the Hierarchical Clustering widget by clicking on a branch. The selected data will be fed into the output of this widget. Let us inspect the data we have selected by adding Scatter Plot and PCA widgets. If we connect a Data Table directly to Hierarchical Clustering, we see the selected instances and the clusters they belong to. But if we first add the PCA widget, which decomposes the data into principal components, and then connect it to Scatter Plot, we will see the selected instances in the adjusted scatter plot (where principal components are used for the x- and y-axes).

HierarchicalClustering-Example2

 

Select other clusters in the Hierarchical Clustering widget to see how the scatter plot visualization changes. This allows for an interesting explorative data analysis through a combination of widgets for unsupervised learning and visualizations.

 

Learn with Paint Data

The Paint Data widget might initially look like a kids’ game, but in combination with other Orange widgets it becomes a very simple and useful tool for conveying statistical concepts, such as k-means, hierarchical clustering and prediction models (like SVM, logistic regression, etc.).

The widget enables you to draw your data on a 2-D plane. You can name the x and y axes, select the number of classes (which are represented by different colors) and then position the points on a graph.

PaintData-Example
Several painting tools allow you to manage your data set according to your specific needs; brush will paint several data instances at once, while put allows you to paint a single data instance. Select a data subset and view it in the Data Table widget or zoom in to see the position of your points up close. Jitter and magnet are opposite tools, which allow you either to spread the instances or to draw them closer together.

 

The data will be represented in a data table with two attributes, whose values correspond to the coordinates of the painted points. Such a data set is great for demonstrating k-means and hierarchical clustering methods, just like we do below. In the screenshot we see that k-means, with our particular settings, recognizes clusters much better than hierarchical clustering. It returns a ranking of scores, where the best score (the one with the highest value) indicates the most likely number of clusters. Hierarchical clustering, however, doesn’t even group the right classes together.

PaintData-k-means1
Paint Data widget for comparing precision of k-means and hierarchical clustering methods.

Another way to use Paint Data is to observe the performance of classification methods, where we can alter the graph to demonstrate improvement or deterioration of prediction models. By painting the data points we can try to construct a data set that would be difficult for one classifier but easy for another. Say, why does linear SVM fail on the data set below?

PaintData-TestLearners
Use Paint Data to compare prediction quality of several classifiers.

Happy painting!

Support vectors output in SVM widget

Did you know that the widget for support vector machines (SVM) classifier can output support vectors? And that you can visualise these in any other Orange widget? In the context of all other data sets, this could provide some extra insight into how this popular classification algorithm works and what it actually does.

Ideally, that is, in the case of linear separability, support vector machines (SVM) find a hyperplane with the largest margin to any data instance. This margin touches a small number of data instances that are called support vectors.
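To make "the margin touches the support vectors" concrete, here is a small sketch in plain Python. It does not train an SVM. It assumes a made-up, linearly separable toy set and simply states the maximum-margin hyperplane for it (x1 = 1, in the usual scaling where the closest instances satisfy y(w·x + b) = 1), then checks which instances lie exactly on the margin. All names and numbers are illustrative assumptions.

```python
from math import hypot, isclose

# Toy, linearly separable data: (x1, x2, label), labels are +1 or -1
data = [(2, 0, +1), (2, 1, +1), (3, 2, +1),
        (0, 0, -1), (0, 1, -1), (-1, 2, -1)]

# Maximum-margin hyperplane w.x + b = 0 for this toy set, stated by hand
# (an assumption for illustration, not the output of a fitted model)
w, b = (1.0, 0.0), -1.0

def functional_margin(x1, x2, label):
    """label * (w.x + b); at least 1 for every instance, exactly 1 on the margin."""
    return label * (w[0] * x1 + w[1] * x2 + b)

margins = [functional_margin(*row) for row in data]

# Support vectors are the instances that the margin touches
support_vectors = [row[:2] for row, m in zip(data, margins) if isclose(m, 1.0)]

# Width of the margin between the two classes
margin_width = 2 / hypot(*w)

print(support_vectors)
print(margin_width)
```

Only four of the six instances end up as support vectors; moving or deleting the other two (without crossing the margin) would not change the hyperplane.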

In Orange 3.0 you can set the SVM classification widget to also output the support vectors and visualize them. We loaded the Iris data set in the File widget and classified the data instances with the SVM classifier. Then we connected both widgets to Scatterplot and selected Support Vectors in the SVM output channel. This allows us to see the support vectors in the Scatterplot widget – they are represented by the bold dots in the graph.

Now feel free to try it with your own data set!

 

svm-with-support-vectors
Support vectors output of SVM widget with Iris data set.

Working with SQL data in Orange 3

Orange 3 is slowly, but steadily, gaining support for working with data stored in a SQL database. The main focus is to allow huge data sets that do not fit into RAM to be analyzed and visualized efficiently. Many widgets already recognize the type of input data and perform the necessary computations intelligently. This means that data is not downloaded from the database and analyzed locally, but is retained on the remote server, with the computation tasks translated into SQL queries and offloaded to the database engine. This approach takes advantage of the state-of-the-art optimizations relational databases have for working with data that does not fit into working memory, as well as minimizes the transfer of required information to the client.
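The difference between downloading the data and offloading the computation can be sketched with Python's built-in `sqlite3` module. This is only an illustration: an in-memory SQLite database stands in for the remote server, the table and column names are made up, and the query is not the SQL that Orange widgets actually generate.

```python
import sqlite3

# In-memory database standing in for a remote SQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("setosa", 1.0), ("setosa", 2.0),
     ("virginica", 5.0), ("virginica", 6.0), ("virginica", 7.0)],
)

# Local analysis: every row travels to the client before anything is computed
rows = conn.execute("SELECT species, petal_length FROM measurements").fetchall()

# Offloaded analysis: the database computes the summary statistics and
# only one small row per group is transferred
summary = conn.execute(
    "SELECT species, COUNT(*), MIN(petal_length), AVG(petal_length), MAX(petal_length) "
    "FROM measurements GROUP BY species ORDER BY species"
).fetchall()

for species, n, lowest, mean, highest in summary:
    print(species, n, lowest, mean, highest)
conn.close()
```

With five rows the saving is nothing, but the size of `summary` depends only on the number of groups, while `rows` grows with the table.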

We demonstrate how to explore and visualize data stored in a SQL table on a remote server in the following short video. It shows how to connect to the server and load the data with the SqlTable widget, manipulate the data (Select Columns, Select Rows), obtain the summary statistics (Box plot, Distributions), and visualize the data (Heat map, Mosaic Display).

 

 

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 318633

 

Viewing Images

Lately I have been having fun with Image Viewer. The widget has been recently updated and can display images stored locally or on the internet. But wait, what images? How on earth can Orange now display images if it can handle only tabular or basket-based data?

Here’s an example. I have considered a subset of animals from the zoo.csv data set (comes with Orange installation), and for demonstration purposes selected only a handful of attributes. I have added a new string attribute (“images”) and declared that this is a meta attribute of the type “image”. The values of this attribute are links to images on the web:

animals-dataset.png

Here is the resulting data set, zoo-with-images.csv. I have used this data set in a schema with hierarchical clustering, where upon selection of the part of the clustering tree I can display the associated images:

animals-schema.png

Typically and just like above, you would use a string meta attribute to store the link to images. Images can be referred to using a HTTP address, or, if stored locally, using a relative path from the data file location to the image files.
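The rule for locating an image can be written as a tiny function. This is a hypothetical helper sketched in plain Python to illustrate the two cases above; it is not the code used by Image Viewer, and the file names in the usage lines are invented.

```python
import os
from urllib.parse import urlparse

def resolve_image(reference, data_file):
    """Return where the image lives, given the value of the image meta attribute.

    HTTP(S) addresses are returned unchanged; anything else is treated as a
    path relative to the directory that contains the data file.
    """
    if urlparse(reference).scheme in ("http", "https"):
        return reference
    data_dir = os.path.dirname(data_file)
    return os.path.normpath(os.path.join(data_dir, reference))

# Hypothetical values, for illustration only
print(resolve_image("http://example.com/animals/bear.png", "data/zoo-with-images.csv"))
print(resolve_image("images/digit-0042.png", "data/digits/digits.tab"))
```

Keeping local paths relative to the data file means the data set and its image folder can be zipped and moved together.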

Here is another example, where all the images were local and we have associated them with a famous digits data set (digits.zip is a data set in the Orange format with the image files). The task for this data set is to classify handwritten digits based on their bitmap representation. In the schema below we wanted to find out which errors a classification algorithm makes most frequently, and what the images of the misclassified digits look like. It turns out that SVM with the RBF kernel most often misclassifies the digit 9 and confuses it with the digit 3:

digits-schema.png

Paint Your Data

One of the widgets I enjoy very much when teaching an introductory course in data mining is the Paint Data widget. When painting in this widget I would intentionally include some clusters, or intentionally obscure them. Or draw them in any strange shape. Then I would discuss with students whether these clusters are identified by k-means clustering or by hierarchical clustering. We would also discuss automatic scoring of the quality of clusters and come up with the idea of a silhouette (ok, already invented, but it helps if you get this idea on your own as well). And then we would play with various data sets and clustering techniques and their parameters in Orange.

Like in the following workflow where I drew three clusters which were indeed recognized by k-means clustering. Notice that silhouette scoring correctly identified even the number of clusters. And I also drew the clustered data in the Scatterplot to check if the clusters are indeed where they should be.

PaintData-k-Means-ok.png

Or like in the workflow below where k-means fails miserably (but some other clustering technique would not).

PaintData-k-Means-notok.png
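The silhouette idea from the discussion above fits in a few lines. This is a minimal sketch in plain Python, not the code Orange runs. The points are invented, and both labelings are assigned by hand (k-means itself is not run here). `silhouette` is a hypothetical helper that assumes every cluster has at least two points.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette over all points (each cluster needs at least two points)."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other points in the same cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Three painted blobs (made-up coordinates)
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8), (5, 9)]

three = silhouette(points, [0, 0, 1, 1, 2, 2])  # one label per blob
two = silhouette(points, [0, 0, 0, 0, 1, 1])    # two blobs forced together

print(round(three, 2), round(two, 2))
```

The labeling with three clusters gets the higher score, which is how the scoring can point to the number of clusters that was painted.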

Paint Data can also be used in a supervised setting, for classification tasks. We can set the intended number of classes, and then choose any of these to paint the data. Below I have used it to create the data sets to check the behavior of several classifiers.

PaintData-Supervised.png

There are tons of other workflows where Paint Data can be useful. Give it a try!

3D Visualizations in Orange

Over the summer I worked (and still do) on several new 3D visualization widgets as well as a 3D plotting library they use, which will hopefully simplify making more widgets. The library is designed to be similar in terms of API to the new Qt plotting library Noughmad is working on.

The library uses OpenGL 2/3: since Khronos deprecated parts of the old OpenGL API (particularly immediate mode and fixed-function functionality) care has been taken to use only capabilities less likely to go away in the years to come. All the drawing is done using shaders; geometry data is fed to the graphics hardware using Vertex Buffers. The library is fully functional under OpenGL 2.0; when hardware supports newer versions (3+), several optimizations are possible (e.g. geometry processing is done on the GPU rather than on CPU), possibly resulting in improved user experience.

Widgets I worked on and are reasonably usable:

ScatterPlot3D

ScatterPlot3D displaying the Titanic dataset

Its GUI has the same options as the ordinary ScatterPlot (2D), with an additional dropdown for the third attribute (Z) and some new checkboxes (e.g. 2D/3D symbols). The data can be easily rotated, translated and scaled. Supports zoom levels and selections as well. VizRank works. Thanks to hardware acceleration, ScatterPlot3D is quite responsive even with larger datasets (30k examples).

LinProj3D

LinProj3D in action

LinProj3D is displayed using dark theme (themes are available in all 3D widgets).

Sphereviz3D

Sphereviz3D

Sphereviz3D has the 2D symbols option enabled (also available in all 3D widgets). VizRank has been modified to work with three dimensions; PCA and SPCA options under FreeViz return the first three most important components when used in these widgets.

Future

Documentation for widgets and the library is still missing. Some additional widgets are being considered, such as NetExplorer3D.

I wrote a few technical details here.

GSoC Review: Visualizations with Qt

During the course of this summer, I created a new plotting library for Orange plot, replacing the use of PyQwt. I can say that I have successfully completed my project, but the library (and especially the visualization widgets) could still use some more work. The new library supports a similar interface, so little change is needed to convert individual widgets, but it also has several advantages over the old implementation:

  • Animations: When using a single curve to show all data points, data changes only move the points instead of replacing them. These moves are now animated, as are color and size changes.
  • Multithreading: All position calculations are done in separate threads, so the interface remains responsive even when a long operation is running in the background.
  • Speed: I removed several occurrences of needlessly clearing and repopulating the graph.
  • Simplicity: Because it was written with Orange in mind, the new library has functions that match Orange’s data structures. This leads to simpler code in widgets using the library, and fewer operations in Python.
  • Appearance: The plot can use the system palette, or a custom color theme. In general, I think it looks much nicer than Qwt-based plots.
  • Documentation: There is extensive API documentation (will soon be available at Orange 2.5 documentation), as well as two widget examples.

However, there are also disadvantages to using the new library. They are not many, and I’ve been trying to keep them as few and as small as possible, but there still are some.

  • Line rendering: For some reason, whenever lines are rendered on the plot, the whole widget starts acting very slow. The effect is even more noticeable when zooming. As far as I can tell, this happens in Qt’s drawing libraries, so there is not much I can do about it.
  • Axis labels: With a large number of long axis labels, the formatting gets rather ugly. This is a minor inconvenience, but it does make the plots look unprofessional.

Fortunately, I have few school obligations this September, so I think I will be able to work on Orange some more, at least until school starts. I have already added gesture support and some minor improvements since the end of the program.

Finally, I’d like to take this opportunity to thank the Orange team, especially my mentor Miha, for accepting me and helping me throughout the summer. It’s been an interesting project, and I’ll be happy to continue working with the same software and the same team.

NetworkX in Orange

NetworkX, a popular open-source Python library for network analysis, has finally found its way into Orange. It is now used as a base class for network representation in all Orange modules and widgets. With that, we offer the widespread network community a fruitful and fun way to visualize and explore networks, using their existing NetworkX scripts. It has never been easier to combine network analysis and visualization with existing machine learning and data discovery methods.

Complete documentation is available in the Orange network headquarters. For a brief overview, take a look at the following example. Let us suppose we would like to analyse data about patients with one of two types of leukemia. So, we have a data set with 72 patients, 4600+ gene expressions and a class variable. We also have a vast network of human genes, connected if they share a biological function. What we would like to examine is a sub-network with only the several hundred most expressed genes from the data set. To show off a bit, we will also use the Orange Bioinformatics add-on. Here is how we do it:

import Orange
import obiExpression

# load leukemia data set
table = Orange.data.Table("/media/Ox/Projects_Archive/res/BIO/leukemia/leukemiaGSEA.tab")

useAttributeLabels = False
ttest = obiExpression.ExpressionSignificance_TTest(table, useAttributeLabels)

target = [table.domain.classVar(0), table.domain.classVar(1)]

# test for significantly expressed genes
score = ttest(target = target)

# each gene is scored (t-test, p-value)
print(score[0])
# prints: (FloatVariable 'HIST1H4C', (1.8377179790830149, 0.07034778767062116))

# sort by p-value (second element of the score tuple)
score.sort(key=lambda s: s[1][1])

# select 200 genes with the lowest p-value
important_genes = [gene_var.name for gene_var, s in score[:200]]

# read the gene network (5000+ genes, dense network)
G = Orange.network.readwrite.read('genes_biofunct.gpickle')

items = G.items().filter_bool({'gene': important_genes})
indices = [i for i, present in enumerate(items) if present]

# build a subgraph of the 200 selected genes
G_sub = G.subgraph(indices)

In addition to the power of the scripting environment, we also get the benefits of visual data exploration with Orange widgets. However, network widgets are currently under heavy development, so expect some bugs if you dare to try them. Coding should be finished in a month or two; check the blog for progress updates. Here is how to open the network in the Nx Explorer widget:

import sys
# import QApplication explicitly; a bare "import PyQt4" does not load QtGui
from PyQt4.QtGui import QApplication

# must have OWNxExplorer in python path!
import OWNxExplorer

app = QApplication(sys.argv)
ow = OWNxExplorer.OWNxExplorer()
ow.show()

# set the network
ow.set_graph(G_sub)
app.exec_()