Network Analysis with Orange

Visualizing relations between data instances can tell us a lot about our data. Let’s see how this works in Orange. We have a data set on machine learning and data mining conferences and journals, reporting the number of authors shared between publication venues. We can estimate the similarity between two venues from their author profiles: two venues are similar if they attract the same authors. The data set is already 9 years old, but it’s the principle that counts. 🙂 We’ve got two data files: one is a distance file with distances already calculated using the Jaccard index, and the other is a standard conferences.tab file.
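
To make the distance scores more concrete, here is a minimal sketch of a Jaccard distance computed from author sets. The venue names and author lists are made up for illustration, and the actual conferences.dst file may have been prepared differently.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets (0 = identical, 1 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / float(len(union))

# hypothetical author profiles, not taken from the real data set
authors = {
    "Venue A": {"ann", "bob", "cid", "dee"},
    "Venue B": {"bob", "cid", "eve"},
    "Venue C": {"fay", "gus"},
}

names = sorted(authors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = jaccard_distance(authors[first], authors[second])
        print("%s - %s: %.2f" % (first, second, d))
```

Venues that share many authors get a distance close to 0, while venues with no authors in common get a distance of 1.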

Conferences.tab data file with the type of the publication venue (conference or journal) and average number of authors and published papers.

We load the .tab file with the File widget (the data set already comes with Orange) and the .dst file with the Distance File widget (select ‘Browse documentation data sets’ and choose conferences.dst).

You can find conferences.dst in ‘Browse documentation data sets’.

Now we would like to create a graph from the distance file. Connect Distance File to Network from Distances. In the widget, we’ve selected a high distance threshold, because we would like to get more connections between nodes. We’ve also checked ‘Include also closest neighbors’, so that each node is connected to at least one other node.
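
The thresholding idea can be sketched in a few lines of plain Python. This illustrates the principle only and is not the widget’s actual implementation; the function name, the toy distance matrix and the labels are assumptions made for this example.

```python
def network_from_distances(labels, dist, threshold, include_closest=True):
    """Build an undirected edge set from a symmetric distance matrix.

    Two nodes are linked when their distance is at most `threshold`.
    With `include_closest`, every node is also linked to its nearest
    neighbour, so no node is left isolated.
    """
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                edges.add((labels[i], labels[j]))
    if include_closest and n > 1:
        for i in range(n):
            # nearest other node; ties go to the lowest index
            j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
            a, b = sorted((i, j))
            edges.add((labels[a], labels[b]))
    return edges

# toy data, made up for illustration
labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.95],
    [0.8, 0.9, 0.95, 0.0],
]

strict = network_from_distances(labels, dist, threshold=0.5, include_closest=False)
linked = network_from_distances(labels, dist, threshold=0.5)
print(sorted(strict))
print(sorted(linked))
```

With the threshold alone, only A and B are linked; adding the closest neighbors also attaches C and D to the network. Raising the threshold has a similar effect, at the cost of many more edges.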

We’ve set a high distance threshold, since we wanted to display connections between most of our nodes.

We can visualize our graph in Network Explorer. What we get is a rather uninformative network of conferences with labelled nodes. Now for the fun part. Connect the File widget to Network Explorer and set the link type to ‘Node Data’. This matches the two domains and displays additional labelling options in Network Explorer.

Remove the ‘Node Subset’ link and connect Data to Node Data. This will display other attributes in Network Explorer by which you can label and color your network nodes.

Nodes are colored by event type (conference or journal) and adjusted in size by the average number of authors per event (bigger nodes represent larger events).

We’ve colored the nodes by type and set the size of the nodes to the average number of authors per event. Finally, we’ve set the node label to ‘name’. It seems that the International Conference on AI and Law and the AI and Law journal are connected through the number of shared authors. The same goes for the AI in Medicine in Europe conference and the AI and Medicine journal. These connections indeed make sense.
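
For readers who prefer to see the mapping spelled out, here is a rough sketch of how node data could be turned into colors and sizes. Network Explorer does this for you; the function name, the size range, the color palette and the rows below are all hypothetical.

```python
def scale_sizes(values, min_size=5.0, max_size=30.0):
    """Linearly map attribute values to node sizes in [min_size, max_size]."""
    lo, hi = min(values.values()), max(values.values())
    span = float(hi - lo)
    sizes = {}
    for node, value in values.items():
        # if all values are equal, fall back to the midpoint size
        ratio = (value - lo) / span if span else 0.5
        sizes[node] = min_size + ratio * (max_size - min_size)
    return sizes

# hypothetical rows: venue name -> (type, average number of authors)
rows = {
    "Venue A": ("conference", 120.0),
    "Venue B": ("journal", 40.0),
    "Venue C": ("conference", 80.0),
}
palette = {"conference": "blue", "journal": "red"}  # arbitrary colors

colors = {name: palette[kind] for name, (kind, _) in rows.items()}
sizes = scale_sizes({name: count for name, (_, count) in rows.items()})
for name in sorted(rows):
    print("%s %s %.1f" % (name, colors[name], sizes[name]))
```

The node name acts as the key that ties a row of the data table to a node in the network, which is what the ‘Node Data’ link relies on as well.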

The entire workflow.

There are many other things you can do with the Networks add-on in Orange. You can color nodes by predictions, highlight misclassifications or output only nodes with certain network parameters. But for today, let this be it.

NetworkX in Orange

NetworkX, a popular open-source Python library for network analysis, has finally found its way into Orange. It is now used as the base class for network representation in all Orange modules and widgets. This gives the wider network analysis community a fruitful and fun way to visualize and explore networks using their existing NetworkX scripts. It has never been easier to combine network analysis and visualization with existing machine learning and data discovery methods.

Complete documentation is available in the Orange network headquarters. For a brief overview, take a look at the following example. Suppose we would like to analyse data about patients who have one of two types of leukemia. We have a data set with 72 patients, 4600+ gene expressions and a class variable. We also have a vast network of human genes, in which two genes are connected if they share a biological function. What we would like to examine is a sub-network containing only the several hundred most expressed genes from the data set. To show off a bit, we will also use the Orange Bioinformatics add-on. Here is how we do it:

import Orange
import obiExpression

# load leukemia data set
table = Orange.data.Table("/media/Ox/Projects_Archive/res/BIO/leukemia/leukemiaGSEA.tab")

useAttributeLabels = False
ttest = obiExpression.ExpressionSignificance_TTest(table, useAttributeLabels)

target = [table.domain.classVar(0), table.domain.classVar(1)]

# test for significantly expressed genes
score = ttest(target = target)

# each gene is scored with a (t-statistic, p-value) pair
print(score[0])
# (FloatVariable 'HIST1H4C', (1.8377179790830149, 0.07034778767062116))

# sort by p-value (second element of the score pair)
score.sort(key=lambda s: s[1][1])

# select 200 genes with the lowest p-value
important_genes = [gene_var.name for gene_var, s in score[:200]]

# read the gene network (5000+ genes, dense network)
G = Orange.network.readwrite.read('genes_biofunct.gpickle')

items = G.items().filter_bool({'gene': important_genes})
indices = [i for i, present in enumerate(items) if present]

# build a subgraph of the 200 most expressed genes
G_sub = G.subgraph(indices)
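
The selection logic above can also be followed without Orange installed. The sketch below mirrors the two key steps, ranking genes by p-value and keeping only the edges between the selected genes, in plain Python. The helper names, gene names, scores and edges are made up for illustration.

```python
def top_genes(scores, k):
    """Return the names of the k genes with the lowest p-value.

    `scores` is a list of (gene_name, (t_statistic, p_value)) pairs.
    """
    ranked = sorted(scores, key=lambda s: s[1][1])
    return [name for name, _ in ranked[:k]]

def induced_subgraph(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    keep = set(keep)
    return [(u, v) for u, v in edges if u in keep and v in keep]

# made-up scores and edges, for illustration only
scores = [
    ("G1", (1.8, 0.070)),
    ("G2", (4.2, 0.001)),
    ("G3", (0.3, 0.800)),
    ("G4", (3.1, 0.004)),
]
edges = [("G1", "G2"), ("G2", "G4"), ("G3", "G4"), ("G1", "G4")]

important = top_genes(scores, 3)
sub_edges = induced_subgraph(edges, important)
print(important)
print(sub_edges)
```

On a NetworkX graph object, the subgraph method used in the script above does the second step for you.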

In addition to the power of the scripting environment, we also get the benefits of visual data exploration with Orange widgets. However, the network widgets are currently under heavy development, so expect some bugs if you dare to try them. Coding should be finished in a month or two; check the blog for progress updates. Here is how to open the network in the Nx Explorer widget:

import sys
# import the QtGui submodule explicitly; a bare "import PyQt4" does not load it
from PyQt4 import QtGui

# OWNxExplorer must be on the Python path!
import OWNxExplorer

app = QtGui.QApplication(sys.argv)
ow = OWNxExplorer.OWNxExplorer()
ow.show()

# set the network
ow.set_graph(G_sub)
app.exec_()