Debian packages support multiple Python versions now

We have created Debian packages for multiple Python versions. Out of the box they now work with both Python 2.6 and 2.7; if you compile them manually, they work with any supported version installed on your Debian-based system.

Practically, this means that you can now install them on current Debian and Ubuntu systems without compiling anything by hand. Give it a try: add our Debian package repository, then apt-get install python-orange for the Orange library/modules and/or orange-canvas for the GUI. If you install the latter package, type orange in a terminal and Orange Canvas will pop up.

3D Visualizations in Orange

Over the summer I worked (and am still working) on several new 3D visualization widgets, as well as a 3D plotting library they use, which will hopefully make writing further widgets simpler. The library's API is designed to be similar to that of the new Qt plotting library Noughmad is working on.

The library uses OpenGL 2/3: since Khronos deprecated parts of the old OpenGL API (particularly immediate mode and fixed-function functionality), care has been taken to use only capabilities that are less likely to go away in the years to come. All drawing is done with shaders; geometry data is fed to the graphics hardware through vertex buffers. The library is fully functional under OpenGL 2.0; when the hardware supports newer versions (3+), several optimizations become possible (e.g. geometry processing on the GPU rather than the CPU), which can improve the user experience.
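As an illustration of this pattern (a minimal PyOpenGL sketch, not code from the library itself; all names here are invented), geometry is uploaded once into a vertex buffer and drawn with a small GLSL program, with no immediate-mode calls:

import numpy
from OpenGL.GL import *
from OpenGL.GL.shaders import compileProgram, compileShader

VERTEX_SHADER = """
attribute vec3 position;
uniform mat4 projection;
void main() {
    gl_Position = projection * vec4(position, 1.0);
}
"""

FRAGMENT_SHADER = """
void main() {
    gl_FragColor = vec4(1.0, 0.5, 0.0, 1.0);
}
"""

def upload_points(points):
    # Upload point coordinates into a vertex buffer once; this requires a
    # current OpenGL context (e.g. inside a QGLWidget's initializeGL).
    vbo = glGenBuffers(1)
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    glBufferData(GL_ARRAY_BUFFER,
                 numpy.asarray(points, dtype=numpy.float32), GL_STATIC_DRAW)
    program = compileProgram(compileShader(VERTEX_SHADER, GL_VERTEX_SHADER),
                             compileShader(FRAGMENT_SHADER, GL_FRAGMENT_SHADER))
    return vbo, program

def draw_points(vbo, program, num_points, projection):
    # Draw all points in a single call; no glBegin/glEnd anywhere.
    glUseProgram(program)
    glUniformMatrix4fv(glGetUniformLocation(program, "projection"),
                       1, GL_FALSE, projection)
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    loc = glGetAttribLocation(program, "position")
    glEnableVertexAttribArray(loc)
    glVertexAttribPointer(loc, 3, GL_FLOAT, GL_FALSE, 0, None)
    glDrawArrays(GL_POINTS, 0, num_points)
    glDisableVertexAttribArray(loc)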

Widgets I have worked on that are reasonably usable:


ScatterPlot3D displaying the Titanic dataset

Its GUI has the same options as the ordinary (2D) ScatterPlot, with an additional dropdown for the third attribute (Z) and some new checkboxes (e.g. 2D/3D symbols). The data can easily be rotated, translated and scaled; zoom levels and selections are supported as well, and VizRank works. Thanks to hardware acceleration, ScatterPlot3D is quite responsive even with larger datasets (30k examples).


LinProj3D in action

LinProj3D is displayed using the dark theme (themes are available in all 3D widgets).



Sphereviz3D is shown with the 2D symbols option enabled (also available in all 3D widgets). VizRank has been modified to work in three dimensions, and the PCA and SPCA options under FreeViz return the first three most important components when used in these widgets.


Documentation for the widgets and the library is still missing. Some additional widgets are being considered, such as NetExplorer3D.

I wrote a few technical details here.

GSoC Review: Visualizations with Qt

During the course of this summer, I created a new plotting library for Orange, replacing PyQwt. I can say that I have successfully completed my project, but the library (and especially the visualization widgets) could still use some more work. The new library supports a similar interface, so little change is needed to convert individual widgets, but it also has several advantages over the old implementation:

  • Animations: When using a single curve to show all data points, data changes only move the points instead of replacing them. These moves are now animated, as are color and size changes (see the sketch after this list).
  • Multithreading: All position calculations are done in separate threads, so the interface remains responsive even when a long operation is running in the background.
  • Speed: I removed several occurrences of needlessly clearing and repopulating the graph.
  • Simplicity: Because it was written with Orange in mind, the new library has functions that match Orange’s data structures. This leads to simpler code in widgets using the library, and fewer operations in Python.
  • Appearance: The plot can use the system palette, or a custom color theme. In general, I think it looks much nicer than Qwt-based plots.
  • Documentation: There is extensive API documentation (soon to be available in the Orange 2.5 documentation), as well as two example widgets.
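To illustrate the animation point (a minimal sketch; PointItem and move_point are invented names for this example, not the library's actual API), moving an existing point item with Qt's animation framework looks roughly like this:

from PyQt4.QtCore import QPointF, QPropertyAnimation, QRectF
from PyQt4.QtGui import QGraphicsObject

class PointItem(QGraphicsObject):
    # QGraphicsObject (rather than plain QGraphicsItem) makes 'pos' an
    # animatable Qt property.
    def boundingRect(self):
        return QRectF(-3, -3, 6, 6)

    def paint(self, painter, option, widget=None):
        painter.drawEllipse(self.boundingRect())

def move_point(item, x, y, msec=250):
    # When the data changes, animate the existing item to its new position
    # instead of deleting it and creating a new one.
    anim = QPropertyAnimation(item, "pos", item)
    anim.setDuration(msec)
    anim.setEndValue(QPointF(x, y))
    anim.start()
    return anim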

However, there are also disadvantages to using the new library. I have tried to keep them as few and as small as possible, but some remain.

  • Line rendering: For some reason, whenever lines are rendered on the plot, the whole widget becomes very slow; the effect is even more noticeable when zooming. As far as I can tell, this happens inside Qt’s drawing libraries, so there is not much I can do about it.
  • Axis labels: With a large number of long axis labels, the formatting gets rather ugly. This is a minor inconvenience, but it does make the plots look unprofessional.

Fortunately, I have few school obligations this September, so I think I will be able to work on Orange some more, at least until school starts. I have already added gesture support and some minor improvements since the end of the program.

Finally, I’d like to take this opportunity to thank the Orange team, especially my mentor Miha, for accepting me and helping me throughout the summer. It’s been an interesting project, and I’ll be happy to continue working with the same software and the same team.

GSoC Review: Multi-label Classification Implementation

Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label l from a set of disjoint labels L, |L| > 1. If |L| = 2, then the learning problem is called a binary classification problem, while if |L| > 2, then it is called a multi-class classification problem (Tsoumakas & Katakis, 2007).

Multi-label classification methods are increasingly used in many applications, such as textual data classification, protein function classification, music categorization and semantic scene classification. However, Orange could previously handle only single-label problems. As a result, the project Multi-label Classification Implementation was proposed to extend Orange to support multi-label data.

We can group the existing methods for multi-label classification into two main categories: a) problem transformation methods and b) algorithm adaptation methods. In the former, multi-label problems are converted into single-label problems, to which traditional classifiers can then be applied; in the latter, the methods classify the multi-label data directly.
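As a sketch of the transformation idea (plain Python, not the actual Orange implementation), binary relevance trains one independent binary classifier per label and predicts the union of their positive outputs:

def binary_relevance_fit(examples, label_sets, fit_binary):
    # examples: feature vectors; label_sets: the label set of each example;
    # fit_binary(examples, targets) -> model is any binary learner.
    all_labels = sorted(set(l for ls in label_sets for l in ls))
    models = {}
    for label in all_labels:
        targets = [1 if label in ls else 0 for ls in label_sets]
        models[label] = fit_binary(examples, targets)
    return models

def binary_relevance_predict(models, example):
    # The predicted label set is the union of positive binary predictions.
    return set(l for l, model in models.items() if model(example) == 1)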

In this project, two transformation methods and two algorithm adaptation methods have been implemented, along with their widgets. As the evaluation metrics for multi-label data differ from the single-label ones, new evaluation measures are also supported. The code is available in an SVN branch.

Fortunately, thanks to the Tab file format, the ExampleTable can store multi-label data without any modification: we add a special key, label, with value 1 to the attributes dictionary of each label attribute in the domain. If an attribute’s description contains the label keyword, it is treated as a label; otherwise, it is an ordinary feature.
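For illustration, a hypothetical multi-label .tab file could look like this (columns are tab-separated; the attribute names are invented): the first row holds names, the second row types, and the third row flags, where label=1 marks the label attributes:

sports      politics    word1       word2
discrete    discrete    continuous  continuous
label=1     label=1
1           0           0.12        0.05
0           1           0.01        0.27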

What has been done in this project

Transformation methods

  • br – Binary Relevance Learner (Tsoumakas & Katakis, 2007)
  • lp – Label Powerset Classification (Tsoumakas & Katakis, 2007)

Algorithm Adaptation methods

  • mlknn – Multi-kNN Classification (Zhang & Zhou, 2007)
  • brknn – BR-kNN Classification (Spyromitros et al., 2008); see the sketch below
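For intuition, a bare-bones BR-kNN-style predictor (plain Python, not the Orange code) lets the k nearest neighbours vote on each label independently:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brknn_predict(examples, label_sets, query, k=5):
    # Indices of the k training examples closest to the query ...
    nearest = sorted(range(len(examples)),
                     key=lambda i: euclidean(examples[i], query))[:k]
    # ... then keep each label carried by a majority of those neighbours.
    candidates = set(l for i in nearest for l in label_sets[i])
    return set(l for l in candidates
               if sum(l in label_sets[i] for i in nearest) * 2 > k)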

Evaluation methods

  • mlc_hamming_loss – Example-based Hamming Loss (Schapire & Singer, 2000); see the sketch below
  • mlc_accuracy, mlc_precision, mlc_recall – Example-based accuracy, precision, recall (Godbole & Sarawagi, 2004)
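For a quick sense of the example-based Hamming loss (a sketch, not the Orange implementation): it averages the size of the symmetric difference between the true and predicted label sets over all examples, normalized by the number of labels:

def mlc_hamming_loss_sketch(true_sets, pred_sets, n_labels):
    # One term per example: |true XOR predicted| / |L|, then averaged.
    total = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return total / float(n_labels * len(true_sets))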

Widgets

  • OWBR – Widget for Binary Relevance Learner
  • OWLP – Widget for Label Powerset Classification
  • OWMLkNN – Widget for Multi-kNN Classification
  • OWBRkNN – Widget for BR-kNN Classification
  • OWTestLearner – Widget for Evaluation

File Format Extension

As described above, the Tab file format has been extended: attributes flagged with label=1 in their description are recognized as labels, so multi-label data can be stored without any changes to ExampleTable.

Plan for the future

  • add more multi-label classification methods, such as PT1 to PT6
  • add feature extraction methods
  • add ranking-based evaluation methods

How to use

Basically, multi-label classification and evaluation are used in nearly the same way as their single-label counterparts; the only difference is the type of the input data.

Example for Classification

import Orange

# Load a multi-label dataset (the emotions data, as in the evaluation example below)
data ="emotions.xml")

classifier = Orange.multilabel.BinaryRelevanceLearner(data)
for e in data:
    c, p = classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

powerset_classifier = Orange.multilabel.LabelPowersetLearner(data)
for e in data:
    c, p = powerset_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

mlknn_classifier = Orange.multilabel.MLkNNLearner(data, k=1)
for e in data:
    c, p = mlknn_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

brknn_classifier = Orange.multilabel.BRkNNLearner(data, k=1)
for e in data:
    c, p = brknn_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

Example for Evaluation

import Orange

learners = [
    Orange.multilabel.BinaryRelevanceLearner(name="br"),
    Orange.multilabel.LabelPowersetLearner(name="lp"),
]

data ="emotions.xml")

res = Orange.evaluation.testing.cross_validation(learners, data, 2)
loss = Orange.evaluation.scoring.mlc_hamming_loss(res)
accuracy = Orange.evaluation.scoring.mlc_accuracy(res)
precision = Orange.evaluation.scoring.mlc_precision(res)
recall = Orange.evaluation.scoring.mlc_recall(res)
print 'loss=', loss
print 'accuracy=', accuracy
print 'precision=', precision
print 'recall=', recall


References

  • G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
  • E. Spyromitros, G. Tsoumakas and I. Vlahavas. An empirical study of lazy multilabel classification algorithms. Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008), Springer, Syros, Greece, 2008.
  • M. Zhang and Z. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038-2048, 2007.
  • S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. Proc. 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004), 2004.
  • R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.

GSoC Review: MF – Matrix Factorization Techniques for Data Mining

MF – Matrix Factorization Techniques for Data Mining is a Python scripting library that includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and that facilitates combining these to produce new strategies. The library provides a unified and efficient interface to matrix factorization algorithms and methods.

MF works with numpy dense matrices and scipy sparse matrices (where possible, to save space). The library supports multiple runs of the algorithms, which some quality measures require. By setting runtime-specific options, it is possible to track the residual error within one (or more) runs, or to track the fitted factorization model. Extensive documentation with working examples that demonstrate real applications, commonly used benchmark data and visualization methods is provided to help with interpretation and comprehension of the results.
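For example, a run with multiple repetitions and error tracking might look like this (a sketch only; the n_run and track_error parameter names are assumed from the library's run options, and the target matrix here is synthetic – see the full factorization example later in this post for the basic call):

import numpy
import mf

# Synthetic dense target matrix, just for illustration.
V = numpy.random.rand(30, 20)
# Repeat the factorization ten times and record the residual error,
# e.g. for consensus-based quality measures across runs.
fit =, method="nmf", rank=4, max_iter=30,
             n_run=10, track_error=True)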

Content of Current Release

Factorization Methods

  • BD – Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
  • BMF – Binary matrix factorization [Zhang2007]
  • ICM – Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
  • LFNMF – Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
  • LSNMF – Alternative nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
  • NMF – Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
  • NSNMF – Nonsmooth nonnegative matrix factorization [Montano2006]
  • PMF – Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
  • PSMF – Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
  • SNMF – Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
  • SNMNMF – Sparse network regularized multiple nonnegative matrix factorization [Zhang2011]

Initialization Methods

  • Random
  • Fixed
  • NNDSVD [Boutsidis2007]
  • Random C [Albright2006]
  • Random VCol [Albright2006]

Quality and Performance Measures

  • Distance
  • Residuals
  • Connectivity matrix
  • Consensus matrix
  • Entropy of the fitted NMF model [Park2007]
  • Dominant basis components computation
  • Explained variance
  • Feature score computation representing its specificity to basis vectors [Park2007]
  • Computation of most basis specific features for basis vectors [Park2007]
  • Purity [Park2007]
  • Residual sum of squares – can be used for rank estimate [Hutchins2008], [Frigyesi2008]
  • Sparseness [Hoyer2004]
  • Cophenetic correlation coefficient of consensus matrix – can be used for rank estimate [Brunet2004]
  • Dispersion [Park2007]
  • Factorization rank estimation
  • Measures specific to the selected matrix factorization method

Plans for the Future

The general plan for future releases of the MF library is to make it easier to use for non-technical users, to increase the library's stability and to provide comprehensive visualization methods. Specifically, on the algorithmic side, the following could be added:

  • Extending Bayesian methods with variational BD and linearly constrained BD.
  • Adaptation of the PMF model to interval-valued matrices.
  • Nonnegative matrix approximation with a multiplicative iterative scheme.


# Import MF library entry point for factorization
import mf

from scipy.sparse import csr_matrix
from scipy import array
from numpy import dot
# We will try to factorize a sparse matrix; construct it in CSR format.
V = csr_matrix((array([1, 2, 3, 4, 5, 6]), array([0, 2, 2, 0, 1, 2]), array([0, 2, 3, 6])), shape=(3, 3))

# Run the standard NMF rank 4 algorithm on V.
# The returned object is a fitted factorization model,
# through which the user can access quality and performance measures.
fit =, method="nmf", max_iter=30, rank=4, update="divergence", objective="div")

# Basis matrix. It is sparse, as input V was sparse as well.
W = fit.basis()
print "Basis matrix\n", W.todense()

# Mixture matrix. We print this tiny matrix in dense format.
H = fit.coef()
print "Coef\n", H.todense()

# Return the loss function according to Kullback-Leibler divergence. 
print "Distance Kullback-Leibler", fit.distance(metric = "kl")

# Compute generic set of measures to evaluate the quality of the factorization
sm = fit.summary()
# Print sparseness (Hoyer, 2004) of basis and mixture matrix
print "Sparseness W: %5.3f  H: %5.3f" % (sm['sparseness'][0], sm['sparseness'][1])
# Print actual number of iterations performed
print "Iterations", sm['n_iter']

# Print estimate of target matrix V
print "Estimate\n", dot(W.todense(), H.todense())


Examples with visualized results in bioinformatics, image processing, text analysis and recommendation systems are provided in the Examples section of the documentation.

Figure 1: Reordered consensus matrix generated for rank = 2 on the Leukemia data set.

Figure 2: Interpretation of NMF divergence basis vectors on the Medlars data set. By considering the highest-weighted terms in a basis vector, we can assign it a label or topic; for example, a user might attach the label liver to basis vector W1.

Figure 3: Basis images of LSNMF obtained after 500 iterations on original face images. The bases trained by LSNMF are additive but not spatially localized for representation of faces.