Multi-label classification (and Multi-target prediction) in Orange

Last summer, student Wencan Luo participated in Google Summer of Code and implemented multi-label classification in Orange. He provided a framework, implemented a few algorithms and some prototype widgets. His work had been “hidden” in our repositories for too long; we have finally merged part of his code into Orange (the widgets are not there yet …) and added more general support for multi-target prediction.

You can load multi-label tab-delimited data (e.g. emotions.tab) just like any other tab-delimited data:

>>> zoo = Orange.data.Table('zoo')            # single-target
>>> emotions = Orange.data.Table('emotions')  # multi-label

The difference is that zoo's domain has a non-empty class_var field, while the list of emotions' labels can be obtained through its domain's class_vars:

>>> zoo.domain.class_var
EnumVariable 'type'
>>> emotions.domain.class_vars
<EnumVariable 'amazed-suprised',
 EnumVariable 'happy-pleased',
 EnumVariable 'relaxing-calm',
 EnumVariable 'quiet-still',
 EnumVariable 'sad-lonely',
 EnumVariable 'angry-aggresive'>

A simple example of a multi-label learner is the “binary relevance” learner, which fits one binary classifier per label. Let's try it out.

>>> learner = Orange.multilabel.BinaryRelevanceLearner()
>>> classifier = learner(emotions)
>>> classifier(emotions[0])
[<orange.Value 'amazed-suprised'='0'>,
 <orange.Value 'happy-pleased'='0'>,
 <orange.Value 'relaxing-calm'='1'>,
 <orange.Value 'quiet-still'='1'>,
 <orange.Value 'sad-lonely'='1'>,
 <orange.Value 'angry-aggresive'='0'>]
>>> classifier(emotions[0], Orange.classification.Classifier.GetProbabilities)
[<1.000, 0.000>, <0.881, 0.119>, <0.000, 1.000>,
 <0.046, 0.954>, <0.000, 1.000>, <1.000, 0.000>]

The actual values of the label variables of the emotions[0] instance can be obtained by calling emotions[0].get_classes(), which is analogous to the get_class method in the single-target case.
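
For example, one can compare the classifier's predictions with the actual labels (a minimal sketch using the classifier trained above; output omitted, as it depends on the model):

>>> for predicted, actual in zip(classifier(emotions[0]),
...                              emotions[0].get_classes()):
...     print predicted, actual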

For multi-label classification we can also perform testing as usual; however, specialised evaluation measures have to be used:

>>> test = Orange.evaluation.testing.cross_validation([learner], emotions)
>>> Orange.evaluation.scoring.mlc_hamming_loss(test)
[0.2228780213603148]

One of the following blog posts will describe PLS, a multi-target regression method that is currently being implemented.

GSoC Mentor Summit

On 22nd and 23rd October 2011, the Google Summer of Code Mentor Summit took place in Mountain View, California. Google Summer of Code is Google's program for encouraging students to work on open-source projects during their summer break. Because Orange participated in the program this year, we decided to attend the summit as well, to get to know other mentors, open-source projects and organizations, exchange experiences, learn something new, and improve our connections and collaborations with others.

We went to the meeting together with another Slovenian open-source project, wlan slovenija, an open wireless network initiative. Because the summit itself was held on Google's premises, where taking photographs was forbidden, the photos are mostly from the trip itself and the area around the buildings. Some photos taken by others are also available.

The summit really met all expectations. We experienced what it is like at Google, met many new people, and the sessions were great, presenting a lot of interesting issues in open-source deployment and IT in general, and offering some ideas on how to solve them. We met many other researchers doing open-source science, developing various programs and libraries, and facing similar problems. We discussed ways of solving them: how to maintain the libraries we all use, how to make our projects survive once research is completed or funding ends and we move on to other research, and so on. Cooperation is the key here, but there is often not much time for it, as it requires extra time and energy that is often not part of research projects.

GSoC Review: Visualizations with Qt

During the course of this summer, I created a new plotting library for Orange, replacing the use of PyQwt. I can say that I have successfully completed my project, but the library (and especially the visualization widgets) could still use some more work. The new library supports a similar interface, so little change is needed to convert individual widgets, but it also has several advantages over the old implementation:

  • Animations: When using a single curve to show all data points, data changes only move the points instead of replacing them. These moves are now animated, as are color and size changes.
  • Multithreading: All position calculations are done in separate threads, so the interface remains responsive even when a long operation is running in the background.
  • Speed: I removed several occurrences of needlessly clearing and repopulating the graph.
  • Simplicity: Because it was written with Orange in mind, the new library has functions that match Orange's data structures. This leads to simpler code in widgets using the library, and fewer operations in Python.
  • Appearance: The plot can use the system palette, or a custom color theme. In general, I think it looks much nicer than Qwt-based plots.
  • Documentation: There is extensive API documentation (it will soon be available in the Orange 2.5 documentation), as well as two widget examples.

However, there are also disadvantages to using the new library. They are not many, and I’ve been trying to keep them as few and as small as possible, but there still are some.

  • Line rendering: For some reason, whenever lines are rendered on the plot, the whole widget becomes very slow. The effect is even more noticeable when zooming. As far as I can tell, this happens inside Qt's drawing libraries, so there is not much I can do about it.
  • Axis labels: With a large number of long axis labels, the formatting gets rather ugly. This is a minor inconvenience, but it does make the plots look unprofessional.

Fortunately, I have few school obligations this September, so I think I will be able to work on Orange some more, at least until school starts. I have already added gesture support and some minor improvements since the end of the program.

Finally, I’d like to take this opportunity to thank the Orange team, especially my mentor Miha, for accepting me and helping me throughout the summer. It’s been an interesting project, and I’ll be happy to continue working with the same software and the same team.

GSoC Review: Multi-label Classification Implementation

Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label l from a set of disjoint labels L, |L| > 1. If |L| = 2, then the learning problem is called a binary classification problem, while if |L| > 2, then it is called a multi-class classification problem (Tsoumakas & Katakis, 2007).

Multi-label classification methods are increasingly used in applications such as text classification, protein function classification, music categorization and semantic scene classification. Until now, however, Orange could only handle single-label problems. The project Multi-label Classification Implementation was therefore proposed to extend Orange with multi-label support.

We can group the existing methods for multi-label classification into two main categories: a) problem transformation methods and b) algorithm adaptation methods. In the former, multi-label problems are converted into single-label problems, to which traditional classification methods can then be applied; in the latter, the methods classify the multi-label data directly. A small illustration of the two transformation strategies follows below.
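
As a plain-Python illustration on toy data (not part of the Orange API): binary relevance creates one binary target per label, while label powerset treats each distinct label combination as a single class.

# toy illustration of the two problem transformation strategies
instances = [("x1", {"a", "b"}), ("x2", {"b"}), ("x3", set())]
labels = ["a", "b"]

# binary relevance: one single-label (binary) dataset per label
br = dict((l, [(x, l in ys) for x, ys in instances]) for l in labels)

# label powerset: each distinct label combination becomes one class
lp = [(x, frozenset(ys)) for x, ys in instances]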

In this project, two transformation methods and two algorithm adaptation methods have been implemented, along with their widgets. As the evaluation metrics for multi-label data differ from the single-label ones, new evaluation measures are also supported. The code is available in an SVN branch.

Fortunately, thanks to the flexibility of the tab file format, ExampleTable can store multi-label data without any modification. We simply add a special entry, label, with value 1 to the attributes dictionary of a variable: if the attribute description contains the label keyword, the variable is a label; otherwise, it is a normal feature.
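
For instance, the label variables of a loaded dataset could be picked out like this (a minimal sketch, assuming labels are marked with the label keyword in each variable's attributes dictionary, as described above):

import Orange

data = Orange.data.Table("emotions.tab")
# collect the variables whose description marks them as labels
label_vars = [var for var in data.domain if 'label' in var.attributes]
print [var.name for var in label_vars]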

What has been done in this project

Transformation methods

  • br – Binary Relevance Learner (Tsoumakas & Katakis, 2007)
  • lp – Label Powerset Classification (Tsoumakas & Katakis, 2007)

Algorithm Adaptation methods

  • mlknn – Multi-kNN Classification (Zhang & Zhou, 2007)
  • brknn – BR-kNN Classification (Spyromitros et al., 2008); see the sketch after this list
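
To give the flavour of the adaptation methods, here is a conceptual, plain-Python sketch of the BR-kNN prediction rule (illustrative only, not the Orange implementation): the k nearest neighbours are found once, and every label assigned by more than half of them is predicted.

# conceptual sketch of BR-kNN prediction (illustrative only)
def brknn_predict(neighbour_labelsets):
    # neighbour_labelsets: label sets of the k nearest neighbours
    k = len(neighbour_labelsets)
    counts = {}
    for labelset in neighbour_labelsets:
        for label in labelset:
            counts[label] = counts.get(label, 0) + 1
    # keep the labels supported by more than half of the neighbours
    return set(l for l, c in counts.items() if c > k / 2.0)

print brknn_predict([{"a", "b"}, {"b"}, {"b", "c"}])  # -> set(['b'])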

Evaluation methods

  • mlc_hamming_loss – Example-based Hamming Loss (Schapire & Singer, 2000); see the sketch after this list
  • mlc_accuracy, mlc_precision, mlc_recall – Example-based accuracy, precision, recall (Godbole & Sarawagi, 2004)
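
For reference, example-based Hamming loss is the fraction of misclassified example-label pairs, averaged over examples. A self-contained sketch (not the Orange implementation):

# sketch of example-based Hamming loss (Schapire & Singer, 2000)
def hamming_loss(true_sets, predicted_sets, n_labels):
    # true_sets, predicted_sets: lists of label sets, one per example
    total = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return total / float(n_labels * len(true_sets))

print hamming_loss([{"a", "b"}, {"b"}], [{"a"}, {"b", "c"}], 3)  # 2/6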

Widgets

  • OWBR – Widget for Binary Relevance Learner
  • OWLP – Widget for Label Powerset Classification
  • OWMLkNN – Widget for Multi-kNN Classification
  • OWBRkNN – Widget for BR-kNN Classification
  • OWTestLearner – Widget for Evaluation

File Format Extension

  • multi-label flag (label) in the tab file format, as described above

Plan for the future

  • add more classification methods for multi-label, such as PT1 to PT6
  • add feature extraction method
  • add ranking-based evaluation methods

How to use

Basically, multi-label classification and evaluation are used in nearly the same way as their single-label counterparts. The only difference is the type of input data.

Example for Classification

import Orange

data = Orange.data.Table("emotions.tab")

classifier = Orange.multilabel.BinaryRelevanceLearner(data)
for e in data:
    c, p = classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

powerset_classifier = Orange.multilabel.LabelPowersetLearner(data)
for e in data:
    c, p = powerset_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

mlknn_classifier = Orange.multilabel.MLkNNLearner(data, k=1)
for e in data:
    c, p = mlknn_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

brknn_classifier = Orange.multilabel.BRkNNLearner(data, k=1)
for e in data:
    c, p = brknn_classifier(e, Orange.classification.Classifier.GetBoth)
    print c, p

Example for Evaluation

import Orange

learners = [
    Orange.multilabel.BinaryRelevanceLearner(name="br"),
    Orange.multilabel.BinaryRelevanceLearner(name="br",
        base_learner=Orange.classification.knn.kNNLearner),
    Orange.multilabel.LabelPowersetLearner(name="lp"),
    Orange.multilabel.LabelPowersetLearner(name="lp",
        base_learner=Orange.classification.knn.kNNLearner),
    Orange.multilabel.MLkNNLearner(name="mlknn", k=5),
    Orange.multilabel.BRkNNLearner(name="brknn", k=5),
]

data = Orange.data.Table("emotions.tab")

res = Orange.evaluation.testing.cross_validation(learners, data, folds=2)
loss = Orange.evaluation.scoring.mlc_hamming_loss(res)
accuracy = Orange.evaluation.scoring.mlc_accuracy(res)
precision = Orange.evaluation.scoring.mlc_precision(res)
recall = Orange.evaluation.scoring.mlc_recall(res)
print 'loss=', loss
print 'accuracy=', accuracy
print 'precision=', precision
print 'recall=', recall

References

  • G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
  • E. Spyromitros, G. Tsoumakas, and I. Vlahavas. An empirical study of lazy multilabel classification algorithms. Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008), Springer, Syros, Greece, 2008.
  • M. Zhang and Z. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038-2048, 2007.
  • S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004), 2004.
  • R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.

GSoC Review: MF – Matrix Factorization Techniques for Data Mining

MF – Matrix Factorization Techniques for Data Mining is a Python scripting library which includes a number of published matrix factorization algorithms, initialization methods, quality and performance measures and facilitates the combination of these to produce new strategies. The library represents a unified and efficient interface to matrix factorization algorithms and methods.

MF works with numpy dense matrices and scipy sparse matrices (the latter where possible, to save space). The library supports multiple runs of the algorithms, which some quality measures require. By setting runtime-specific options, it is possible to track the residual error within one (or more) runs, or to track the fitted factorization model. Extensive documentation with working examples that demonstrate real applications, commonly used benchmark data and visualization methods is provided to help with the interpretation and comprehension of the results. An illustrative sketch of such runtime options is shown below.
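
For example, multiple runs with residual tracking might be requested along these lines (a hedged sketch: the option names n_run and track_error are assumptions based on the description above, not confirmed API):

# hedged sketch: multiple runs with residual tracking
# (the option names n_run / track_error are assumptions)
import mf
import numpy as np

V = np.random.rand(30, 20)  # toy dense target matrix
fit = mf.mf(V, method="nmf", rank=3, max_iter=50,
            n_run=10, track_error=True)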

Content of Current Release

Factorization Methods

  • BD – Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
  • BMF – Binary matrix factorization [Zhang2007]
  • ICM – Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
  • LFNMF – Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
  • LSNMF – Alternative nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
  • NMF – Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
  • NSNMF – Nonsmooth nonnegative matrix factorization [Montano2006]
  • PMF – Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
  • PSMF – Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
  • SNMF – Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
  • SNMNMF – Sparse network regularized multiple nonnegative matrix factorization [Zhang2011]

Initialization Methods

  • Random
  • Fixed
  • NNDSVD [Boutsidis2007]
  • Random C [Albright2006]
  • Random VCol [Albright2006]

Quality and Performance Measures

  • Distance
  • Residuals
  • Connectivity matrix
  • Consensus matrix
  • Entropy of the fitted NMF model [Park2007]
  • Dominant basis components computation
  • Explained variance
  • Feature score computation representing its specificity to basis vectors [Park2007]
  • Computation of most basis specific features for basis vectors [Park2007]
  • Purity [Park2007]
  • Residual sum of squares – can be used for rank estimate [Hutchins2008], [Frigyesi2008]
  • Sparseness [Hoyer2004]
  • Cophenetic correlation coefficient of consensus matrix – can be used for rank estimate [Brunet2004]
  • Dispersion [Park2007]
  • Factorization rank estimation
  • Selected matrix factorization method specific

Plans for Future

The general plan for future releases of the MF library is to make it easier to use for non-technical users, to increase library stability and to provide comprehensive visualization methods. On the algorithm side, the following additions are planned:

  • Extending Bayesian methods with variational BD and linearly constrained BD.
  • Adaptation of the PMF model to interval-valued matrices.
  • Nonnegative matrix approximation with a multiplicative iterative scheme.

Usage

# Import MF library entry point for factorization
import mf

from scipy.sparse import csr_matrix
from scipy import array
from numpy import dot
# We will try to factorize a sparse matrix. Construct the sparse matrix in CSR format.
V = csr_matrix((array([1, 2, 3, 4, 5, 6]), array([0, 2, 2, 0, 1, 2]), array([0, 2, 3, 6])), shape=(3, 3))
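# For reference, the dense form of V constructed above is:
#   [[1 0 2]
#    [0 0 3]
#    [4 5 6]]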

# Run the standard NMF algorithm with rank 4.
# The returned object is the fitted factorization model;
# through it, the user can access quality and performance measures.
fit = mf.mf(V, method="nmf", max_iter=30, rank=4, update='divergence', objective='div')

# Basis matrix. It is sparse, as input V was sparse as well.
W = fit.basis()
print "Basis matrix\n", W.todense()

# Mixture matrix. We print this tiny matrix in dense format.
H = fit.coef()
print "Coef\n", H.todense()

# Return the loss function according to Kullback-Leibler divergence. 
print "Distance Kullback-Leibler", fit.distance(metric = "kl")

# Compute generic set of measures to evaluate the quality of the factorization
sm = fit.summary()
# Print sparseness (Hoyer, 2004) of basis and mixture matrix
print "Sparseness W: %5.3f  H: %5.3f" % (sm['sparseness'][0], sm['sparseness'][1])
# Print actual number of iterations performed
print "Iterations", sm['n_iter']

# Print estimate of target matrix V
print "Estimate\n", dot(W.todense(), H.todense())

Examples

Examples with visualized results in bioinformatics, image processing, text analysis and recommendation systems are provided in the Examples section of the documentation.

Figure 1: Reordered consensus matrix generated for rank = 2 on the Leukemia data set.

Figure 2: Interpretation of NMF (divergence) basis vectors on the Medlars data set. By considering the highest weighted terms in a basis vector, a user can assign it a label or topic; here, the label liver might be attached to basis vector W1.

Figure 3: Basis images of LSNMF obtained after 500 iterations on original face images. The bases trained by LSNMF are additive but not spatially localized for representation of faces.

Orange GSoC: Multi-label Classification Implementation

Multi-label classification is one of the three Google Summer of Code 2011 projects for Orange. The main goal is to extend Orange to support multi-label data, including dataset support, the two basic families of multi-label classification methods (problem transformation methods and algorithm adaptation methods), evaluation measures, GUI support, documentation, testing, and so on.

My name is Wencan Luo and I come from China. I'm very happy to work with my mentor Matija. So far, we have finished a framework for multi-label support in Orange.

To support the multi-label data structure, a special value is added to the variables' attributes dictionary. In this way, we can tell whether an attribute is a class label without altering the old ExampleTable class.

Moreover, a transformation method for multi-label classification, named Binary Relevance, has been implemented. All the code extends the old Orange Python code and remains compatible with single-label classification methods.

In addition, the evaluator for multi-label classification has been implemented, based on the old single-label evaluator in the Orange.evaluation.testing and Orange.evaluation.scoring modules.

Finally, the widgets for the Binary Relevance method and the evaluator have been implemented.

Much work remains to be done:

  • one more transformation method
  • two algorithm adaptation methods
  • ranking-based evaluator
  • widgets to support the above methods
  • testing

Orange GSoC: Visualizations with Qt

Hello, my name is Miha Čančula and this summer I’m working on Orange as part of Google’s Summer of Code program, mentored by Miha Štajdohar. My task is to replace the current visualization framework based on Qwt with a custom library, depending only on Qt. This library will better support Orange’s very specific visualizations and will replace the unmaintained PyQwt.

I have a lot of experience with Qt and its graphics classes, both in C++ and Python, but I'm relatively new to Orange. As a physics student, especially now that I'm selecting a computational physics program, this is a great learning opportunity for me.

I think my work is progressing very well, because many visualizations already work with the new library with only minor modifications:

  • Scatterplot
  • Linear projections
  • Discretize
  • All the visualizations in VizRank and FreeViz dialogs

The library is written partially in C++, especially the performance-sensitive parts, and partially in Python. It uses the Qt Graphics View Framework, with several reimplemented or wrapped methods to preserve the old behavior and API. I have tried to keep the necessary modifications to the widgets themselves to a minimum, so the large majority of changes are in the OWGraph class, which serves as the base class for all graphs.

Graphs made with Qwt are not very flexible: they only support Cartesian axes. Orange's visualizations, on the other hand, often use custom axes and transformations. That's why I designed the new library with support for arbitrary axes, curves and other elements. All of these can be extended with classes written in Python, C++, or a combination thereof. The changes required in the visualizations I have already ported to the new OWGraph class were mostly simplifications, with very little new code added.

For example, zooming and selection are implemented entirely in the new OWGraph class, with the same function and attribute names as before, so the visualizations themselves didn't need any changes at all.

The new framework is also able to produce much nicer graphs. I haven’t had the time to customize the looks much, but it’s possible to set colors, line widths, point sizes and symbols from Python. There are still some settings that have no UI configuration, but I will focus on that after making sure that visualization widgets work with the new library.

Currently I’m trying to change as many widgets as possible to the new classes. As I said above, I only completed a few of them, but I believe the others won’t require too much work.

Orange GSoC: MF Techniques for Data Mining

I am one of three students who are working on GSoC projects for Orange this year. The objective of the project Matrix Factorization Techniques for Data Mining is to provide the Orange community with a unified and efficient interface to matrix factorization algorithms and methods.

For that purpose I have been developing a library which will include a number of published factorization algorithms and initialization methods and will facilitate combinations of these to produce new strategies. Extensive documentation with working examples that demonstrate applications, commonly used benchmark data and possibly some visualization methods will be provided to help with the interpretation and comprehension of the factorization results.

Main factorization techniques and their variations included in the library are:

  • family of nonnegative matrix factorization algorithms (NMF), including Brunet NMF, sparse NMF, non-smooth NMF, least-squares NMF, local Fisher NMF;
  • probabilistic factorization (PMF) and its sparse variant (PSMF);
  • Bayesian decomposition (BD);
  • iterated conditional modes (ICM) algorithm.

Different multiplicative update algorithms for NMF, minimizing either the least-squares error or the generalized Kullback-Leibler divergence, will be analyzed.
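
For reference, the standard multiplicative updates of Lee & Seung for the least-squares objective are (a well-known result, not specific to this library; ∘ and the fraction denote element-wise multiplication and division):

$$
W \leftarrow W \circ \frac{V H^{\top}}{W H H^{\top}}, \qquad
H \leftarrow H \circ \frac{W^{\top} V}{W^{\top} W H}
$$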

For those interested, more information with details about the algorithms is available at the project home page.

There is still much work to do, but I have been enjoying it and I am looking forward to continuing with the project.

Thanks to the Orange team and mentor prof. dr. Blaz Zupan for support and advice.

Accepted GSoC 2011 students announced

Accepted proposals/projects for Google Summer of Code 2011 have been announced. Three students will work on Orange this year.

Congrats to all accepted students. We are looking forward to working with you. And to all other students: please apply again next year. Your proposals were good, but we just could not accept everybody.