k-Means & Silhouette Score

k-Means is one of the most popular unsupervised learning algorithms for finding interesting groups in our data. It can be useful in customer segmentation, finding gene families, determining document types, improving human resource management and so on.

But… have you ever wondered how k-means works? In the following three videos we explain how to construct a data analysis workflow with k-means, how k-means works, how to find a good value of k, and how the silhouette score can help us find inliers and outliers.

 

#1 Constructing workflow with k-means

#2 How k-means works [interactive visualization]

#3 How silhouette score works and why it is useful
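
If you would rather read code than watch the videos, here is a minimal scripted sketch of the same ideas. It uses scikit-learn on a synthetic data set (our assumption, purely for illustration; the videos themselves build the workflow with Orange widgets):

# A minimal sketch: k-means plus silhouette scores on synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Try several values of k and keep the one with the best average silhouette.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print("silhouette by k:", scores)
print("best k:", best_k)

# Per-point silhouettes: values close to 1 mark well-clustered inliers,
# values near zero or below flag borderline points and potential outliers.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)
print("most outlier-like points:", np.argsort(s)[:5])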

10 Tips and Tricks for Using Orange

TIP #1: Follow tutorials and example workflows to get started.

It’s difficult to start using new software, especially for a total novice in data mining. Where does one even begin? For this exact reason we’ve prepared Getting Started With Orange – YouTube tutorials for complete beginners. Example workflows, on the other hand, can be accessed via Help – Examples.

 

TIP #2: Make use of Orange documentation.

You can access it in three ways:

  1. Press F1 when a widget is selected. This will open its help screen.
  2. Select Widget – Help while a widget is selected. It works the same as above.
  3. Visit the online documentation.

 

TIP #3: Embed your help screen.

Drag and drop the help screen to the side of your Orange canvas and it will become embedded in the canvas. You can also make it narrower, allowing for a full-size analysis while exploring the docs.


 

TIP #4: Use right-click.

Right-click on the canvas and a widget menu will appear. Start typing the name of the widget you’re looking for and press Enter when it becomes the top suggestion. This will place the widget onto the canvas immediately. You can also navigate the menu with the up and down arrow keys.

 

TIP #5: Turn off channel names.

Sometimes it is annoying to see channel names above widget links. If you’re already comfortable using Orange, you can turn them off in Options – Settings by unchecking ‘Show channel names between widgets’.

 

TIP #6: Hide control pane.

Once you’ve set the parameters, you’ll probably want to focus just on the visualizations. There’s a simple way to do this in Orange. Click on the splitter between the control pane and the visualization pane – you should see a hand cursor appear. Click, and the control pane gets tucked away neatly. To make it reappear, click the splitter again.


 

TIP #7: Label your data.

So you’ve plotted your data, but have no idea what you’re seeing. Use annotation! In some widgets you will see a drop-down menu called Annotation, while in others it is called Label. This will mark your data points with suitable labels, making your MDS plots and Scatter Plots much more informative. Scatter Plot also lets you label only the selected points for better clarity.

 

TIP #8: Find your plot.

Scrolled around and lost the plot? Zoomed in too much? Click ‘Reset zoom’ and the visualization will jump snugly back into the visualization pane. This comes in handy when browsing subsets and trying to see the bigger picture every now and then.


TIP #9: Reset widget settings.

Orange is geared to remember your last settings, which helps you analyse data quickly. However, sometimes you need to start anew. Go to Options – Reset widget settings… and restart Orange. This will return Orange to its original state.

 

TIP #10: Use Educational add-on.

To learn how some algorithms work, use the Orange3-Educational add-on. It contains four widgets that let you look behind the scenes of some well-known algorithms. And since they’re interactive, they’re also a lot of fun!


Getting Started Series: Part Two

We’ve recently published two more videos in our Getting Started with Orange series. The series is intended to introduce beginners to Orange and teach them how to use its components.

 

The first video explains how to do hierarchical clustering and select interesting subsets directly in Orange, while the second video introduces classification trees and predictive modelling.

The seventh video in the series will address how to score classification and regression models with different evaluation methods. The fruits and vegetables data set can be found here.
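
Until that video is out, here is a rough scripted sketch of what model scoring with cross-validation looks like. It uses scikit-learn and a bundled toy data set rather than the Orange widgets and the fruits and vegetables data from the series (an assumption purely for illustration):

# Score two classifiers with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
models = [("classification tree", DecisionTreeClassifier(random_state=0)),
          ("logistic regression", LogisticRegression(max_iter=1000))]
for name, model in models:
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("%s: mean accuracy %.3f" % (name, acc.mean()))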

 

If you have an idea what you’d like to see in the upcoming videos, let us know!

Data Fusion Tutorial at the [BC]^2

We are excited to host a three-hour tutorial on data fusion at the Basel Computational Biology Conference. To this end we have prepared a series of short lecture notes that accompany the recently developed Data Fusion Add-on for Orange.


The tutorial is designed for data mining researchers and molecular biologists with an interest in large-scale data integration. In the tutorial we focus on collective latent factor models, a popular class of approaches for data fusion, and demonstrate their effectiveness on several hands-on case studies from recommendation systems and molecular biology.
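
To give a flavour of the idea (a toy sketch only, not the algorithm implemented in the add-on), collective latent factor models describe several related data matrices through shared low-dimensional factors. With two relation matrices over the same set of objects, a bare-bones NumPy version could look like this:

# Toy collective latent factor model: two relation matrices R1 and R2
# describe the same objects (rows) and share the latent factors U.
# Fitted with a few rounds of alternating least squares.
import numpy as np

rng = np.random.RandomState(0)
n_objects, n_feat1, n_feat2, k = 50, 30, 20, 5

# synthetic data generated from a shared latent structure
U_true = rng.rand(n_objects, k)
R1 = U_true @ rng.rand(k, n_feat1)
R2 = U_true @ rng.rand(k, n_feat2)

U = rng.rand(n_objects, k)
ridge = 0.1 * np.eye(k)  # small ridge term keeps the solves stable

for _ in range(50):
    # update the per-relation factors given the shared U
    V1 = np.linalg.solve(U.T @ U + ridge, U.T @ R1).T
    V2 = np.linalg.solve(U.T @ U + ridge, U.T @ R2).T
    # update the shared factors using both relations at once
    A = V1.T @ V1 + V2.T @ V2 + ridge
    B = R1 @ V1 + R2 @ V2
    U = np.linalg.solve(A, B.T).T

err = np.linalg.norm(R1 - U @ V1.T) + np.linalg.norm(R2 - U @ V2.T)
print("reconstruction error:", round(err, 3))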

This is a high-risk event (for us, the lecturers, that is). Ok, no bricks will probably fall down, but this is the first time we are showing Orange’s data fusion add-on. And not just showing it: part of the tutorial is a hands-on session.

We would like to acknowledge Biolab members for pushing the widgets through the development pipeline under extreme time constraints. Special thanks to Anze, Ales, Jernej, Andrej, Marko, Aleksandar and all other members of the lab.

This post was contributed by Marinka and Blaz.

Writing Orange Add-ons

We have officially supported add-ons since Orange 2.6. You should start by checking the list of available add-ons. We pull it automatically from PyPI, which is our preferred distribution channel. Try to install an add-on by either:

  • typing “pip install <add-on name>” in the terminal, or
  • using the Orange Canvas GUI: select “Options / Add-ons…” in the menu.

Everything should just work. Writing add-ons is as easy as writing your own Orange Widgets or Orange Scripts. Just follow this tutorial and you will have your brand-new Orange add-on on PyPI in no time (an hour at most).
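
The Orange-specific bits (widget discovery, entry points and such) are spelled out in the tutorial; the packaging itself is plain setuptools. A stripped-down, hypothetical setup.py might look roughly like this (the name and metadata below are made up, so follow the tutorial for the exact values Orange expects):

# Hypothetical, minimal setup.py for an add-on; see the tutorial for the
# Orange-specific entry points and metadata that make widgets discoverable.
from setuptools import setup, find_packages

setup(
    name="Orange-ExampleAddon",   # made-up name; choose your own
    version="0.1.0",
    description="An example Orange add-on",
    packages=find_packages(),
    install_requires=["Orange"],  # depend on Orange itself
)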


New scripting tutorial

Orange just got a new, completely rewritten scripting tutorial. The tutorial uses the Orange class hierarchy introduced in version 2.5 and is meant as a gentle introduction to Orange scripting. It includes many examples, from really simple ones to more complex ones. To give you a hint about the latter, here is the code for a learner with feature subset selection from the tutorial:

import Orange

class SmallLearner(Orange.classification.PyLearner):
    """A learner that trains the base learner on the m most informative features."""
    def __init__(self, base_learner=Orange.classification.bayes.NaiveLearner,
                 name='small', m=5):
        self.name = name
        self.m = m
        self.base_learner = base_learner

    def __call__(self, data, weight=None):
        # score every feature with information gain
        gain = Orange.feature.scoring.InfoGain()
        m = min(self.m, len(data.domain.features))
        # keep the m best-scored features
        best = [f for _, f in sorted((gain(x, data), x)
                                     for x in data.domain.features)[-m:]]
        # new domain with the selected features plus the class variable
        domain = Orange.data.Domain(best + [data.domain.class_var])
        # train the base learner on the data projected onto the new domain
        model = self.base_learner(Orange.data.Table(domain, data), weight)
        return Orange.classification.PyClassifier(classifier=model, name=self.name)
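
A quick usage sketch, assuming Orange 2.x under Python 2, the bundled voting data set, and that PyClassifier follows the usual classifier calling convention:

data = Orange.data.Table("voting")
learner = SmallLearner(m=3)
classifier = learner(data)   # train on the reduced feature set
print classifier(data[0])    # predicted class of the first instance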

The tutorial was first written for Python 2.3. Since then, Python and Orange have changed a lot. And so have I. Most of the for loops have become one-liners, list and dictionary comprehensions have become a must, and many new and great libraries have emerged. The (boring) tutorial code that used to read

c = [0] * len(data.domain.classVar.values)
for e in data:
    c[int(e.getclass())] += 1
print "Instances: ", len(data), "total",
r = [0.] * len(c)
for i in range(len(c)):
    r[i] = c[i] * 100. / len(data)
for i in range(len(data.domain.classVar.values)):
    print ", %d(%4.1f%s) with class %s" % 
        (c[i], r[i], '%', data.domain.classVar.values[i]),
print

is now replaced with

from collections import Counter
print Counter(str(d.get_class()) for d in data)

Ok, pretty printing is missing, but that, if it does not fit in the same line, can be done in another one.

For now, the tutorial focuses on data input and output, classification and regression. We plan to add other sections, but you can also give us a hint if there are any topics you would like to see included.