Learners in Python

We’ve already written about classifying instances in Python. However, it’s always nice to have a comprehensive list of classifiers and a step-by-step procedure at hand.

 

TRAINING THE CLASSIFIER

We start by importing the Orange module into Python and loading our data set.

>>> import Orange
>>> data = Orange.data.Table("titanic")

We are using the 'titanic.tab' data set. You can load any data set you want, but it has to have a categorical class variable (for numeric targets, use regression). A quick way to verify this is shown below.
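
To make sure the target really is categorical, you can check the class variable directly (a minimal check; is_discrete is a standard property of Orange variables):

>>> data.domain.class_var.is_discrete
True

Now we want to train our classifier.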

>>> learner = Orange.classification.LogisticRegressionLearner()
>>> classifier = learner(data)
>>> classifier(data[0])

Python returns the index of the value, as usual.

array([ 0.])

To check what's in the class variable, we print:

>>> print("Name of the variable: ", data.domain.class_var.name)
>>> print("Class values: ", data.domain.class_var.values)
>>> print("Value of our instance: ", data.domain.class_var.values[0])

Name of the variable: survived
Class values: no, yes
Value of our instance: no

 

PREDICTIONS

If you want to get predictions for the entire data set, just give the classifier the entire data set.

>>> classifier(data)

array([0, 0, 0, ..., 1, 1, 1])
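
The array holds value indices. If you would rather see the labels, you can map the indices back through class_var.values (a small sketch using the class values printed above):

>>> [data.domain.class_var.values[int(c)] for c in classifier(data)]

This returns an ordinary Python list of labels such as ['no', 'no', ..., 'yes'].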

If we want to append predictions to the data table, we first use the classifier on the data, then create a new domain with an additional meta attribute, and finally form a new data table with the appended predictions:

predictions = classifier(data)

new_domain = Orange.data.Domain(data.domain.attributes, data.domain.class_vars, [data.domain.class_var])

table2 = Orange.data.Table(new_domain, data.X, data.Y, predictions.reshape(-1, 1))

We use .reshape(-1, 1) to turn the one-dimensional vector of predictions into a column array, which is the shape Table expects for meta attributes.
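
For example (shapes shown for the titanic data set, which has 2201 rows):

predictions.shape                  # (2201,)
predictions.reshape(-1, 1).shape   # (2201, 1), a single column

Then we print out the data: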

print(table2)

 

PARAMETERS

Want to use another classifier? The procedure is the same; simply use:

Orange.classification.<algorithm-name>()

For most classifiers, you can set a whole range of parameters. Logistic Regression, for example, uses the following:

learner = Orange.classification.LogisticRegressionLearner(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, preprocessors=None)
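
For example, you can strengthen the regularization by lowering C (a short sketch; the parameters follow the scikit-learn conventions that Orange wraps):

>>> learner = Orange.classification.LogisticRegressionLearner(C=0.1, penalty='l2')
>>> classifier = learner(data)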

To check the parameters for the classifier, use:

print(Orange.classification.SVMLearner())

 

PROBABILITIES

Another thing you can get from classifiers is the class probabilities.

>>> classifier(data[0], Orange.classification.Model.ValueProbs)

(array([ 0.]), array([[ 1.,  0.]]))

The first array is the value for your selected instance (data[0]), while the second array contains the probabilities for the class values (the probability of 'no' is 1 and of 'yes' is 0).
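
If you need only the probabilities, you can ask for them directly (a small sketch; Model.Probs is the companion constant that skips the predicted values):

>>> classifier(data[0], Orange.classification.Model.Probs)
array([[ 1.,  0.]])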

 

CLASSIFIERS

And because we care about you, here's a handy list of classifier names:

LogisticRegressionLearner()
NaiveBayesLearner()
KNNLearner()
TreeLearner()
MajorityLearner()
RandomForestLearner()
SVMLearner()

 

For other learners, you can find all the parameters and descriptions in the documentation.

 

Classifying instances with Orange in Python

Last week we showed you how to create your own data table in the Python shell. Now we're going to take you a step further and show you how to easily classify data with Orange.

First we’re going to create a new data table with 10 fruits as our instances.

import Orange
from Orange.data import *

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

domain = Domain([color, calories, fiber], class_vars=fruit)

data = Table(domain, [
    ["green", 4, 1.2, "apple"],
    ["orange", 5, 1.1, "orange"],
    ["yellow", 4, 1.0, "peach"],
    ["orange", 4, 1.1, "orange"],
    ["yellow", 4, 1.1, "peach"],
    ["green", 5, 1.3, "apple"],
    ["green", 4, 1.3, "apple"],
    ["orange", 5, 1.0, "orange"],
    ["yellow", 4.5, 1.3, "peach"],
    ["green", 5, 1.0, "orange"]])

print(data)

Now we have to select a model for classification. Among the many learners in the Orange library, we decided to use the Tree Learner for this example. Since we're dealing with fruits, we thought it only appropriate. 🙂

Let’s create a learning algorithm and use it to induce the classifier from the data.

tree_learner = Orange.classification.TreeLearner()
tree = tree_learner(data)

Now we can predict what variety a green fruit with 3.5 calories and 2 g of fiber is with the help of our model. To do this, simply call the model with a list describing the new instance.

print(tree(["green", 3.5, 2]))

Python returns the index of the predicted value as a result:

1

To check the index, we can call class variable values with the corresponding index:

domain.class_var.values[1]

Final result:

"apple"

You can use your own data set to see how this model works for different data types. Let us know how it goes! 🙂

Creating a new data table in Orange through Python

IMPORT DATA

 

One of the first tasks in Orange data analysis is of course loading your data. If you are using Orange through Python, this is as easy as riding a bike:

import Orange
data = Orange.data.Table("iris")
print(data)

This will return a neat data table of the famous Iris data set in the console.

 

CREATE YOUR OWN DATA TABLE

 

What if you want to create your own data table from scratch? Even this is surprisingly simple. First, import the Orange data library.

from Orange.data import *

 

Set all the attributes you wish to see in your data table. For discrete attributes call DiscreteVariable and set the name and the possible values, while for a continuous variable call ContinuousVariable and set only the attribute name.

color = DiscreteVariable("color", values=["orange", "green", "yellow"])

calories = ContinuousVariable("calories")

fiber = ContinuousVariable("fiber")

fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

 

Then set the domain for your data table. See how we set the class variable with class_vars?

domain = Domain([color, calories, fiber], class_vars=fruit)

 

Time to input your data!

data = Table(domain, [

["green", 4, 1.2, "apple"],

["orange", 5, 1.1, "orange"],

["yellow", 4, 1.0, "peach"]])

 

And now print what you have created!

print(data)

 

One final step:

data.save("fruit.tab")

 

Your data is now safely stored on your computer (in the current working directory)! Good job!
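
To double-check, you can load the file straight back (a quick sanity check; Table accepts a file name):

data2 = Table("fruit.tab")
print(data2)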

Computing joint entropy (in Python)

How I wrote a beautiful, general, and super fast joint entropy method (in Python).

def entropy(*X):
    return np.sum(-p * np.log2(p) if p > 0 else 0 for p in
        (np.mean(reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])))

I started with a method that computes the entropy of a single variable. The input is a numpy array with discrete values (either integers or strings).

import numpy as np

def entropy(X):
    probs = [np.mean(X == c) for c in set(X)]
    return np.sum(-p * np.log2(p) for p in probs)
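
A quick check on a toy vector (two equally likely values should give exactly one bit):

X = np.array([0, 0, 1, 1])
print(entropy(X))  # 1.0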

In my next version I extended it to compute the joint entropy of two variables:

def entropy(X, Y):
    probs = []
    for c1 in set(X):
        for c2 in set(Y):
            probs.append(np.mean(np.logical_and(X == c1, Y == c2)))

    return np.sum(-p * np.log2(p) for p in probs)
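
On two independent coin-flip-like variables, the joint entropy is the sum of the individual entropies, two bits here, since all four combinations are equally likely:

X = np.array([0, 0, 1, 1])
Y = np.array([0, 1, 0, 1])
print(entropy(X, Y))  # 2.0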

Now wait a minute, it looks like we have a recursion here. I couldn't stop myself from writing an extended, general function to compute the joint entropy of n variables.

def entropy(*X, **kwargs):
    predictions = X[0]
    H = kwargs["H"] if "H" in kwargs else 0
    v = kwargs["v"] if "v" in kwargs else np.array([True] * len(predictions))

    for c in set(predictions):
        if len(X) > 1:
            H = entropy(*X[1:], v=np.logical_and(v, predictions == c), H=H)
        else:
            p = np.mean(np.logical_and(v, predictions == c))
            H += -p * np.log2(p) if p > 0 else 0
    return H

It was the ugliest recursive function I've ever written. I couldn't stop coding; I was hooked. Besides, this method was slow as hell, and I needed a faster version for my research. I needed my data tomorrow, not next month. I googled whether Python has something that would help me deal with the recursive part. I found this great method: itertools.product. It's just what we need: it takes lists and returns the Cartesian product of their values. It's the "nested for loops" in one function.

def entropy(*X):
    n_instances = len(X[0])
    H = 0
    for classes in itertools.product(*[set(x) for x in X]):
        v = np.array([True] * n_instances)
        for predictions, c in zip(X, classes):
            v = np.logical_and(v, predictions == c)
        p = np.mean(v)
        H += -p * np.log2(p) if p > 0 else 0
    return H

No recursion, but still slow. It's time to rewrite the loops in a more Pythonic style. As a sharp eye will have noticed, the inner for loop with np.logical_and inside is a perfect fit for the reduce function.

def entropy(*X):
    H = 0
    for classes in itertools.product(*[set(x) for x in X]):
        v = reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes)))
        p = np.mean(v)
        H += -p * np.log2(p) if p > 0 else 0
    return H

Now we have to fold just one more loop into a generator expression, and we have a beautiful, general, and super fast joint entropy method.

import itertools
from functools import reduce
import numpy as np

def entropy(*X):
    return np.sum(-p * np.log2(p) if p > 0 else 0 for p in
        (np.mean(reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])))
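
And it really is general. A quick sanity check with one, two, and three variables (in the three-variable case all four observed combinations are equally likely, so we again get two bits):

X = np.array([0, 0, 1, 1])
Y = np.array([0, 1, 0, 1])
Z = np.array([0, 0, 0, 1])
print(entropy(X))        # 1.0
print(entropy(X, Y))     # 2.0
print(entropy(X, Y, Z))  # 2.0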

Debian packages support multiple Python versions now

We have created Debian packages for multiple Python versions. This means they now work with both Python 2.6 and 2.7 out of the box, or, if you compile them manually, with any (supported) version installed on your (Debian-based) system.

Practically, this means that you can now install them without manual compiling on current Debian and Ubuntu systems. Give it a try: add our Debian package repository, then apt-get install python-orange for the Orange library/modules and/or orange-canvas for the GUI. If you install the latter package, type orange in the terminal and the Orange canvas will pop up.