Python Script: Managing Data on the Fly

Python Script is this mysterious widget most people don’t know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer. And it’s time we unveil some of its functionalities with a simple example.

Example: Batch Transform the Data

There might be a time when you need to apply a function to all your attributes. Say you wish to log-transform their values, as is common with gene expression data. In theory, you could do this with Feature Constructor, where you would log-transform every attribute individually. Sounds laborious? It's because it is. Why else do we have computers, if not to reduce manual labor for certain tasks? Let's do it the fast way – with Python Script.

First, open the File widget and load geo-gds360.tab from Browse documentation data sets. This data set has 9485 features, so imagine having to transform each of them individually.

Instead, we will connect Python Script to File and use a simple script to apply the same transformation to all attributes.

import numpy as np
from Orange.data import Table

#log-transform all features at once
new_X = np.log(in_data.X)
#rebuild the table with the original domain, class values and metas
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)

This is really simple. Use in_data.X, which accesses all features in the data set, to transform the data with np.log (or any other numpy function). Then wrap the transformed values, together with the original domain, class values and metas, into a new Table and assign it to out_data. Voila, the transformed data is on the output. In a few lines we have instantly handled all 9485 features.
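One caveat: np.log returns -inf for zero values. If your data contains zeros, a common alternative is np.log1p, which computes log(1 + x) and maps zeros to zeros:

#log1p variant for data that contains zeros
new_X = np.log1p(in_data.X)
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)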

You can inspect the data before and after transformation in a Data Table widget.

Original data.
Log-transformed data.

This is it. Now we can do our standard analysis on the transformed data. Even better, we can save our script and use it in the Python Script widget any time we want.

For your convenience I have already added the Log Attributes Script, so you can download and use it instantly!

Have a more interesting example with Python Script? We’d love to hear about it!

Data Mining Course at Higher School of Economics, Moscow

Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.

It was a real pleasure working at HSE. The students were proactive, asked questions, and really challenged us to do our best.

One of the things we did was compute the minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. The Naive Bayes classifier, for example, returned the probability of a patient being sick (column Naive Bayes 1). For each threshold on these probabilities, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifier returns. Each mistake is associated with a cost. Now she wants to find out how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.

First, import all the libraries we will need:

import matplotlib.pyplot as plt
import numpy as np

from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation

Then load the heart disease data (and print a sample).

heart = Table("heart_disease")
print(heart[:5])

Now, train the classifiers and select the Naive Bayes probabilities of a patient being sick.

scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])

#take probabilities of class 1 (sick) for NaiveBayesLearner
p1 = scores.probabilities[0][:, 1]

#take actual class values
y = scores.actual

#cost of false positive (patient classified as sick when healthy)
fp_cost = 500

#cost of false negative (patient classified as healthy when sick)
fn_cost = 800

Initialize the counts for a threshold above 1, where we declare 0 patients sick.

#start with threshold above 1 (no one is declared sick)
fp = 0
fn = np.sum(y)

For each threshold, compute the cost associated with each type of mistake.

ps = []
costs = []

#lower the threshold one patient at a time, from the most to the least probably sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        fp += 1
    else:
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)

In the end, we get a list of probability thresholds and the associated costs. Now let us find the minimum cost and the corresponding probability of a patient being sick.

costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])

This means the threshold that minimizes our cost for the given classifier is 0.620655. Sara would send all patients with a probability of being sick higher than or equal to 0.620655 for further tests.
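If you are also curious about the cost itself at that threshold, it is simply the minimum of the costs array:

#total cost at the optimal threshold
print(costs.min())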

At the end, we can also plot how the cost changes with the probability threshold.

fig, ax = plt.subplots()
ax.plot(ps, costs)
ax.set_xlabel('Probability threshold')
ax.set_ylabel('Cost')

You can download the IPython Notebook here: Minimum Cost.

Learners in Python

We’ve already written about classifying instances in Python. However, it’s always nice to have a comprehensive list of classifiers and a step-by-step procedure at hand.

TRAINING THE CLASSIFIER

We start by importing the Orange module into Python and loading our data set.

>>> import Orange
>>> data = Orange.data.Table("titanic")

We are using the ‘titanic.tab’ data. You can load any data set you want, but it does have to have a categorical class variable (for numeric targets use regression; see the short aside below).
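As a quick aside, regression follows the same pattern. Here is a minimal sketch, assuming Orange's bundled housing data set, which has a numeric target:

>>> housing = Orange.data.Table("housing")
>>> learner = Orange.regression.LinearRegressionLearner()
>>> model = learner(housing)
>>> model(housing[0])

Now, back to classification: we want to train our classifier.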

>>> learner = Orange.classification.LogisticRegressionLearner()
>>> classifier = learner(data)
>>> classifier(data[0])

Python returns the index of the value, as usual.

array([0.])

To check what’s in the class variable we print:

>>> print("Name of the variable: ", data.domain.class_var.name)
>>> print("Class values: ", data.domain.class_var.values)
>>> print("Value of our instance: ", data.domain.class_var.values[0])

Name of the variable: survived
Class values: no, yes
Value of our instance: no

PREDICTIONS

If you want to get predictions for the entire data set, just give the classifier the entire data set.

>>> classifier(data)

array([0, 0, 0, ..., 1, 1, 1])

If we want to append predictions to the data table, we first use the classifier on the data, then create a new domain with an additional meta attribute, and finally form a new data table with the predictions appended:

predictions = classifier(data)

new_domain = Orange.data.Domain(data.domain.attributes, data.domain.class_vars, [data.domain.class_var])

table2 = Orange.data.Table(new_domain, data.X, data.Y, predictions.reshape(-1, 1))

We use .reshape(-1, 1) to turn the prediction vector into a single-column array, which is the shape meta attributes expect. Then we print out the data.

print(table2)

PARAMETERS

Want to use another classifier? The procedure is the same, simply use:

Orange.classification.<algorithm-name>()
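For instance, to swap in a classification tree, only the learner line changes:

>>> learner = Orange.classification.TreeLearner()
>>> classifier = learner(data)
>>> classifier(data[0])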

For most classifiers, you can set a whole range of parameters. Logistic Regression, for example, uses the following:

learner = Orange.classification.LogisticRegressionLearner(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, preprocessors=None)

To check the parameters for the classifier, use:

print(Orange.classification.SVMLearner())

PROBABILITIES

Another thing you can check with classifiers is the probabilities.

classifier(data[0], Orange.classification.Model.ValueProbs)

(array([ 0.]), array([[ 1.,  0.]]))

The first array is the value for your selected instance (data[0]), while the second array contains probabilities for class values (probability for ‘no’ is 1 and for ‘yes’ 0).
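You can get probabilities for the entire data set in the same way; Orange.classification.Model.Probs (a sibling of the ValueProbs constant used above) returns just the probabilities, one row per instance:

probabilities = classifier(data, Orange.classification.Model.Probs)
print(probabilities[:3])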

CLASSIFIERS

And because we care about you, here is the full list of classifier names:

LogisticRegressionLearner()
NaiveBayesLearner()
KNNLearner()
TreeLearner()
MajorityLearner()
RandomForestLearner()
SVMLearner()

For other learners, you can find all the parameters and descriptions in the documentation.

Classifying instances with Orange in Python

Last week we showed you how to create your own data table in Python shell. Now we’re going to take you a step further and show you how to easily classify data with Orange.

First we’re going to create a new data table with 10 fruits as our instances.

import Orange
from Orange.data import *

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

domain = Domain([color, calories, fiber], class_vars=fruit)

data = Table(domain, [
    ["green", 4, 1.2, "apple"],
    ["orange", 5, 1.1, "orange"],
    ["yellow", 4, 1.0, "peach"],
    ["orange", 4, 1.1, "orange"],
    ["yellow", 4, 1.1, "peach"],
    ["green", 5, 1.3, "apple"],
    ["green", 4, 1.3, "apple"],
    ["orange", 5, 1.0, "orange"],
    ["yellow", 4.5, 1.3, "peach"],
    ["green", 5, 1.0, "orange"]])

print(data)

Now we have to select a model for classification. Among the many learners in the Orange library, we decided to use the Tree Learner for this example. Since we’re dealing with fruits, we thought it only appropriate. 🙂

Let’s create a learning algorithm and use it to induce the classifier from the data.

tree_learner = Orange.classification.TreeLearner()
tree = tree_learner(data)

Now we can predict what variety a green fruit with 3.5 calories and 2 g of fiber is with the help of our model. To do this, simply call the model with a list of new data as the argument.

print(tree(["green", 3.5, 2]))

Python returns the index as a result:

1

To check the index, we can call class variable values with the corresponding index:

domain.class_var.values[1]

Final result:

"apple"

You can use your own data set to see how this model works for different data types. Let us know how it goes! 🙂

Creating a new data table in Orange through Python

IMPORT DATA

One of the first tasks in Orange data analysis is of course loading your data. If you are using Orange through Python, this is as easy as riding a bike:

import Orange
data = Orange.data.Table("iris")
print(data)

This will return a neat data table of the famous Iris data set in the console.
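Once loaded, the table behaves much like a list of data instances, and its domain tells you which variables it holds:

print(len(data))
print(data.domain)
print(data[0])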

CREATE YOUR OWN DATA TABLE

What if you want to create your own data table from scratch? Even this is surprisingly simple. First, import the Orange data library.

from Orange.data import *

Set all the attributes you wish to see in your data table. For discrete attributes, call DiscreteVariable and set the name and the possible values; for continuous variables, call ContinuousVariable and set only the attribute name.

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

Then set the domain for your data table. See how we set the class variable with class_vars?

domain = Domain([color, calories, fiber], class_vars=fruit)

Time to input your data!

data = Table(domain, [
    ["green", 4, 1.2, "apple"],
    ["orange", 5, 1.1, "orange"],
    ["yellow", 4, 1.0, "peach"]])

And now print what you have created!

print(data)

One final step:

data.save("fruit.tab")

Your data is now safely stored on your computer (in the current working directory)! Good job!
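To double-check, you can load the saved file back like any other data set (assuming it is in your working directory):

data2 = Table("fruit.tab")
print(data2)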

Computing joint entropy (in Python)

How I wrote a beautiful, general, and super fast joint entropy method (in Python).

def entropy(*X):
    return np.sum(-p * np.log2(p) if p > 0 else 0 for p in
        (np.mean(reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])))

I started with the method to compute the entropy of a single variable. Input is a numpy array with discrete values (either integers or strings).

import numpy as np

def entropy(X):
    probs = [np.mean(X == c) for c in set(X)]
    return np.sum(-p * np.log2(p) for p in probs)
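A quick sanity check: a fair coin should carry exactly one bit of entropy.

X = np.array([0, 1, 0, 1])
print(entropy(X))  #prints 1.0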

In my next version I extended it to compute the joint entropy of two variables:

def entropy(X, Y):
    probs = []
    for c1 in set(X):
        for c2 in set(Y):
            probs.append(np.mean(np.logical_and(X == c1, Y == c2)))

    return np.sum(-p * np.log2(p) for p in probs)

Now wait a minute, it looks like we have a recursion here. I couldn’t stop myself from writing an extended, general function to compute the joint entropy of n variables.

def entropy(*X, **kwargs):
    predictions = X[0]
    H = kwargs["H"] if "H" in kwargs else 0
    v = kwargs["v"] if "v" in kwargs else np.array([True] * len(predictions))

    for c in set(predictions):
        if len(X) > 1:
            H = entropy(*X[1:], v=np.logical_and(v, predictions == c), H=H)
        else:
            p = np.mean(np.logical_and(v, predictions == c))
            H += -p * np.log2(p) if p > 0 else 0
    return H

It was the ugliest recursive function I’ve ever written. I couldn’t stop coding; I was hooked. Besides, this method was slow as hell, and I needed a faster version for my research. I need my data tomorrow, not next month. I googled whether Python has something that would help me deal with the recursive part. I found this great method: itertools.product. It’s just what we need. It takes lists and returns the Cartesian product of their values. It’s the “nested for loops” in one function.
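For example:

import itertools

print(list(itertools.product([0, 1], ["a", "b"])))
#prints [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]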

def entropy(*X):
    n_instances = len(X[0])
    H = 0
    for classes in itertools.product(*[set(x) for x in X]):
        v = np.array([True] * n_instances)
        for predictions, c in zip(X, classes):
            v = np.logical_and(v, predictions == c)
        p = np.mean(v)
        H += -p * np.log2(p) if p > 0 else 0
    return H

No recursion, but still slow. It’s time to rewrite the loops in a more Pythonic style. As a sharp eye has already noticed, the second for loop with np.logical_and inside is a perfect fit for the reduce function.

from functools import reduce

def entropy(*X):
    H = 0
    for classes in itertools.product(*[set(x) for x in X]):
        v = reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes)))
        p = np.mean(v)
        H += -p * np.log2(p) if p > 0 else 0
    return H

Now we just have to fold the remaining loop into a generator expression, and we have a beautiful, general, and super fast joint entropy method.

def entropy(*X):
    return np.sum(-p * np.log2(p) if p > 0 else 0 for p in
        (np.mean(reduce(np.logical_and, (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])))
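To see it in action (with numpy, itertools, and reduce imported as above): two independent fair coins carry one bit of entropy each, and two bits jointly.

X = np.array([0, 0, 1, 1])
Y = np.array([0, 1, 0, 1])
print(entropy(X))     #prints 1.0
print(entropy(X, Y))  #prints 2.0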

Debian packages support multiple Python versions now

We have created Debian packages for multiple Python versions. This means that they now work with both Python 2.6 and 2.7 out of the box, or, if you compile them manually, with any supported version installed on your (Debian-based) system.

Practically, this means that you can now install them without manual compiling on current Debian and Ubuntu systems. Give it a try: add our Debian package repository, then apt-get install python-orange for the Orange library/modules and/or orange-canvas for the GUI. If you install the latter package, type orange in the terminal and the Orange canvas will pop up.