Making Predictions

One of the cool things about being a data scientist is being able to predict. That is, predict before we know the actual outcome. I am not talking about verifying your favorite classification algorithm here, and I am not talking about cross-validation or classification accuracies or AUC or anything like that. I am talking about the good old prediction. This is where our very own Predictions widget comes to help.

Predictions workflow.

 

We will be exploring the Iris data set again, but we’re going to add a little twist to it. Since we’ve worked so much with it already, I’m sure you know all about this data. But now we have three new flowers in the office, and of course there’s no label attached to tell us what species of Iris they are. [sigh….] Obviously, we will be measuring petals and sepals and comparing the results with our data.

Our new data on three flowers. We used Google Sheets to enter the data, then copied the shareable link and pasted it into the File widget.

 

But surely you don’t want to go through all 150 flowers to properly match the three new Irises? So instead, let’s first train a model on the existing data set. We connect the File widget to the chosen classifier (we went with Classification Tree this time) and feed the trained model into Predictions. Now we write down the measurements for our new flowers in Google Sheets (just like above), load them into Orange with a new File widget and feed the fresh data into Predictions. We can observe the predicted class directly in the widget itself.

Predictions made by classification tree.

 

In the left part of the visualization we have the input data set (our measurements) and in the right part the predictions made with the classification tree. By default you see probabilities for all three class values and the predicted class. You can of course use other classifiers as well. It would probably make sense to first evaluate classifiers on the existing data set, find the best one for your data and then use it on the new data.

 

Learners in Python

We’ve already written about classifying instances in Python. However, it’s always nice to have a comprehensive list of classifiers and a step-by-step procedure at hand.

 

TRAINING THE CLASSIFIER

We start by simply importing the Orange module into Python and loading our data set.

>>> import Orange
>>> data = Orange.data.Table("titanic")

We are using ‘titanic.tab’ data. You can load any data set you want, but it does have to have a categorical class variable (for numeric targets use regression). Now we want to train our classifier.

>>> learner = Orange.classification.LogisticRegressionLearner()
>>> classifier = learner(data)
>>> classifier(data[0])

Python returns the index of the value, as usual.

array([0.])

To check what’s in the class variable we print:

>>> print("Name of the variable: ", data.domain.class_var.name)
>>> print("Class values: ", data.domain.class_var.values)
>>> print("Value of our instance: ", data.domain.class_var.values[0])

Name of the variable: survived
Class values: no, yes
Value of our instance: no
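
The lookup above hard-codes index 0 because that is what the classifier returned for data[0]. Here is a minimal sketch of doing the same lookup from the prediction itself. To keep it runnable without Orange, class_values and prediction are stand-ins (assumptions, not Orange objects) for data.domain.class_var.values and the result of classifier(data[0]).

```python
# Stand-ins for the Orange objects used above (assumptions for illustration):
# class_values mimics data.domain.class_var.values,
# prediction mimics the array returned by classifier(data[0]).
class_values = ["no", "yes"]
prediction = [0.0]

# The classifier returns the index as a float, so cast it to int before indexing.
predicted_index = int(prediction[0])
predicted_label = class_values[predicted_index]

print("Predicted class:", predicted_label)
```

With real Orange objects the same two lines should apply, as long as the prediction is an index into the class values.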

 

PREDICTIONS

If you want to get predictions for the entire data set, just give the classifier the entire data set.

>>> classifier(data)

array([0, 0, 0, ..., 1, 1, 1])

If we want to append predictions to the data table, we first use the classifier on the data, then create a new domain with an additional meta attribute and finally form a new data table with the appended predictions:

# predicted class indices, one per data instance
predictions = classifier(data)

# same attributes and class as before, plus one meta attribute for the predictions
new_domain = Orange.data.Domain(data.domain.attributes, data.domain.class_vars, [data.domain.class_var])

table2 = Orange.data.Table(new_domain, data.X, data.Y, predictions.reshape(-1, 1))

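We use .reshape(-1, 1) to turn the one-dimensional vector of predictions into a two-dimensional array with a single column, which is the shape needed for the meta attribute. Then we print out the data.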
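
To see what .reshape(-1, 1) does to a vector of predictions, here is a small NumPy sketch. The prediction values are made up for illustration, and Orange is not needed to run it.

```python
import numpy as np

# Made-up predictions standing in for the output of classifier(data).
predictions = np.array([0.0, 0.0, 1.0, 1.0])

# A one-dimensional vector has shape (n,).
print(predictions.shape)

# reshape(-1, 1) keeps all n values but arranges them as n rows and 1 column;
# the -1 asks NumPy to infer the number of rows.
column = predictions.reshape(-1, 1)
print(column.shape)
```

The values themselves are unchanged. Only the layout differs, so each prediction lines up with one row of the table.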

print(table2)

 

PARAMETERS

Want to use another classifier? The procedure is the same, simply use:

Orange.classification.<algorithm-name>()

For most classifiers, you can set a whole range of parameters. Logistic Regression, for example, uses the following:

learner = Orange.classification.LogisticRegressionLearner(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, preprocessors=None)

To check the parameters for the classifier, use:

print(Orange.classification.SVMLearner())

 

PROBABILITIES

Another thing you can check with classifiers is the class probabilities.

>>> classifier(data[0], Orange.classification.Model.ValueProbs)

(array([ 0.]), array([[ 1.,  0.]]))

The first array is the value for your selected instance (data[0]), while the second array contains probabilities for class values (probability for ‘no’ is 1 and for ‘yes’ 0).
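
To make the relation between the two arrays concrete, here is a minimal sketch that pairs each class value with its probability. It does not call Orange. The names value, probabilities and class_values are stand-ins (assumptions) that copy the output shown above and the class values of the titanic data.

```python
# Stand-ins mirroring the output shown above (assumptions for illustration):
# value mimics the first array, probabilities the second,
# class_values mimics data.domain.class_var.values.
value = [0.0]
probabilities = [[1.0, 0.0]]
class_values = ["no", "yes"]

# Pair each class value with its probability for the selected instance.
by_class = dict(zip(class_values, probabilities[0]))
print(by_class)

# The class with the highest probability; here it matches the predicted value.
most_likely = max(by_class, key=by_class.get)
print(most_likely, class_values[int(value[0])])
```

For this instance the most probable class and the predicted value agree, which is what we would normally expect from a probabilistic classifier.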

 

CLASSIFIERS

And because we care about you, we’re giving you here a full list of classifier names:

LogisticRegressionLearner()

NaiveBayesLearner()

KNNLearner()

TreeLearner()

MajorityLearner()

RandomForestLearner()

SVMLearner()

 

For other learners, you can find all the parameters and descriptions in the documentation.

 

Orange team wins JRS 2012 Data Mining Competition

Led by Jure Žbontar, the team from the University of Ljubljana wins over 126 other entrants in an international competition in predictive data analytics.

Jure’s team consisted of several Orange developers and computer science students: Miha Zidar, Blaž Zupan, Gregor Majcen, Marinka Žitnik and Matic Potočnik. To win, the team had to predict topics for 10,000 MedLine documents that were represented with over 25,000 algorithmically derived numerical features. They were given a training set of another 10,000 documents in the same representation, each labeled with a set of topics. The task was to use the training set to develop a model that predicts labels for documents in the test set. A particular challenge was guessing the right number of topics to associate with each document, as these, at least in the training set, varied from one to a dozen.

JRS 2012 is just one in a series of competitions recently organized on servers such as TunedIT and Kaggle. The prize for winning was $1000 and a trip to the Joint Rough Set Symposium in Chengdu, China, to present the winning strategy and the data mining techniques developed.

JRS-2012 Leaderboard