Random decisions behind your back

When Orange builds a decision tree, candidate attributes are evaluated and the best one is chosen. But what if two or more share first place? Most machine learning systems don’t care and simply take the first, which is unfair and, besides, has a strange side effect: the induced model and, consequently, its accuracy depend upon the order of attributes. Which they shouldn’t.

This is not an isolated problem. Another instance is a classifier that has to choose between two equally probable classes when there is no additional information (such as classification costs) to help make the prediction. Or selecting random reference examples when computing ReliefF. Or returning the mode of a distribution with two or more competing values…

The old solution was to make a random selection in such cases: take a random class (among the most probable ones, of course), a random attribute, random examples… Although theoretically correct, this comes at a price: the only way to ensure repeatability of experiments is to seed the global random generator, which is bad practice in component-based systems.

What Orange does now is more cunning. When choosing between n equally probable classes, for instance, Orange computes something like a hash value from the example being classified. Its remainder at division by n is then used to select the class. The class will thus be arbitrary, but always the same for the same example.
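The idea fits in a few lines of plain Python. This is just an illustrative sketch, not Orange’s actual implementation; the point is that a hash makes the choice arbitrary yet reproducible for the same example.

def tie_break(example, tied_classes):
    # Any deterministic hash of the example's values will do
    h = hash(tuple(example))
    # The remainder at division by n selects one of the tied classes
    return tied_classes[h % len(tied_classes)]

# The same example always gets the same class; different examples may differ
print tie_break([1.2, "red", 3], ["classA", "classB"])
print tie_break([0.7, "blue", 1], ["classA", "classB"])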

A similar trick is used elsewhere. To choose among equally scored attributes when building a tree, Orange simply divides the number of learning examples at that node by the number of candidate attributes, and the remainder is used again.
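Sketched in the same illustrative spirit (again, not Orange’s actual code), this tie-break needs nothing beyond the node’s example count:

def pick_attribute(tied_attributes, n_node_examples):
    # The remainder picks one of the equally scored candidates
    return tied_attributes[n_node_examples % len(tied_attributes)]

print pick_attribute(["age", "income"], 37)  # 37 % 2 == 1, so "income"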

When more random numbers are needed, for instance for selecting m random reference examples when computing ReliefF, the number of examples is used as the seed for a temporary random generator.
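A sketch of this last trick: seeding a temporary, local generator from the data size gives the same “random” sample on every run, without touching the global generator. (Illustrative only; Orange’s internals may differ in detail.)

import random

def reference_examples(examples, m):
    rng = random.Random(len(examples))  # temporary, data-dependent seed
    return rng.sample(examples, m)

print reference_examples(range(100), 5)  # the same five items on every run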

To conclude: Orange will sometimes make decisions that look random. The reason is that this is fairer than what most machine learning systems do, which is to pick the first (or the last) candidate. But whatever decision is taken, it will be the same if you run the program twice. The message is thus: be aware that this is happening, but don’t worry about it.

New in Orange: Partial least squares regression

Partial least squares regression is a regression technique that supports multiple response variables. PLS regression is very popular in areas such as bioinformatics and chemometrics, where the number of observations is usually smaller than the number of measured variables and where the predictor variables are multicollinear. In such situations, standard regression techniques usually fail. PLS regression is now available in Orange (see the documentation)!

You can use the PLS regression model on single-target or multi-target data sets. Simply load the data set multitarget-synthetic.tab and check that it contains three predictor variables and four responses using this code:

import Orange

# Load the data set and list its predictor and response variables
data = Orange.data.Table("multitarget-synthetic.tab")
print "Input variables:"
print data.domain.features
print "Response variables:"
print data.domain.class_vars


Input variables:
<FloatVariable 'X1', FloatVariable 'X2', FloatVariable 'X3'>
Response variables:
<FloatVariable 'Y1', FloatVariable 'Y2', FloatVariable 'Y3', FloatVariable 'Y4'>

As you can see, all variables in this data set are continuous. PLS regression is intended for such situations, although it can also handle discrete input variables (using 0-1 continuization). Discrete response variables are not yet supported.
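If the term is unfamiliar: 0-1 continuization replaces a discrete variable with one 0/1 indicator column per value. A plain-Python illustration of what happens to a single value:

values = ["red", "green", "blue"]

def continuize(value):
    # One 0/1 indicator column per possible value
    return [1.0 if value == v else 0.0 for v in values]

print continuize("green")  # gives [0.0, 1.0, 0.0]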

Let’s fit a PLS regression model to our data set.

# Build a learner and fit it to the data to obtain a classifier (model)
learner = Orange.multitarget.pls.PLSRegressionLearner()
classifier = learner(data)

The classifier can now be used to predict values of the four responses based on the three predictors. Let’s see how it manages this task on the first data instance.

# Compare actual and predicted responses on the first data instance
actual = data[0].get_classes()
predicted = classifier(data[0])

print "Actual", "Predicted"
for a, p in zip(actual, predicted):
    print "%6.3f %6.3f" % (a, p)


Actual Predicted
 0.490  0.613
 1.237  0.826
 1.808  1.084
 0.422  0.534

To test the usefulness of PLS as a multi-target method, let’s compare it to a single-target method, linear regression. We did this by comparing the root mean squared error (RMSE) of the predicted values for a single response variable. We constructed synthetic data sets of different sizes and performed the RMSE analysis using this script. The results can be seen in the following output:

    Training set sizes      5     10     20     50    100    200    500   1000
Linear (single-target) 0.5769 0.3128 0.2703 0.2529 0.2493 0.2446 0.2436 0.2442
    PLS (multi-target) 0.3663 0.2955 0.2623 0.2517 0.2487 0.2447 0.2441 0.2448

We can see that PLS regression outperforms linear regression when the number of training instances is low. Such situations (few instances compared to many variables) are quite common when analyzing data sets in bioinformatics. However, as the number of training instances grows, the advantage of PLS regression diminishes.
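For reference, a comparison of this kind can be set up in a few lines. The sketch below is not the script linked above; it assumes a generic single-target regression data set (here the bundled housing data) and Orange’s standard evaluation helpers:

import Orange

# A single-target regression data set; any such data set would do
data = Orange.data.Table("housing")

learners = [Orange.regression.linear.LinearRegressionLearner(),
            Orange.multitarget.pls.PLSRegressionLearner()]

# 10-fold cross-validation, scored by root mean squared error
res = Orange.evaluation.testing.cross_validation(learners, data, folds=10)
for name, rmse in zip(["Linear", "PLS"], Orange.evaluation.scoring.RMSE(res)):
    print "%6s %.4f" % (name, rmse)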