Visualizing Gradient Descent

This is a guest blog from the Google Summer of Code project.

 

Gradient Descent was implemented as a part of my Google Summer of Code project and is available in the Orange3-Educational add-on. It simulates gradient descent for either logistic or linear regression, depending on the type of the input data. Gradient descent is an iterative approach to optimizing model parameters so that they minimize the cost function. In machine learning, the cost function corresponds to the prediction error when the model is used on the training data set.
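To make the procedure concrete, here is a minimal numpy sketch of gradient descent for simple linear regression with two parameters, an intercept and a slope. The toy data, learning rate and number of steps are made up for illustration; this is not the widget's implementation.

import numpy as np

# Toy data: y depends roughly linearly on x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

intercept, slope = 0.0, 0.0      # starting parameter values
learning_rate = 0.05

for step in range(200):
    error = (intercept + slope * x) - y
    # Gradient of the mean squared error cost with respect to each parameter.
    grad_intercept = 2 * error.mean()
    grad_slope = 2 * (error * x).mean()
    # Move both parameters a small step against the gradient.
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

print(intercept, slope)          # approaches the least-squares fit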

The Gradient Descent widget takes data on its input and outputs the model and its coefficients.

gradient-descent-flow

The widget displays the value of the cost function given two parameters of the model. For linear regression, we consider a single feature from the training set, with the parameters being the intercept and the slope. For logistic regression, the widget considers two features and their associated multiplicative parameters, setting the intercept to zero. The screenshot below shows gradient descent on the Iris data set, where we take petal length and sepal width on the input and predict the probability that an iris belongs to Iris versicolor.

gradient-descent1-stamped

  1. The type of the model used (either Logistic regression or Linear regression)
  2. Input features (one for X and one for Y axis) and the target class
  3. Learning rate is the step size of the gradient descent
  4. In a single iteration step, the stochastic approach considers only a single data instance (instead of the entire training set). Convergence in terms of iteration steps is slower, and we can instruct the widget to display the progress of the optimization only after a given number of steps (Step size)
  5. Step through the algorithm (steps can be reverted with step back button)
  6. Run optimization until convergence

 

The following shows gradient descent for linear regression on the Boston Housing data set, trying to predict the median value of a house given its age.

gradient-descent-age

On the left we use the regular and on the right the stochastic gradient descent. While the regular descent goes straight to the target, the path of the stochastic variant is not as smooth.
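In code, the only difference is how many instances each step looks at. Below is a minimal sketch of the stochastic update on the same toy data as in the sketch above; again, this is just an illustration, not the widget's code (a smaller learning rate is used to keep the noisy updates stable).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

intercept, slope, learning_rate = 0.0, 0.0, 0.01
rng = np.random.RandomState(0)

for step in range(2000):
    i = rng.randint(len(x))                       # pick a single training instance
    error_i = (intercept + slope * x[i]) - y[i]   # error on that instance only
    intercept -= learning_rate * 2 * error_i
    slope -= learning_rate * 2 * error_i * x[i]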

We can use the widget to simulate some dangerous, unwanted behavior of gradient descent. The following screenshots show two extreme cases: a learning rate that is too high, where the optimization never converges, and a learning rate that is too low, where convergence is painfully slow.

gradient-descent-extrems

The two problems illustrated above are the reason that many implementations of numerical optimization use adaptive learning rates. We can simulate this in the widget by modifying the learning rate for each step of the optimization.
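A simple way to mimic this in a script is to let the learning rate decay with the step number. The schedule below is just one common choice and not necessarily what the widget does; the data are the same toy values as above.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
intercept, slope = 0.0, 0.0
base_rate = 0.1

for step in range(200):
    rate = base_rate / (1 + 0.05 * step)   # the learning rate shrinks every step
    error = (intercept + slope * x) - y
    intercept -= rate * 2 * error.mean()
    slope -= rate * 2 * (error * x).mean()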

Making recommendations

This is a guest blog from the Google Summer of Code project.

 

Recommender systems are everywhere; we can find them on YouTube, Amazon, Netflix, iTunes, and so on. This is because they are a crucial component of competitive retail services.

How can I know what you may like if I have almost no information about you? The answer: collaborative filtering (CF). Basically, this means combining the little knowledge we have about users and/or items to build a grid of knowledge from which we make recommendations.

To help you with that, Biolab has written Orange3-Recommendation – an add-on for Orange3 to train recommendation models, cross-validate them and make predictions.

Input data

First things first. Orange3-Recommendation can read files in the native tab-delimited format, or it can load data from any of the major standard spreadsheet file types, like CSV and Excel. The native format starts with a header row with feature (column) names. The second header row gives the attribute type, which can be continuous, discrete, string or time. The third header line contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta).

Here are the first few lines from a data set:

    tid      user        movie       score
    string   discrete    discrete    continuous
    meta     row=1       col=1       class
    1        Breza       HarrySally  2
    2        Dana        Cvetje      5
    3        Cene        Prometheus  5
    4        Ksenija     HarrySally  4
    5        Albert      Matrix      4
    ...

The third row is mandatory in this kind of dataset*, so that the add-on knows which attributes correspond to the users (row=1) and which ones to the items (col=1). For big datasets, users and items must be specified as continuous attributes due to efficiency issues. (*Note: If the row or col meta attributes are missing, some simple heuristics are applied: users=column 0, items=column 1, class=last column.)

Here are the first few lines from a bigger data set, with users and items declared as continuous attributes:

    user            movie         score         tid
    continuous      continuous    continuous    time
    row=1           col=1         class         meta
    196             242           3             881250949
    186             302           3             891717742
    22              377           1             878887116
    244             51            2             880606923
    166             346           1             886397596
    298             474           4             884182806
    ...
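Either format can also be loaded from a script with Orange's standard table reader. A minimal sketch, assuming the first example above has been saved as 'ratings.tab' (the file name is ours):

import Orange

# Load the tab-delimited file; Orange parses the three header rows itself.
data = Orange.data.Table('ratings.tab')

print(data.domain.attributes)   # the user and movie columns
print(data.domain.class_var)    # the score column, marked as class
print(data[:5])                 # the first few ratings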

Training a model

This step is pretty simple. To train a model we have to load the data as described above and connect it to the learner. (Don't forget to click Apply.)

data to brismf

If the model uses side information, we only need to add an extra file.

TrustSVD

In addition, we can set the parameters of our model by double-clicking it:

Screen Shot 2016-08-22 at 15.49.56

By using a fixed seed, we make random numbers predictable. Therefore, this feature is useful if we want to compare results in a deterministic way.

Cross-validation

This is as simple as it seems. The only thing to point out is that side information must be connected to the model.

cv-recommendation

 

Still, cross-validation is a robust way to see how our model is performing. I think it is a good idea to check how our model performs with respect to the baseline. This adds a negligible overhead* to our pipeline and makes our analysis more solid. (*For 1,000,000 ratings, it can take 0.027 s.)

We can add a baseline learner to Test&Score and select the model we want to apply.

Baselines
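The same comparison can be scripted. The sketch below cross-validates BRISMF against a global-average baseline; the learner name GlobalAvgLearner and the exact call signatures are taken from the add-on and Orange as I understand them, so check the documentation of your version if anything differs.

import Orange
from orangecontrib.recommendation import BRISMFLearner, GlobalAvgLearner

data = Orange.data.Table('movielens100k.tab')
learners = [GlobalAvgLearner(), BRISMFLearner(num_factors=15, num_iter=25)]

# 5-fold cross-validation; compare the models with RMSE (lower is better).
results = Orange.evaluation.CrossValidation(data, learners, k=5)
print(Orange.evaluation.RMSE(results))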

Making recommendations

The prediction flow is exactly the same as in Orange3.

Recommendation-predictions

Analyzing low-rank matrices

all-rank-dis

 

Once we've output the low-rank matrices, we can play around with the vectors in those matrices to discover hidden relations or understand the known ones. For instance, here we plot vectors 1 and 2 from the item-feature matrix by simply connecting a Data Table with the selected instances to the Scatter Plot widget.

Visualizing vectors
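If you prefer scripting, roughly the same plot can be produced with matplotlib. The attribute name recommender.Q for the item-feature matrix below is an assumption, so check the add-on's documentation for how the trained model actually exposes its low-rank matrices; recommender stands for a model trained as in the scripting section further below.

import matplotlib.pyplot as plt

# Assumption: the trained model exposes its item-feature matrix as a numpy
# array `recommender.Q`, with one row per item and one column per latent vector.
Q = recommender.Q

plt.scatter(Q[:, 0], Q[:, 1], s=10)
plt.xlabel('vector 1')
plt.ylabel('vector 2')
plt.show()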

Using similar approaches we can discover pretty interesting things, like the similarity between movies or users, how movie genres relate to each other, changes in users' behavior, or when the popularity of a movie has risen due to a commercial campaign, among many others.

Finally, a simple pipeline to do all of the above can be something like this:

workflow-recommendation

On the left side we connected several models to Test&Score in order to cross-validate them. Later, we trained an SVD++ model, made some predictions, got the low-rank matrices learnt by the model and plotted some vectors of the item-feature matrix.

Analysis (Advanced users)

Here we've made a workflow (which can be downloaded here) to perform a really basic analysis of the results obtained by factoring the user and item feature matrices with BRISMF over the MovieLens100K dataset. (Note: once downloaded, place the prepared datasets in the folder 'orange'. You will probably get a couple of errors. Don't worry, that's normal. To solve them, apply the scripts sequentially, but don't forget to first select all the rows in the related Table.)

Instead of explaining how this pipeline works, the best thing you can do is to download it and play with it.

Complex flow

 

One of the analyses you can do is to plot the most popular movies across the first two vectors of the matrix decomposition. Later, you can try to find clusters, tweak it a bit and find crossed relations (e.g. male/female vs. action/drama).

Cluster movies

Now let’s focus on the scripting part.

Rating models

In this tutorial we are going to train a BRISMF model.

1. First we import Orange and the learner that we want to use:

import Orange
from orangecontrib.recommendation import BRISMFLearner

 

2. After that, we have to load a dataset:

data = Orange.data.Table('movielens100k.tab')

 

3. Then we set the learner parameters, and finally we train it passing the dataset as an argument (the returned value will be our model trained):

learner = BRISMFLearner(num_factors=15, num_iter=25, learning_rate=0.07, lmbda=0.1)
recommender = learner(data)

 

4. Finally, we can make predictions (in this case, for the first three pairs in the dataset):

prediction = recommender(data[:3])
print(prediction)
>>> [ 3.79505151 3.75096513 1.293013 ]

Ranking models

At this point we can try something new. Let's make recommendations for a dataset in which only binary relevance is available. For this case, CLiMF is a model that will suit our needs.

import Orange
import numpy as np
from orangecontrib.recommendation import CLiMFLearner

# Load data
data = Orange.data.Table('epinions_train.tab')

# Train recommender
learner = CLiMFLearner(num_factors=10, num_iter=10, learning_rate=0.0001, lmbda=0.001)
recommender = learner(data)

# Make recommendations
recommender(X=5)
>>> [ 494,   803,   180, ..., 25520, 25507, 30815]

 

Later, we can score the model. In this case we’re using the MeanReciprocalRank:

import Orange
import numpy as np

# Load test dataset
testdata = Orange.data.Table('epinions_test.tab')

# Sample users
num_users = len(recommender.U)
num_samples = min(num_users, 1000)  # max. number of users to sample
users_sampled = np.random.choice(np.arange(num_users), num_samples)

# Compute Mean Reciprocal Rank (MRR) 
mrr, _ = recommender.compute_mrr(data=testdata, users=users_sampled) 
print('MRR: %.4f' % mrr) 
>>> MRR: 0.3975

SGD optimizers

This add-on includes several configurations that can be used to modify the updates of the low-rank matrices during the stochastic gradient descent optimization (a minimal sketch of the first two updates follows the list).

  • SGD: Classical SGD update.
  • Momentum: SGD with inertia.
  • Nesterov momentum: A Momentum that “looks ahead”.
  • AdaGrad: Optimizer that adapts its learning rate during the process.
  • RMSProp: “Leaky” AdaGrad.
  • AdaDelta: Extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
  • Adam: Similar to AdaGrad and RMSProp but with an exponentially decaying average of past gradients.
  • Adamax: Similar to Adam, but taking the maximum between the gradient and the velocity.
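As a rough illustration of how the first two of these differ, here is a minimal numpy sketch of the classical SGD update and the momentum update for a single parameter vector. It is conceptual only, not the add-on's code, and the gradient is just a random stand-in.

import numpy as np

learning_rate, momentum = 0.01, 0.9

def sgd_step(params, grad):
    # Classical SGD: step directly against the gradient.
    return params - learning_rate * grad

def momentum_step(params, velocity, grad):
    # Momentum: accumulate a velocity, so consistent gradients build up speed.
    velocity = momentum * velocity - learning_rate * grad
    return params + velocity, velocity

params = np.zeros(5)              # stand-in for one row of a low-rank matrix
velocity = np.zeros_like(params)
grad = np.random.randn(5)         # stand-in for a gradient computed from the data
params = sgd_step(params, grad)
params, velocity = momentum_step(params, velocity, grad)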

 

Do you want to learn more about this? Check our documentation!

Visualization of Classification Probabilities

This is a guest blog from the Google Summer of Code project.

 

The Polynomial Classification widget is implemented as a part of my Google Summer of Code project, along with other widgets in the educational add-on (see my previous blog). It visualizes probabilities for two-class classification (target vs. rest) using a color gradient and contour lines, and it can do so for any Orange learner.

Here is an example workflow. The data comes from the File widget. With no learner on the input, the default is Logistic Regression. The widget outputs Coefficients, a Classifier (model) and a Learner.

poly-classification-flow

The Polynomial Classification widget works on two continuous features only; all other features are ignored. The screenshot shows the classification plot for the Iris data set.

polynomial-classification-1-stamped

  1. Set the name of the learner. This is the name of the learner on the output.
  2. Set the features that logistic regression is performed on.
  3. Set the class that is classified separately from the other classes.
  4. Set the degree of the polynomial that is used to transform the input data (1 means attributes are not transformed).
  5. Select whether or not to show contour lines in the chart. The density of contours is regulated by Contour step.

 

The classification in our case fails to separate Iris-versicolor from the other two classes. This is because logistic regression is a linear classifier, and because there is no linear combination of the chosen two attributes that would make for a good decision boundary. We can change that. Polynomial expansion adds features that are polynomial combinations of the original ones. For example, if the input data contains features [a, b], polynomial expansion of degree two generates the feature space [1, a, b, a², ab, b²]. With this expansion, the classification boundary looks great.

polynomial-classification-2
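As a quick illustration of the mapping described above (the widget performs this transformation for you), here is a minimal numpy sketch of the degree-two expansion of two features:

import numpy as np

def expand_degree_two(X):
    """Map feature columns [a, b] to [1, a, b, a^2, a*b, b^2]."""
    a, b = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), a, b, a ** 2, a * b, b ** 2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(expand_degree_two(X))
# [[ 1.  1.  2.  1.  2.  4.]
#  [ 1.  3.  4.  9. 12. 16.]]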

 

Polynomial Classification also works well with other learners. Below we have given it a Classification Tree. This time we have painted the input data using Paint Data, a great data generator used while learning about Orange and data science. The decision boundaries for the tree are all square, a well-known limitation for tree-based learners.

poly-classification-4e

 

Polynomial expansion with high degrees may be dangerous. The following example shows overfitting when the degree is five. See the two outliers, a blue one at the top and a red one at the lower right of the plot? The classifier was needlessly able to separate the outliers from the pack, something that will become problematic when the classifier is used on new data.

poly-classification-owerfit

Overfitting is one of the central problems in machine learning. You are welcome to read our previous blog on this problem and possible solutions.

Interactive k-Means

This is a guest blog from the Google Summer of Code project.

 

As a part of my Google Summer of Code project I started developing educational widgets and assembling them in an Educational Add-On for Orange. Educational widgets can be used by students to understand how some key data mining algorithms work and by teachers to demonstrate the working of these algorithms.

Here I describe an educational widget for interactive k-means clustering, an algorithm that splits the data into clusters by finding cluster centroids such that the distance between data points and their corresponding centroid is minimized. The number of clusters in the k-means algorithm is denoted by k and has to be specified manually.

The algorithm starts by randomly positioning the centroids in the data space and then improves their position by repeating the following two steps (a minimal code sketch of both steps follows the list):

  1. Assign each point to the closest centroid.
  2. Move centroids to the mean position of points assigned to the centroid.
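Here is that loop as a minimal numpy sketch, meant only to illustrate what the widget animates; the data, k and iteration count are arbitrary.

import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random starting centroids
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):            # keep a centroid with no points as is
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.random.RandomState(1).rand(100, 2)      # 100 points with two features
labels, centroids = kmeans(X, k=3)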

The widget needs data, which can come from the File widget, and outputs the information on clusters (Annotated Data) and the centroids:

kmans_shema

The educational widget for k-means finds clusters based on two continuous features only; all other features are ignored. The screenshot shows a plot of the Iris data set and clustering with k=3. That is partially cheating, because we know that the Iris data set has three classes, so we can check whether the clusters correspond well to the original classes:

kmeans2-stamped

  1. Select two features that are used in k-means
  2. Set number of centroids
  3. Randomize positions of centroids
  4. Show lines between centroids and corresponding points
  5. Perform the algorithm step by step. Reassign membership connects points to the nearest centroid, Recompute centroids moves the centroids.
  6. Step back in the algorithm
  7. Set speed of automatic stepping
  8. Perform the whole algorithm as fast preview
  9. At any time we can change the number of centroids with the spinner or by clicking the desired position in the graph.

If we want to see the correspondence between the clusters found by k-means and the classes, we can open the Data Table widget, where we see that all Iris-setosa instances are placed in one cluster, and only a few Iris-versicolor instances end up in the same cluster as Iris-virginica, and vice versa.

kmeans3

Interactive k-means works great in combination with Paint Data. There, we can design data sets where k-means fails, and observe why.

kmeans-failt

We could also design data sets where k-means fails under a specific initialization of centroids. Ah, I did not tell you that you can freely move the centroids and then restart the algorithm. Below we show such a case of centroid initialization and how it leads to non-optimal clustering.

kmeans-f-join

Rule Induction (Part I – Scripting)

This is a guest blog from the Google Summer of Code project.

 

We’ve all heard the saying, “Rules are meant to be broken.” Regardless of how you might feel about the idea, one thing is certain: rules must first be learnt. My 2016 Google Summer of Code project revolves around doing just that. I am developing classification rule induction techniques for Orange; here I describe the code currently available in the pull request, which will become part of the official distribution in the upcoming release 3.3.8.

Rule induction from examples is recognised as a fundamental component of many machine learning systems. My goal was foremost to implement supervised rule induction algorithms and rule-based classification methods, but also to devise a more general framework of replaceable individual components that users could fine-tune to their needs. To this purpose, the separate-and-conquer strategy was applied. In essence, learning instances are covered and removed following a chosen rule. The process is repeated while examples remain in the learning set. To evaluate the found hypotheses and to choose the best rule in each iteration, search heuristics are used (primarily, the rule's class distribution is the decisive determinant).
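The separate-and-conquer loop itself is easy to sketch. The outline below is conceptual only, not the actual Orange implementation; find_best_rule and covers are hypothetical stand-ins for whatever search and matching components are plugged in.

def separate_and_conquer(instances, find_best_rule, covers):
    """Conceptual outline of the covering loop used by rule inducers."""
    rule_list = []
    while instances:                             # learning examples remain
        rule = find_best_rule(instances)         # search guided by a heuristic
        if rule is None:                         # no acceptable rule was found
            break
        rule_list.append(rule)
        # "Separate": drop the covered instances, then "conquer" the rest.
        instances = [inst for inst in instances if not covers(rule, inst)]
    return rule_list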

The use of the created module is straightforward. New rule induction algorithms can be easily introduced, by either utilising predefined components or developing new ones (these include various search algorithms, search strategies, evaluators, and others). Several well-known rule induction algorithms have already been included. Let’s see how they perform!

The classic CN2 inducer constructs a list of ordered rules (a decision list). Here, we load the titanic data set and create a simple classifier, which can already be used to predict data.

import Orange
data = Orange.data.Table('titanic')
learner = Orange.classification.CN2Learner()
classifier = learner(data)

Similarly, a set of unordered rules can be constructed using the unordered CN2 inducer. Rules are learnt for each class individually, with regard to the original learning data. To evaluate the found hypotheses, the Laplace accuracy measure is used. Having first initialised the learner, we then control the algorithm by modifying its parameters. The underlying components are available to us by accessing the rule finder.

data = Orange.data.Table('iris.tab')
learner = Orange.classification.CN2UnorderedLearner()

# consider up to 10 solution streams at one time
learner.rule_finder.search_algorithm.beam_width = 10

# continuous value space is constrained to reduce computation time
learner.rule_finder.search_strategy.bound_continuous = True

# found rules must cover at least 15 examples
learner.rule_finder.general_validator.min_covered_examples = 15

# found rules must combine at most 2 selectors (conditions)
learner.rule_finder.general_validator.max_rule_length = 2

classifier = learner(data)
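For reference, the Laplace accuracy of a rule that covers n_covered examples, n_correct of them from the predicted class, over k classes, can be computed as in this small sketch:

def laplace_accuracy(n_correct, n_covered, k):
    # Laplace-corrected accuracy pulls the estimate for rules that cover
    # few examples towards the prior 1 / k.
    return (n_correct + 1) / (n_covered + k)

# e.g. a rule covering 49 examples, 46 of them from the predicted class, with 3 classes
print(laplace_accuracy(n_correct=46, n_covered=49, k=3))   # ~0.904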

Induced rules can be quickly reviewed and interpreted. They are each of the form “if cond then predict class”. That is, a conjunction of selectors followed by the predicted class.

for rule in classifier.rule_list:
    print(rule, rule.curr_class_dist.tolist())

>>> IF petal length<=3.0 AND sepal width>=2.9 THEN iris=Iris-setosa [49, 0, 0]
>>> IF petal length>=3.0 AND petal length<=4.8 THEN iris=Iris-versicolor [0, 46, 3]
>>> IF petal width>=1.8 AND petal length>=4.9 THEN iris=Iris-virginica [0, 0, 43]
>>> IF TRUE THEN iris=Iris-virginica [50, 50, 50]  # the default rule

If no other rule fires, the default rule (majority classification) is used. How the default rule is applied varies with each individual rule inducer.

Though rule learning is most frequently used in the context of predictive induction, it can be adapted to subgroup discovery. In contrast, subgroup discovery aims at learning individual patterns or interesting population subgroups, rather than maximising classification accuracy. Induced rules prove very valuable in terms of their descriptive power. To this end, CN2-SD algorithms were also implemented.

Hopefully, this addition to the Orange software suite will benefit both novice and expert users looking to advance their knowledge in a particular area of study, through a better understanding of the given predictions and the underlying argumentation.