Visualizing Gradient Descent

This is a guest blog from the Google Summer of Code project.

 

Gradient Descent was implemented as a part of my Google Summer of Code project and is available in the Orange3-Educational add-on. It simulates gradient descent for either logistic or linear regression, depending on the type of the input data. Gradient descent is an iterative approach to optimizing model parameters so as to minimize the cost function. In machine learning, the cost function corresponds to the prediction error when the model is used on the training data set.
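
To make the idea concrete, here is a minimal sketch of batch gradient descent for simple linear regression (intercept and slope), written with NumPy purely for illustration; the widget's own implementation may differ in details such as cost scaling and the stopping criterion.

import numpy as np

def gradient_descent(x, y, learning_rate=0.01, n_steps=1000):
    # Fit intercept (theta0) and slope (theta1) by minimizing the mean squared error.
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_steps):
        error = (theta0 + theta1 * x) - y                   # prediction error on the training data
        theta0 -= learning_rate * 2 * error.mean()          # gradient w.r.t. the intercept
        theta1 -= learning_rate * 2 * (error * x).mean()    # gradient w.r.t. the slope
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))    # approaches the least-squares fit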

The Gradient Descent widget takes data on its input and outputs the model and its coefficients.

gradient-descent-flow

The widget displays the value of the cost function given two parameters of the model. For linear regression, we consider a single feature from the training set, with the parameters being the intercept and the slope. For logistic regression, the widget considers two features and their associated multiplicative parameters, setting the intercept to zero. The screenshot below shows gradient descent on the Iris data set, where we take petal length and sepal width on the input and predict the probability that an iris belongs to Iris versicolor.

gradient-descent1-stamped

  1. The type of the model used (either Logistic regression or Linear regression)
  2. Input features (one for X and one for Y axis) and the target class
  3. Learning rate is the step size of the gradient descent
  4. In a single iteration step, the stochastic approach considers only a single data instance (instead of the entire training set). Convergence in terms of iteration steps is slower, and we can instruct the widget to display the progress of the optimization only after a given number of steps (Step size)
  5. Step through the algorithm (steps can be reverted with step back button)
  6. Run optimization until convergence

 

The following shows gradient descent for linear regression on the Boston Housing data set when trying to predict the median value of a house given its age.

gradient-descent-age

On the left we use the regular and on the right the stochastic gradient descent. While the regular descent heads straight to the target, the path of the stochastic variant is not as smooth.
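
For comparison, a single stochastic update can be sketched like this: instead of averaging the gradient over the whole training set, one randomly chosen instance drives each step (again just an illustration, not the widget's code).

import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta0, theta1, x, y, learning_rate=0.01):
    i = rng.integers(len(x))                     # pick a single training instance
    error = (theta0 + theta1 * x[i]) - y[i]
    theta0 -= learning_rate * 2 * error          # gradient from that one instance only
    theta1 -= learning_rate * 2 * error * x[i]
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
theta0, theta1 = 0.0, 0.0
for _ in range(1000):
    theta0, theta1 = sgd_step(theta0, theta1, x, y)
print(theta0, theta1)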

We can use the widget to simulate some dangerous, unwanted behavior of gradient descent. The following screenshots show two extreme cases: a learning rate that is too high, where the optimization never converges, and a learning rate that is too low, where convergence is painfully slow.

gradient-descent-extrems

The two problems illustrated above are the reason that many implementations of numerical optimization use adaptive learning rates. We can simulate this in the widget by modifying the learning rate at each step of the optimization.

Visualization of Classification Probabilities

This is a guest blog from the Google Summer of Code project.

 

The Polynomial Classification widget is implemented as a part of my Google Summer of Code project, along with other widgets in the educational add-on (see my previous blog). It visualizes probabilities for two-class classification (target vs. rest) using a color gradient and contour lines, and it can do so for any Orange learner.

Here is an example workflow. The data comes from the File widget. With no learner on the input, the default is Logistic Regression. The widget outputs the Coefficients, the Classifier (model) and the Learner.

poly-classification-flow

The Polynomial Classification widget works on two continuous features only; all other features are ignored. The screenshot shows a plot of the classification for the Iris data set.

polynomial-classification-1-stamped

  1. Set the name of the learner. This is the name of the learner on the output.
  2. Set the features that logistic regression is performed on.
  3. Set the class that is classified separately from the other classes.
  4. Set the degree of the polynomial used to transform the input data (1 means the attributes are not transformed).
  5. Select whether to show contour lines in the chart. The density of the contours is regulated by Contour step.

 

The classification for our case fails in separating Iris-versicolor from the other two classes. This is because logistic regression is a linear classifier and there is no linear combination of the chosen two attributes that would make for a good decision boundary. We can change that. Polynomial expansion adds features that are polynomial combinations of the original ones. For example, if the input data contains features [a, b], polynomial expansion of degree two generates the feature space [1, a, b, a², ab, b²]. With this expansion, the classification boundary looks great.
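
As a rough sketch of what polynomial expansion does before logistic regression, here is the same idea expressed with scikit-learn (used here only for illustration; the widget performs the expansion internally):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # two features [a, b]
y = np.array([0, 0, 1, 1])

expanded = PolynomialFeatures(degree=2).fit_transform(X)   # columns [1, a, b, a², ab, b²]
model = LogisticRegression().fit(expanded, y)              # a linear model on the expanded space
print(model.predict(expanded))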

polynomial-classification-2

 

Polynomial Classification also works well with other learners. Below we have given it a Classification Tree. This time we have painted the input data using Paint Data, a great data generator used while learning about Orange and data science. The decision boundaries for the tree are all axis-aligned, a well-known limitation of tree-based learners.

poly-classification-4e

 

Polynomial expansion with high degrees may be dangerous. The following example shows overfitting when the degree is five. See the two outliers, a blue one at the top and a red one at the lower right of the plot? The classifier was, unnecessarily, able to separate the outliers from the pack, something that will become problematic when the classifier is used on new data.

poly-classification-owerfit

Overfitting is one of the central problems in machine learning. You are welcome to read our previous blog on this problem and possible solutions.

Interactive k-Means

This is a guest blog from the Google Summer of Code project.

 

As a part of my Google Summer of Code project I started developing educational widgets and assembling them in an Educational Add-On for Orange. Educational widgets can be used by students to understand how some key data mining algorithms work and by teachers to demonstrate the working of these algorithms.

Here I describe an educational widget for interactive k-means clustering, an algorithm that splits the data into clusters by finding cluster centroids such that the distance between data points and their corresponding centroid is minimized. The number of clusters in the k-means algorithm is denoted by k and has to be specified manually.

The algorithm starts by randomly positioning the centroids in the data space, and then improves their positions by repeating the following two steps (a minimal sketch of the loop follows the list):

  1. Assign each point to the closest centroid.
  2. Move centroids to the mean position of points assigned to the centroid.
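
Here is a minimal sketch of this loop for points described by two continuous features, purely for illustration; the widget animates exactly these two steps.

import numpy as np

def kmeans(points, k=3, n_iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iterations):
        # step 1: assign each point to the closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 2: move each centroid to the mean position of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

points = np.random.default_rng(1).random((30, 2))   # 30 points with two features
labels, centroids = kmeans(points, k=3)
print(centroids)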

The widget needs data, which can come from the File widget, and outputs the information on clusters (Annotated Data) and the centroids:

kmans_shema

The educational widget for k-means finds clusters based on two continuous features only; all other features are ignored. The screenshot shows a plot of the Iris data set and clustering with k=3. This is partially cheating, because we know that the Iris data set has three classes, so we can check whether the clusters correspond well to the original classes:

kmeans2-stamped

  1. Select two features that are used in k-means
  2. Set number of centroids
  3. Randomize positions of centroids
  4. Show lines between centroids and corresponding points
  5. Perform the algorithm step by step. Reassign membership connects points to nearest centroid, Recompute centroids moves centroids.
  6. Step back in the algorithm
  7. Set speed of automatic stepping
  8. Perform the whole algorithm as a fast preview
  9. At any time we can change the number of centroids with the spinner or by clicking at the desired position in the graph.

If we want to see the correspondence between the clusters found by k-means and the classes, we can open the Data Table widget. There we see that all Iris-setosa instances fall into one cluster, while a few Iris-versicolor instances are placed in the same cluster as Iris-virginica, and vice versa.

kmeans3

Interactive k-means works great in combination with Paint Data. There, we can design data sets where k-means fails, and observe why.

kmeans-failt

We can also design data sets where k-means fails under a specific initialization of the centroids. Ah, I did not tell you that you can freely move the centroids and then restart the algorithm. Below we show a case of centroid initialization that leads to non-optimal clustering.

kmeans-f-join

Rule Induction (Part I – Scripting)

This is a guest blog from the Google Summer of Code project.

 

We’ve all heard the saying, “Rules are meant to be broken.” Regardless of how you might feel about the idea, one thing is certain: rules must first be learnt. My 2016 Google Summer of Code project revolves around doing just that. I am developing classification rule induction techniques for Orange; here I describe the code currently available in the pull request, which will become part of the official distribution in the upcoming release 3.3.8.

Rule induction from examples is recognised as a fundamental component of many machine learning systems. My goal was foremost to implement supervised rule induction algorithms and rule-based classification methods, but also to devise a more general framework of replaceable individual components that users can fine-tune to their needs. To this purpose, the separate-and-conquer strategy was applied: in essence, the learning instances covered by a chosen rule are removed, and the process is repeated while learning examples remain. To evaluate candidate hypotheses and choose the best rule in each iteration, search heuristics are used (primarily, rule class distribution is the decisive determinant).

The use of the created module is straightforward. New rule induction algorithms can be easily introduced, by either utilising predefined components or developing new ones (these include various search algorithms, search strategies, evaluators, and others). Several well-known rule induction algorithms have already been included. Let’s see how they perform!

The classic CN2 inducer constructs a list of ordered rules (a decision list). Here, we load the titanic data set and create a simple classifier, which can already be used to predict data.

import Orange
data = Orange.data.Table('titanic')
learner = Orange.classification.CN2Learner()
classifier = learner(data)

Similarly, a set of unordered rules can be constructed using the Unordered CN2 inducer. Rules are learnt for each class individually, with regard to the original learning data. To evaluate found hypotheses, the Laplace accuracy measure is used. Having first initialised the learner, we then control the algorithm by modifying its parameters. The underlying components are available to us by accessing the rule finder.

from Orange.data import Table
from Orange.classification import CN2UnorderedLearner

data = Table('iris.tab')
learner = CN2UnorderedLearner()

# consider up to 10 solution streams at one time
learner.rule_finder.search_algorithm.beam_width = 10

# continuous value space is constrained to reduce computation time
learner.rule_finder.search_strategy.bound_continuous = True

# found rules must cover at least 15 examples
learner.rule_finder.general_validator.min_covered_examples = 15

# found rules must combine at most 2 selectors (conditions)
learner.rule_finder.general_validator.max_rule_length = 2

classifier = learner(data)

Induced rules can be quickly reviewed and interpreted. They are each of the form “if cond then predict class”; that is, a conjunction of selectors followed by the predicted class.

for rule in classifier.rule_list:
    print(rule, rule.curr_class_dist.tolist())

>>> IF petal length<=3.0 AND sepal width>=2.9 THEN iris=Iris-setosa [49, 0, 0]
>>> IF petal length>=3.0 AND petal length<=4.8 THEN iris=Iris-versicolor [0, 46, 3]
>>> IF petal width>=1.8 AND petal length>=4.9 THEN iris=Iris-virginica [0, 0, 43]
>>> IF TRUE THEN iris=Iris-virginica [50, 50, 50]  # the default rule

If no other rule fires, the default rule (majority classification) is used. How the default rule is applied varies from one rule inducer to another.

Though rule learning is most frequently used in the context of predictive induction, it can be adapted to subgroup discovery. In contrast to predictive induction, subgroup discovery aims at finding individual patterns or interesting population subgroups rather than at maximising classification accuracy. Induced rules prove very valuable in terms of their descriptive power. To this end, the CN2-SD algorithms were also implemented.
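
A minimal usage sketch, assuming the CN2-SD learners are exposed alongside the CN2 learners shown above (the name CN2SDUnorderedLearner reflects the code in the pull request and may change before release):

from Orange.data import Table
from Orange.classification import CN2SDUnorderedLearner

data = Table('iris')
learner = CN2SDUnorderedLearner()      # configured through rule_finder, as above
classifier = learner(data)
for rule in classifier.rule_list:      # review the induced subgroup-describing rules
    print(rule)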

Hopefully, the addition to the Orange software suite will benefit both novice and expert users looking to advance their knowledge in a particular area of study, through a better understanding of given predictions and the underlying argumentation.

Univariate GSoC Success

The Google Summer of Code application period has come to an end. We’ve received 34 applications, some of which were of truly high quality. Now it’s upon us to select the top candidates, but before that we wanted to have an overview of the candidate pool. We’ve gathered the data from our Google Form application and gave it a quick view in Orange.

First, we needed to preprocess the data a bit, since it came in a messy form of strings. Feature Constructor to the rescue! We wanted to extract the OS usage across users, so we first made three new variables named ‘uses linux’, ‘uses windows’ and ‘uses osx’ to represent our three new columns. For each column we searched through ‘OS_of_choice_and_why’, took the value of the column, converted it to a string, put the string in lowercase, and looked for mentions of either ‘linux’, ‘windows’ or ‘osx’. Voilà: if a mention occurred in the string, we marked the column with 1, else with 0.

 

blog10

The expression is just a logical statement in Python and works with booleans (0 if False and 1 if True):

'linux' in str(OS_of_choice_and_why_.value).lower() or 'ubuntu' in str(OS_of_choice_and_why_.value).lower()

 

Another thing we might want to do is create three discrete values for the “Dogs or cats” question. We want Orange to display ‘dogs’ for someone who replied ‘dogs’, ‘cats’ for someone who replied ‘cats’ and ‘?’ if the question was left blank or the answer was very creative (we had people who wanted to be elephants and butterflies 🙂 ).

To create three discrete values you would write:

0 if 'dogs' in str(Dogs_or_cats_.value).lower() else 1 if  'cats' in str(Dogs_or_cats_.value).lower() else 2

Since we have three values, we need to assign them the corresponding indexes. So if there is ‘dogs’ in the reply, we get 0 (which we mapped to ‘dogs’ in the Feature Constructor’s ‘Values’ box), 1 if there is ‘cats’ in the reply, and 2 if none of the above apply.
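
The same logic, written as a plain Python function for readability (inside Orange the one-line expression above goes into the Feature Constructor):

def dogs_or_cats(reply):
    # returns the index of the discrete value: 0 -> 'dogs', 1 -> 'cats', 2 -> '?'
    reply = str(reply).lower()
    return 0 if 'dogs' in reply else 1 if 'cats' in reply else 2

print(dogs_or_cats('Dogs, definitely!'))          # -> 0
print(dogs_or_cats('I want to be a butterfly'))   # -> 2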

blog9

OK, the next step was to sift through a big pile of attributes. We removed personal information for privacy reasons and selected the attributes we cared about the most, for example programming skills, years of experience, contributions to OSS, and of course whether someone is a dog or a cat person. 🙂 Select Columns sorts this out. Here you can download a mock-up workflow (same as above, but without sensitive data).

Now for some lovely charts. Enjoy!

blog5
Python is our lingua franca, experts wanted!

 

blog8
20 years of programming experience? Hello outlier!

 

blog2
OSS all the way!

 

blog3
Some people love dogs and some love cats. Others prefer elephants and butterflies.

 

 

Orange GSoC: Computer vision add-on for Orange

This summer I got the chance to develop an add-on for Orange that will introduce basic computer vision functionality, as a part of Google Summer of Code.

The add-on will consist of a set of widgets, each with its own dedicated purpose, which can be seamlessly connected to provide the most commonly used image preprocessing functionality.

Here is a list of the widgets:

  • Widget for viewing image files
  • Widget for resizing an image
  • Widget for rotation/flipping of the image
  • Widget for converting the color mode (RGB, HSV, Grayscale etc.)
  • Widget for changing the hue/saturation, brightness/contrast and inverting the image
  • Widget for generic transformations through convolution with a matrix

Also, if there is enough time left throughout the GSoC period, a face detection widget will be built in order to demonstrate the power of the underlying libraries.

These are all things that have been implemented in Python before. Reimplementing them is of course a rather bad idea, so I will use a library called OpenCV. It is written in C++, has Python bindings, and is by far the most widely used computer vision library. The core of the widgets will be written with it, and the GUI with PyQt, the library used for building the Orange Canvas.
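
For a taste of what the widgets will wrap, here is a small OpenCV sketch covering a few of the planned operations (the file name is just a placeholder for illustration):

import cv2

image = cv2.imread('example.png')                    # load the image as a BGR array
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)       # convert the color mode
small = cv2.resize(gray, (0, 0), fx=0.5, fy=0.5)     # resize to half the original size
flipped = cv2.flip(small, 1)                         # flip horizontally
cv2.imwrite('example-processed.png', flipped)        # save the result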

Although working with images is not Orange’s main focus, the knowledge gathered while developing the add-on will be used to improve Orange in a number of ways: finding a general structure for add-ons developed in the future, and improving the way they are distributed and tested.

Finally, I want to thank the Orange core team for having faith in me and giving me the chance to spend the summer working on an idea I care about. I’m very grateful for that and I hope I’ll exceed their expectations.

Orange GSoC: A Fully-Featured Neural Network Library Implementation with Extension for Deep Learning

This project aims to build a neural network library based on some great existing NN libraries, notably the Flood Library, which already provides a fully functional Multilayer Perceptron (MLP) implementation. The project starts with implementing a robust, efficient feed forward neural network library, and then will extend it in significant ways that add support for state-of-the-art deep learning techniques. Additional extensions include building a PCA framework and improving existing training algorithms and error functional.

Orange GSoC: Multi-Target Learning for Orange

Orange already supports multi-target classification, but the current implementation of clustering trees is written in Python. One of the five projects Orange has chosen at this year’s Google Summer of Code is the implementation of clustering trees in C. The goal of my project is to speed up the building time of clustering trees and lower their spatial complexity, especially when used in random forests. Implementation will be based on Orange’s SimpleTreeLearner and will be integrated with Orange 3.0.

Once the clustering trees are implemented and integrated, documentation and unit tests will be written. Additionally, I intend to conduct an experimental study that will compare the effectiveness of clustering trees with established multi-target classifiers (like PLS and chain classifiers) on benchmark data sets. I will also work on some additional tasks related to multi-target classification that I had not included in my original proposal but that Orange’s team thinks would be useful to include. Among these is a chain classifier framework, which Orange is currently missing.

If any reader is interested in learning more about clustering trees or chain classifiers, these articles should cover the basics:

I am a third year undergraduate student at the Faculty of Computer and Information Science in Ljubljana and my project will be mentored by prof. dr. Blaž Zupan. I thank him and the rest of the Orange team for advice and support.

This year five students participate in Google Summer of Code

This year five students have been accepted to participate in Google Summer of Code and contribute to Orange in their summer time. Congratulations!

  • Amela – Widgets for statistics
  • Andrej T. – Computer vision add-on for Orange
  • CoderWilliam – A Fully-Featured Neural Network Library Implementation Based On the Flood Library with Extension for Deep Learning
  • Makarov Dmitry – Text mining add-on for Orange
  • Miran Levar – Multi-Target Learning for Orange

Overall, 1,212 students have been accepted this year to various open source organizations from all around the world.

Orange is again participating in GSoC

This year Orange is again participating in Google Summer of Code as a mentoring organization. The student proposal submission period is running and the deadline is on 6th April. We have prepared a page on our Trac with more information about the Google Summer of Code program, especially on how interested students should apply with their proposals. There is also a list of some ideas we are proposing for this year, but feel free to suggest your own ideas on how you could contribute to Orange and make it even better.

This year’s poster:

GSoC 2012 english poster

Google Summer of Code is a Google-sponsored program in which Google stipends students working over the summer on open source projects from all around the world. Each student is paid $5000 (and gets a t-shirt!) for approximately two months of work on and contribution to the project. More about the program is available on its homepage.