Explaining Kickstarter Success

On Kickstarter, most app ideas don't get funded. But why is that? When looking for possible explanations, it is easy to ascribe the failure to the type of the idea.

But what about those rare cases where an app idea does get funded? Can we figure out why a particular idea succeeded? Our new widget Explain Predictions can do just that – explain why an idea will succeed. Or at least, explain why the classifier thinks it will.

First, let us load the Kickstarter data from the Datasets widget and inspect it in a Data Table.

Select the data instance you wish to explore in a Data Table.

Now, let’s see why the app Create Games & Apps Without Any Coding got funded.

Explain Predictions needs three inputs: our data set, a classifier, and the data sample we wish to inspect. Connect the Datasets widget with Explain Predictions. Then add the classifier, say, Logistic Regression. Finally, select Create Games & Apps Without Any Coding in the Data Table and connect it to the widget.

Explain Predictions needs three inputs.

The highest-ranking attributes are those that contributed the most (high Score value). The fact that there were 11 pledge levels, 13 images, many connections to other projects, and the length of the project description – all of these attributes add something positive towards funding. On the other hand, we see how the duration of the project, the description length, the maximal pledge tiers, and the type of the idea work against the decision to fund the project. Lastly, not having a Facebook page or a video contributes almost nothing to the final prediction.

A high score means the attribute contributed positively to the final decision (Funded: yes), while low scores contributed negatively.

When explaining the decision of the classifier, we look at the values of the attributes for our sample and how they interact. We do that by approximating Shapley values, since calculating them exactly would sometimes take more than a lifetime. That means customized explanations for every individual case, while treating the classifier as a black box. You could do the same for any model Orange offers, including Neural Networks!
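
The idea behind the sampling is roughly the following (a minimal sketch, not the widget's actual code; predict stands for a hypothetical function returning the probability of Funded: yes for a single instance, and background is a matrix of reference data):

    import numpy as np

    def shapley_approx(predict, x, background, feature, n_samples=1000, seed=0):
        # Monte Carlo estimate of one feature's contribution to predict(x).
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_samples):
            z = background[rng.integers(len(background))]  # random reference instance
            order = rng.permutation(len(x))                # random feature ordering
            pos = int(np.where(order == feature)[0][0])
            with_f = z.copy()
            without_f = z.copy()
            with_f[order[:pos + 1]] = x[order[:pos + 1]]   # x's values up to and incl. the feature
            without_f[order[:pos]] = x[order[:pos]]        # x's values up to, excl. the feature
            total += predict(with_f) - predict(without_f)
        return total / n_samples

Averaging the difference between the prediction with and without the feature, over many random orderings and reference instances, approximates that feature's Shapley contribution.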

And there you have it, an easy way to learn what makes your Kickstarter campaign succeed, a cell be classified as healthy, or a bank loan approved.

Neural Network is Back!

We know you’ve missed it. We’ve been getting many requests to bring back the Neural Network widget, but we also had many reservations about it.

Neural networks are powerful and great, but doing them right is not straightforward. And doing them right in the context of a GUI-based visual programming tool like Orange is a twisted double helix of a roller coaster.

Do we make each layer a widget and then stack them? Do we use parallel processing or try to do something server-side? Theano or Keras? TensorFlow perhaps?

We were so determined to do things properly, that after the n-th iteration we still had no clue what to actually do.

Then one day a silly novice programmer (a.k.a. me) had enough and just threw scikit-learn’s Multi-layer Perceptron model into a widget and called it a day. There you go. A Neural Network widget just like it was in Orange2 – a wrapper for a scikit-learn function that works out of the box. Nothing fancy, nothing powerful, but it does its job. It models things and it predicts things.
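
In essence, the widget wraps something like this (a minimal sketch using scikit-learn directly; the parameters shown are illustrative, not necessarily the widget's defaults):

    from sklearn.datasets import load_iris
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)

    # A multi-layer perceptron with one hidden layer of 100 neurons.
    model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
    model.fit(X, y)
    print(model.predict(X[:5]))  # predicted classes for the first five instances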

Just like that:

Have fun with the new widget!

Visualization of Classification Probabilities

This is a guest blog from the Google Summer of Code project.

The Polynomial Classification widget is implemented as a part of my Google Summer of Code project, along with other widgets in the educational add-on (see my previous blog). It visualizes probabilities for two-class classification (target vs. rest) using a color gradient and contour lines, and it can do so for any Orange learner.

Here is an example workflow. The data comes from the File widget. With no learner on the input, the default is Logistic Regression. The widget outputs Coefficients, Classifier (model), and Learner.

poly-classification-flow

The Polynomial Classification widget works on two continuous features only; all other features are ignored. The screenshot shows a plot of classification for the Iris data set.

polynomial-classification-1-stamped

  1. Set the name of the learner. This is the name of the learner on the output.
  2. Set the features that logistic regression is performed on.
  3. Set the class that is classified separately from the other classes.
  4. Set the degree of the polynomial used to transform the input data (1 means the attributes are not transformed).
  5. Select whether to show contour lines in the chart. The density of contours is regulated by Contour step.

The classification for our case fails to separate Iris-versicolor from the other two classes. This is because logistic regression is a linear classifier, and there is no linear combination of the chosen two attributes that would make for a good decision boundary. We can change that. Polynomial expansion adds features that are polynomial combinations of the original ones. For example, if the input data contains features [a, b], polynomial expansion of degree two generates the feature space [1, a, b, a², ab, b²]. With this expansion, the classification boundary looks great.
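
Outside of Orange, the same expansion can be tried with scikit-learn's PolynomialFeatures (a quick sketch of what degree-two expansion does to a single sample):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0]])  # one sample with features [a, b]
    print(PolynomialFeatures(degree=2).fit_transform(X))
    # -> [[1. 2. 3. 4. 6. 9.]], i.e. [1, a, b, a^2, ab, b^2]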

polynomial-classification-2

Polynomial Classification also works well with other learners. Below we have given it a Classification Tree. This time we have painted the input data using Paint Data, a great data generator that comes in handy while learning about Orange and data science. The decision boundaries for the tree are all rectangular, a well-known limitation of tree-based learners.

poly-classification-4e

Polynomial expansion with high degrees may be dangerous. The following example shows overfitting with degree five. See the two outliers, a blue one at the top and a red one at the lower right of the plot? The classifier was unnecessarily able to separate the outliers from the pack, something that will become problematic when the classifier is used on new data.
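
To reproduce the effect outside the widget, one can compare training and test accuracy at different degrees (a sketch on synthetic data; the exact numbers will vary, but high degrees tend to chase the noisy points):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Two noisy classes: points inside a circle vs. outside, with a few flipped labels.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)
    y[rng.choice(200, size=10, replace=False)] ^= 1  # the "outliers"

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for degree in (1, 2, 5):
        model = make_pipeline(PolynomialFeatures(degree),
                              LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        print(degree, model.score(X_train, y_train), model.score(X_test, y_test))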

poly-classification-owerfit

Overfitting is one of the central problems in machine learning. You are welcome to read our previous blog on this problem and possible solutions.

Interactive k-Means

This is a guest blog from the Google Summer of Code project.

As a part of my Google Summer of Code project I started developing educational widgets and assembling them in an Educational Add-On for Orange. Educational widgets can be used by students to understand how some key data mining algorithms work, and by teachers to demonstrate the workings of these algorithms.

Here I describe an educational widget for interactive k-means clustering, an algorithm that splits the data into clusters by finding cluster centroids such that the distance between data points and their corresponding centroids is minimized. The number of clusters in the k-means algorithm is denoted by k and has to be specified manually.

The algorithm starts by randomly positioning the centroids in the data space, and then improves their positions by repeating the following two steps (a code sketch follows the list):

  1. Assign each point to the closest centroid.
  2. Move each centroid to the mean position of the points assigned to it.
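
For illustration, here are the two steps in a minimal NumPy sketch (a toy implementation for clarity, not the widget's actual code):

    import numpy as np

    def kmeans(X, k, n_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        # Start with k centroids placed at randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 1: assign each point to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each centroid to the mean of its assigned points.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids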

The widget needs data, which can come from the File widget, and it outputs information on the clusters (Annotated Data) and the centroids:

kmans_shema

The educational widget for k-means finds clusters based on two continuous features only; all other features are ignored. The screenshot shows a plot of the Iris data set and clustering with k=3. That is partially cheating, because we know that the Iris data set has three classes, so we can check whether the clusters correspond well to the original classes:

kmeans2-stamped

  1. Select the two features that are used in k-means.
  2. Set the number of centroids.
  3. Randomize the positions of the centroids.
  4. Show lines between centroids and their corresponding points.
  5. Perform the algorithm step by step. Reassign membership connects points to the nearest centroid; Recompute centroids moves the centroids.
  6. Step back in the algorithm.
  7. Set the speed of automatic stepping.
  8. Perform the whole algorithm as a fast preview.
  9. At any time we can change the number of centroids with the spinner or with a click at the desired position in the graph.

If we want to see the correspondence between the clusters found by k-means and the classes, we can open a Data Table widget, where we see that all Iris-setosa instances are placed in one cluster, and only a few Iris-versicolor instances are clustered together with Iris-virginica and vice versa.

kmeans3

Interactive k-means works great in combination with Paint Data. There, we can design data sets where k-means fails, and observe why.

kmeans-failt

We could also design data sets where k-means fails under a specific initialization of centroids. Ah, I did not tell you that you can freely move the centroids and then restart the algorithm. Below we show a centroid initialization that leads to non-optimal clustering.

kmeans-f-join

Color it!

The holiday season is upon us and even the Orange team is in a festive mood. This is why we made a Color widget!

color1

This fascinating artsy widget will allow you to play with your data set in a new and exciting way. No more dull visualizations and default color schemes! Set your own colors just the way YOU want them! Care for some magical cyan-to-magenta? Or do you prefer a more festive red-to-green? How about several shades of gray? The Color widget is your go-to stop for all things color (did you notice it’s our only widget with a colorful icon?). 🙂

Coloring works with most visualization widgets, such as Scatter Plot, Distributions, Box Plot, Mosaic Display, and Linear Projection. Set the colors for discrete values and the gradients for continuous values in this widget, and the same palettes will be used in all downstream widgets. As a bonus, the Color widget also allows you to edit the names of variables and values.

color6

Remember – the (blue) sky is the limit.

Hubbing with the Hub widget

So you have painted two data sets and loaded another one from a file, and now you are testing predictions of logistic regression, classification trees and SVM on them? Tired of having to reconnect the Paint Data widget and the File widget back and forth whenever you switch between them?

Say no more! Look no further! Here is the new Hub widget!

Multiple file inputs

The Hub widget is the most versatile widget available so far. It accepts several inputs of any type and passes them on to as many other widgets as you want.

The Hub widget treats all types with the strictest equality.

(It also adheres to all applicable EU policies with respect to gender equality, and does not use cookies.)

Diverse widget input

The Hub widget works like a charm and is like the amazing cast-to-void-and-back-to-anything idiom in C. This mighty MacGyver of widgets can (almost) convert a classification tree into data, or a preprocessor into experimental results, without ever touching the data. With its amazing capabilities, the Hub widget has the potential to cause even greater havoc in your workflows than the famous Merge Data widget.

Download, install – and start hubbing today!