Association Rules in Orange

Orange is welcoming back one of its more exciting add-ons: Associate! Association rules can help the user quickly and simply discover the underlying relationships and connections between data instances. Yeah!

 

The add-on currently has two widgets: one for Association Rules and the other for Frequent Itemsets. With Frequent Itemsets we first check the frequency of items and itemsets in our transaction matrix. This tells us which items (products) and itemsets are the most frequent in our data, so it makes a lot of sense to focus on these products. Let's use this widget on the real Foodmart 2000 data set.

blog5

 

First let's check our data set. We have 62560 instances with 102 features. That's a whole lot of transactions. Now we connect Frequent Itemsets to our File widget and observe the results. We went with quite a low minimal support due to the large number of transactions.
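If you prefer scripting to widgets, the same first step can be sketched with the add-on's fpgrowth module. This is only a minimal sketch: it assumes the frequent_itemsets function from orangecontrib.associate.fpgrowth and a boolean transaction matrix, so check the add-on documentation for the exact signature.

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets  # assumed location/signature; see the add-on docs

# toy transaction matrix: rows are baskets, columns are products (True = product in basket)
X = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]], dtype=bool)

# keep itemsets present in at least 40% of transactions (use a much lower threshold on Foodmart)
for itemset, support_count in frequent_itemsets(X, 0.4):
    print(itemset, support_count)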

Collapse All will display the most frequent items, so these will be our most important products (‘bestsellers’). Our clients seem to be buying a whole lot of fresh vegetables and fresh fruit. Call your marketing department – you could become the ultimate place to buy fruits and veggies from.

blog2

 

If there's a little arrow on the left side of an item, you can expand it to see all the other items connected to the selected attribute. So if a person buys fresh vegetables, they are most likely to buy fresh fruit as an accompanying product group. Now you can explore frequent itemsets to understand what really sells in your store.

blog3

 

Ok. Now how about some transaction flows? We’re mostly interested in the action-consequence relationship here. In other words, if a person buys one item, what is the most likely second item she will buy? Association Rules will help us discover that.

 

Our parameters will again be adjusted for our data set. We probably want low support, since it will be hard to find a few prevailing rules for 62,000+ transactions. However, we want the discovered rules to be true most of the time, so we increase the confidence.

blog1

 

The table on the right displays a list of rules with 6 different measures of association rule quality (a small worked computation of these measures follows the list below):

  • support: how often a rule is applicable to a given data set (rule/data)
  • confidence: how frequently items in Y appear in transactions with X, or in other words, how frequently the rule is true (support of rule/support of antecedent)
  • coverage: how often the antecedent is found in the data set (support of antecedent/data)
  • strength: (support of consequent/support of antecedent)
  • lift: how frequently a rule is true per consequent item (data × confidence/support of consequent)
  • leverage: the difference between how often the two items appear together in a transaction and how often they would be expected to appear if they were independent ((support × data – support of antecedent × support of consequent)/data²)
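Here is the promised toy computation of all six measures for a single rule, fresh vegetables → fresh fruit, on a handful of made-up baskets (plain Python, no Orange required):

# toy transactions; each basket is a set of products (not the Foodmart data)
baskets = [
    {"fresh vegetables", "fresh fruit", "soda"},
    {"fresh vegetables", "fresh fruit"},
    {"fresh vegetables", "wine"},
    {"fresh fruit", "soda"},
    {"fresh vegetables", "fresh fruit", "wine"},
]

antecedent, consequent = {"fresh vegetables"}, {"fresh fruit"}
n = len(baskets)

n_antecedent = sum(antecedent <= b for b in baskets)           # baskets containing X
n_consequent = sum(consequent <= b for b in baskets)           # baskets containing Y
n_rule = sum((antecedent | consequent) <= b for b in baskets)  # baskets containing both X and Y

support = n_rule / n
confidence = n_rule / n_antecedent
coverage = n_antecedent / n
strength = n_consequent / n_antecedent
lift = n * confidence / n_consequent
leverage = (n_rule * n - n_antecedent * n_consequent) / n ** 2

print(f"support={support:.2f} confidence={confidence:.2f} coverage={coverage:.2f}")
print(f"strength={strength:.2f} lift={lift:.2f} leverage={leverage:.2f}")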

 

Orange will rank the rules automatically. Now take a quick look at the rules. How about these two?

fresh vegetables, plastic utensils, deli meats, wine –> dried fruit

fresh vegetables, plastic utensils, bologna, soda –> chocolate candy

These seem to be picnickers, clients who don't want to spend a whole lot of time preparing their food. The first group is probably more gourmet, while the second seems to enjoy sweets. A logical step would be to place dried fruit closer to the wine section and the candy bars closer to the sodas. What do you say? This already happened in your local supermarket? Coincidence? I don't think so. 🙂

blog6

 

Association rules are a powerful way to improve your business by organizing your physical or online store, adjusting marketing strategies to target suitable groups, providing product recommendations and generally understanding your client base better. Just another way Orange can be used as a business intelligence tool!

 

 

Univariate GSoC Success

The Google Summer of Code application period has come to an end. We received 34 applications, some of which were of truly high quality. Now it's up to us to select the top candidates, but before that we wanted to get an overview of the candidate pool. We gathered the data from our Google Form application and gave it a quick look in Orange.

First, we needed to preprocess the data a bit, since it came in a messy form of strings. Feature Constructor to the rescue! We wanted to extract OS usage across applicants, so we first made three new variables named 'uses linux', 'uses windows' and 'uses osx' to represent three new columns. For each column we searched through 'OS_of_choice_and_why', took the value of the column, converted it to a string, put the string in lowercase and looked for mentions of 'linux', 'windows' or 'osx', and voilà: if a mention occurred in the string, we marked the column with 1, else with 0.

 

blog10

The expression is just a logical statement in Python; its result is a boolean, which is stored as 0 (False) or 1 (True):

'linux' in str(OS_of_choice_and_why_.value).lower() or 'ubuntu' in str(OS_of_choice_and_why_.value).lower()

 

Another thing we might want to do is create three discrete values for the "Dogs or cats" question. We want Orange to display 'dogs' for someone who replied 'dogs', 'cats' for someone who replied 'cats' and '?' if the question was left blank or the answer was very creative (we had people who wanted to be elephants and butterflies 🙂).

To create three discrete values you would write:

0 if 'dogs' in str(Dogs_or_cats_.value).lower() else 1 if  'cats' in str(Dogs_or_cats_.value).lower() else 2

Since we have three values, we need to assign them the corresponding indexes. So if there is ‘dogs’ in the reply, we would get 0 (which we converted to ‘dogs’ in the Feature Constructor’s ‘Values’ box), 1 if there’s ‘cats’ in the reply and 2 if none of the above apply.
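Outside the Feature Constructor, the same logic is ordinary Python. A tiny sketch on a few made-up replies (the answers below are invented for illustration, not taken from the actual applications):

# hypothetical replies, only for illustration
replies = ["Dogs, definitely", "cats", "I would rather be an elephant", ""]

values = ["dogs", "cats", "?"]          # index 2 maps to the unknown value '?'
for reply in replies:
    index = 0 if "dogs" in reply.lower() else 1 if "cats" in reply.lower() else 2
    print(repr(reply), "->", values[index])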

blog9

Ok, the next step was to sift through a big pile of attributes. We removed personal information for privacy reasons and selected the attributes we cared about the most, for example programming skills, years of experience, contributions to OSS and, of course, whether someone is a dog or a cat person. 🙂 Select Columns sorts this out. Here you can download a mock-up workflow (same as above, but without sensitive data).

Now for some lovely charts. Enjoy!

blog5
Python is our lingua franca, experts wanted!

 

blog8
20 years of programming experience? Hello outlier!

 

blog2
OSS all the way!

 

blog3
Some people love dogs and some love cats. Others prefer elephants and butterflies.

 

 

Version 3.3.1 – Updates and Features

About a week ago we issued an updated stable release of Orange, version 3.3.1. We’ve introduced some new functionalities and improved a few old ones.

Here’s what’s new in this release:

1. New widgets: Distance Matrix for visualizing distance measures in a matrix, Distance Transformation for normalization and inversion of distance matrix, Save Distance Matrix and Distance File for saving and loading distances. Last week we also mentioned a really amazing Silhouette Plot, which helps you visually assess cluster quality.

blog11

 

2. Orange can now read datetime variables in its Time Variable format.

blog12

 

3. Rank outputs scores for each scoring method.

blog13

 

4. The Report function has been added to the Linear Regression, Univariate Regression, Stochastic Gradient Descent and Distance Transformation widgets.

blog14

 

5. FCBF algorithm has been added to Rank for feature scoring and ReliefF now supports missing target values.

6. Graphs in Classification Tree Viewer can be saved in .dot format.

 

You can view the entire changelog here. 🙂 Enjoy the improvements!

All I see is Silhouette

Silhouette plot is such a nice method for visually assessing cluster quality and the degree of cluster membership that we simply couldn't wait to get it into Orange3. And now we have.

For each data instance, the silhouette compares the average distance to instances in its own cluster with the average distance to instances in the nearest other cluster: s = (b − a) / max(a, b), where a is the mean intra-cluster distance and b the mean distance to the nearest cluster. A silhouette close to 1 indicates that the data instance lies well within its cluster, while instances with silhouette scores close to 0 are on the border between two clusters. Overall clustering quality can be assessed by the average silhouette score of the data instances, but here we are more interested in the individual silhouettes and their visualization in the silhouette plot.

Using the good old Iris data set, we are going to assess the silhouettes for each of the data instances. In k-Means we set the number of clusters to 3 and send the data to Silhouette Plot. Good clusters should include instances with high silhouette scores, but we're doing the opposite: in Orange, we select instances with scores close to 0 from the silhouette plot and pass them to other widgets for exploration. No surprise, they lie on the periphery of two clusters, which is nicely demonstrated in the scatter plot.
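If you would rather reproduce this in code, here is a rough scripting analogue with scikit-learn standing in for the Orange widgets (a sketch; the 0.2 threshold for 'close to 0' is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_samples(X, labels)   # one silhouette score per instance
borderline = sil < 0.2                # arbitrary threshold for 'close to 0'
print(f"{borderline.sum()} instances sit on the border between clusters")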

silhouette4

Let’s do something wild now. We’ll use the silhouette on a class attribute of Iris (no clustering here, just using the original class values from the data set). Here is our hypothesis: the data instances with low silhouette values are also those that will be misclassified by some learning algorithm. Say, by a random forest.

silhouette1

We will use ten-fold cross validation in Test & Score, send the evaluation results to Confusion Matrix and select the misclassified instances in that widget. Then we will explore, in the Venn Diagram, how many of these misclassifications fall into the set of low-silhouette instances. The agreement (i.e. the intersection in the Venn diagram) between the two techniques is quite high.
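The same comparison can be sketched in code, again with scikit-learn standing in for the Orange widgets (an illustration of the idea, not the exact workflow above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_samples
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# silhouette computed on the original class labels instead of cluster labels
low_silhouette = silhouette_samples(X, y) < 0.2

# misclassifications from ten-fold cross-validation with a random forest
predicted = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=10)
misclassified = predicted != y

overlap = np.sum(low_silhouette & misclassified)
print(f"{overlap} of {misclassified.sum()} misclassified instances also have low silhouettes")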

silhouette3

Finally, we can observe these instances in the Scatter Plot. Classifiers indeed have problems with borderline data instances. Our hypothesis was correct.

silhouette4

Silhouette plot is yet another one of the great visualizations that can help you with data analysis or with understanding certain machine learning concepts. What did we say? Fruitful and fun!

 

 

Overfitting and Regularization

A week ago I used Orange to explain the effects of regularization. This was the second lecture in the Data Mining class; the first one was on linear regression. My introduction to the benefits of regularization used a simple data set with a single input attribute and a continuous class. I drew a data set in Orange and then used the Polynomial Regression widget (from the Prototypes add-on) to plot the linear fit. This widget can also expand the data set by adding columns with powers of the original attribute x, thereby augmenting the training set with x^p, where p is an integer going from 2 to K. The polynomial expansion of the data set allows the linear regression model to fit the data nicely, and with higher K to overfit it to the extreme, especially if the number of data points in the training set is low.

poly-overfit

We have already blogged about this experiment a while ago, showing that it is easy to see that the linear regression coefficients blow out of proportion with increasing K. This leads to the idea that linear regression should not only minimize the squared error when predicting the value of the dependent variable on the training set, but also keep the model coefficients low, or better, penalize any high value of the coefficients. This procedure is called regularization. Based on the type of penalty (sum of absolute values of the coefficients or sum of their squares), the regularization is referred to as L1 or L2, also known as lasso and ridge regression.
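In textbook notation (this is the standard formulation, not something specific to Orange), the two regularized objectives minimize the squared error plus a penalty on the coefficients w:

\min_{w} \sum_i (y_i - w^\top x_i)^2 + \alpha \sum_j |w_j| \quad \text{(L1, lasso)}

\min_{w} \sum_i (y_i - w^\top x_i)^2 + \alpha \sum_j w_j^2 \quad \text{(L2, ridge)}

The larger the regularization strength α, the more the coefficients are pushed towards zero.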

It is quite easy to play with regularized models in Orange by attaching a Linear Regression widget to Polynomial Regression, thereby substituting the default model used in Polynomial Regression with the one designed in the Linear Regression widget. This makes different kinds of regularization available. The workflow can be used to show that regularized models overfit the data less, and that the degree of overfitting depends on the regularization strength, which governs the size of the penalty stemming from the values of the coefficients of the linear model.

poly-l2

I also use this workflow to show the difference between L1 and L2 regularization. The change of the type of regularization is most pronounced in the table of coefficients (Data Table widget), where with L1 regularization it is clear that the procedure sets many of the coefficients to 0. Try this with a high degree of polynomial expansion and a data set with about 10 data points. Also, try changing the regularization strength (Linear Regression widget).
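A quick way to see the same effect in code, with scikit-learn standing in for the Linear Regression widget (a sketch with made-up one-dimensional data):

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 10)).reshape(-1, 1)              # about 10 data points, as suggested above
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 10)

X = PolynomialFeatures(degree=9).fit_transform(x)              # high-degree polynomial expansion

for name, model in [("no regularization", LinearRegression()),
                    ("L2 (ridge)", Ridge(alpha=0.01)),
                    ("L1 (lasso)", Lasso(alpha=0.01, max_iter=100000))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))                      # lasso sets many coefficients to 0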

poly-l1

While the effects of overfitting and regularization are nicely visible in the plot in the Polynomial Regression widget, machine learning models are really about predictions, and the quality of predictions should be estimated on an independent test set. So at this stage of the lecture I needed to introduce model scoring, that is, a measure that tells me how well a model inferred on the training set performs on the test set. For simplicity, I chose to introduce the root mean squared error (RMSE) and then crafted the following workflow.

poly-evaluate

Here, I draw the data set (Paint Data, about 20 data instances), assign y as the target variable (Select Columns), split the data into training and test sets of approximately equal sizes (Data Sampler), and pass the training and test data and the linear model to the Test & Score widget. Then I can use linear regression with no regularization and observe how the RMSE changes with the degree of the polynomial. I can alternate between Test on train data and Test on test data (Test & Score widget). In the class I used the blackboard to record this dependency. For the data from the figure, I got the following table:

Poly K    RMSE (train)    RMSE (test)
0         0.147           0.138
1         0.155           0.192
2         0.049           0.063
3         0.049           0.063
4         0.049           0.067
5         0.040           0.408
6         0.040           0.574
7         0.033           2.681
8         0.001           5.734
9         0.000           4.776
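A rough scripting analogue of this experiment, with synthetic data in place of the painted points (so the numbers will differ from the table above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = rng.uniform(0, 1, 20).reshape(-1, 1)                       # about 20 'painted' points
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)

for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    rmse_train = mean_squared_error(y_train, model.predict(x_train)) ** 0.5
    rmse_test = mean_squared_error(y_test, model.predict(x_test)) ** 0.5
    print(f"K={degree}  train RMSE={rmse_train:.3f}  test RMSE={rmse_test:.3f}")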

That's it. For a class of computer scientists one could do all of this with scripting, but for any other audience, or for any introductory lesson, explaining regularization with Orange widgets is a lot of fun.

Orange at Google Summer of Code 2016

The Orange team is extremely excited to be a part of this year's Google Summer of Code! GSoC is a great opportunity for students around the world to spend their summer contributing to open-source software, gaining experience and earning money.

Accepted students will help us develop Orange (or another chosen OSS project) from May to August. Each student is expected to select and define a project of his/her interest and will be assigned a mentor to guide him/her through the entire process.

 

Apply here:

https://summerofcode.withgoogle.com/

Orange’s project proposals (we accept your own ideas as well!):

https://github.com/biolab/orange3/wiki/GSoC-2016

Our GSoC community forum:

https://groups.google.com/forum/#!forum/orange-gsoc

 

Spread the word! (and don’t forget to apply 😉 )

 


Getting Started Series: Part Two

We’ve recently published two more videos in our Getting Started with Orange series. The series is intended to introduce beginners to Orange and teach them how to use its components.

 

The first video explains how to do hierarchical clustering and select interesting subsets directly in Orange:

 

while the second video introduces classification trees and predictive modelling:

 

The seventh video in the series will address how to score classification and regression models with different evaluation methods. The fruits and vegetables data set can be found here.

 

If you have an idea what you’d like to see in the upcoming videos, let us know!

Tips and tricks for data preparation

Probably the most crucial step in your data analysis is purging and cleaning your data. Here are a couple of cool tricks that will make your data preparation a bit easier.

 

1. Use a smart text editor. We can recommend Sublime Text, as it is an extremely versatile editor that supports a broad variety of programming languages and markups, but there are other great tools out there as well. One of the features you'll keep coming back to in your editor is the 'Replace' function, which allows you to replace specified values with different ones. You can also use regular expressions to easily find and replace parts of text.

editing data with Sublime
We can replace all instances of ‘male’ with ‘man’ in one click.

 

2. Apply simple hacks. Sometimes, when converting files to different formats, the data can get some background information appended that you cannot properly see. A quick and dirty trick is to manually select the cells and rows and copy-paste them into a new sheet. You will start with a clean slate and your data will be read properly.

 

3. Check your settings. When reading .csv files in Excel, you might see all your data squished into one column and literally separated with commas. This is easily solved with Data –> From Text (Get External Data), which opens a new window. In the Text Import Wizard you can set whether your data is delimited or not (in our case it is), how it is delimited (comma, tab, etc.), whether you have a header or not, what qualifies as text (" is a recommended option), what your encoding is and so on.

 

4. Manually annotate the data. Orange loves headers and the easiest way to make sure your data gets read properly is to set the header yourself. Add two extra rows under your feature names: in the first one, set the variable type and in the second one, its role. Here's how to do it properly; a small example of such a header is sketched below.
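As an illustration (a made-up data set; columns are tab-separated, and Orange's documentation on the tab-delimited format lists all the allowed type and role keywords), the three header rows could look like this:

name      age           gender      species
string    continuous    discrete    discrete
meta                                class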

 

5. Exploit the widgets in Orange. Select Columns is your go-to widget for organizing what gets read as a meta attribute, what your class variable is and which features you want to use in the analysis. Another great widget is Edit Domain, where you can set how the values are displayed in the analysis (say you have "grey" in your data, but you want it to say "gray"). Moreover, you can use the Concatenate and Merge Data widgets to put your data together.

setting the domain
Set domain with Edit domain widget.

 

What’s your preferred trick?

Making Predictions

One of the cool things about being a data scientist is being able to predict. That is, predict before we know the actual outcome. I am not talking about verifying your favorite classification algorithm here, and I am not talking about cross-validation or classification accuracies or AUC or anything like that. I am talking about the good old prediction. This is where our very own Predictions widget comes to help.

predictive analytics
Predictions workflow.

 

We will be exploring the Iris data set again, but we're going to add a little twist. Since we've worked with it so much already, I'm sure you know all about this data. But now we've got three new flowers in the office and, of course, there's no label attached to tell us which species of Iris these flowers are. [sigh….] Obviously, we will be measuring petals and sepals and contrasting the results with our data.

predictive analytics
Our new data on three flowers. We have used Google Sheets to enter the data, then copied the shareable link and pasted it into the File widget.

 

But surely you don't want to go through all 150 flowers to properly match the three new Irises? So instead, let's first train a model on the existing data set. We connect the File widget to the chosen classifier (we went with Classification Tree this time) and feed the results into Predictions. Now we write down the measurements for our new flowers into Google Sheets (just like above), load them into Orange with a new File widget and input the fresh data into Predictions. We can observe the predicted class directly in the widget itself.
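The scripting version of this step takes only a few lines. Here is a sketch with scikit-learn's decision tree and invented measurements for the three new flowers (the numbers are made up for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# sepal length, sepal width, petal length, petal width -- invented measurements
new_flowers = [[5.1, 3.4, 1.4, 0.2],
               [6.0, 2.8, 4.4, 1.3],
               [6.8, 3.1, 5.9, 2.2]]

for prediction in tree.predict(new_flowers):
    print(iris.target_names[prediction])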

predictive analytics
Predictions made by classification tree.

 

In the left part of the visualization we have the input data set (our measurements) and in the right part the predictions made by the classification tree. By default you see the probabilities for all three class values and the predicted class. You can of course use other classifiers as well – it would probably make sense to first evaluate the classifiers on the existing data set, find the best one for your data and then use it on the new data.
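For that last step, here is a small sketch of comparing a few candidate classifiers with ten-fold cross-validation (scikit-learn again, as an illustration of the idea rather than the Orange workflow):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {"classification tree": DecisionTreeClassifier(random_state=0),
              "k-nearest neighbours": KNeighborsClassifier(),
              "logistic regression": LogisticRegression(max_iter=1000)}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f}")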

 

Orange YouTube Tutorials

It's been a long time coming, but we've finally created our first set of YouTube tutorials. In the series 'Getting Started with Orange' we walk through our software step by step. You will learn how to create a workflow, load your data in different formats, and visualize and explore the data. These tutorials are meant for complete beginners in both Orange and data mining and come with some handy tricks that will make using Orange very easy. Below are the first three videos from this series; more are coming in the following weeks.

 

 

We are also preparing a series called ‘Data Science with Orange’, which will take you on a journey through the world of data mining and machine learning by explaining predictive modeling, classification, regression, model evaluation and much more.

Feel free to let us know what tutorials you'd like to see and we'll do our best to include them in one of the two series. 🙂