Association Rules in Orange

Orange is welcoming back one of its more exciting add-ons: Associate! Association rules can help the user quickly and simply discover the underlying relationships and connections between data instances. Yeah!

The add-on currently has two widgets: one for Association Rules and the other for Frequent Itemsets. With Frequent Itemsets we first check the frequency of items and itemsets in our transaction matrix. This tells us which items (products) and itemsets are the most frequent in our data, so it makes a lot of sense to focus on these products. Let’s use this widget on the real Foodmart 2000 data set.

First let’s check our data set. We have 62,560 instances with 102 features. That’s a whole lot of transactions. Now we connect Frequent Itemsets to our File widget and observe the results. We went with quite a low minimal support due to the large number of transactions.
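To make "minimal support" concrete, here is a small brute-force sketch that counts how often each item and item pair occurs and keeps only those above a threshold. It is an illustration only and not how the widget is implemented; the function name, the `max_size` parameter and the toy baskets are assumptions made up for this example, not taken from Foodmart 2000.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: relative support} for itemsets meeting min_support."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates, fix the item order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Toy baskets, invented for illustration.
baskets = [
    ["fresh vegetables", "fresh fruit", "soda"],
    ["fresh vegetables", "fresh fruit"],
    ["fresh vegetables", "wine"],
    ["soda", "chocolate candy"],
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(itemset, support)
```

Lowering `min_support` lets rarer itemsets through, which is why a large data set with many different products usually calls for a low threshold.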

Collapse All will display the most frequent items, so these will be our most important products (‘bestsellers’). Our clients seem to be buying a whole lot of fresh vegetables and fresh fruit. Call your marketing department – you could become the ultimate place to buy fruits and veggies from.

If there’s a little arrow on the left side of the item, you can expand it to see all the other items connected to the selected attribute. So if a person buys fresh vegetables, they are most likely to buy fresh fruit as an accompanying product group. Now you can explore frequent itemsets to understand what really sells in your store.

Ok. Now how about some transaction flows? We’re mostly interested in the action-consequence relationship here. In other words, if a person buys one item, what is the most likely second item she will buy? Association Rules will help us discover that.

Our parameters will again be adjusted for our data set. We probably want low support, since it will be hard to find a few prevailing rules for 62,000+ transactions. However, you want the discovered rules to be true most of the time, so increase the confidence.
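As a rough sketch of how the two thresholds work together, the snippet below builds single-item rules X -> Y from a handful of baskets and keeps only those that pass both a minimal support and a minimal confidence. The function name and the toy baskets are hypothetical, and real rules (like the ones the widget finds) can have several items in the antecedent.

```python
from collections import Counter
from itertools import permutations


def pair_rules(transactions, min_support, min_confidence):
    """Return (x, y, support, confidence) for single-item rules x -> y."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = set(basket)
        item_counts.update(items)
        # Ordered pairs, so both x -> y and y -> x are counted.
        pair_counts.update(permutations(sorted(items), 2))
    rules = []
    for (x, y), both in pair_counts.items():
        support = both / n                 # share of baskets with x and y
        confidence = both / item_counts[x]  # share of x-baskets that also have y
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    # Most confident rules first, ties broken by support and then by name.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Toy baskets, invented for illustration.
baskets = [
    ["wine", "dried fruit"],
    ["wine", "dried fruit", "soda"],
    ["wine", "soda"],
    ["soda", "chocolate candy"],
    ["soda"],
]
rules = pair_rules(baskets, min_support=0.4, min_confidence=0.6)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Note that soda -> wine is dropped even though it has the same support as wine -> soda: soda appears in many baskets without wine, so the rule is not true often enough.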

The table on the right displays a list of rules with 6 different measures of association rule quality:

  • support: how often a rule is applicable to a given data set (rule/data)
  • confidence: how frequently items in Y (the consequent) appear in transactions that contain X (the antecedent), or in other words how frequently the rule is true (support of the rule/support of antecedent)
  • coverage: how often the antecedent is found in the data set (support of antecedent/data)
  • strength: how often the consequent occurs relative to the antecedent (support of consequent/support of antecedent)
  • lift: how much more often the rule is true than we would expect if the antecedent and the consequent were independent (data * confidence/support of consequent)
  • leverage: the difference between how often the two itemsets appear together in a transaction and how often they would appear together if they were independent ((support * data - antecedent support * consequent support)/data²)
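The formulas in the list above can be written out as a short Python sketch. It assumes that "support" in the formulas means an absolute transaction count and that "data" is the total number of transactions; the function name, the argument names and the example counts are made up for illustration.

```python
def rule_measures(n_data, n_antecedent, n_consequent, n_rule):
    """Compute the six rule measures from absolute transaction counts.

    n_data:       total number of transactions
    n_antecedent: transactions containing the antecedent X
    n_consequent: transactions containing the consequent Y
    n_rule:       transactions containing both X and Y
    """
    confidence = n_rule / n_antecedent
    return {
        "support": n_rule / n_data,
        "confidence": confidence,
        "coverage": n_antecedent / n_data,
        "strength": n_consequent / n_antecedent,
        "lift": n_data * confidence / n_consequent,
        "leverage": (n_rule * n_data - n_antecedent * n_consequent) / n_data**2,
    }


# Invented counts: 100 transactions, X in 20, Y in 25, both in 10.
measures = rule_measures(n_data=100, n_antecedent=20, n_consequent=25, n_rule=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the lift is 2, meaning Y shows up twice as often in baskets with X as it does overall, and the leverage is positive, which points the same way.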

Orange will rank the rules automatically. Now take a quick look at the rules. How about these two?

fresh vegetables, plastic utensils, deli meats, wine –> dried fruit

fresh vegetables, plastic utensils, bologna, soda –> chocolate candy

These seem to be picnickers, clients who don’t want to spend a whole lot of time preparing their food. The first group is probably more gourmet, while the second seems to enjoy sweets. A logical step would be to place dried fruit closer to the wine section and the candy bars closer to sodas. What do you say? This already happened in your local supermarket? Coincidence? I don’t think so. 🙂

Association rules are a powerful way to improve your business by organizing your actual or online store, adjusting marketing strategies to target suitable groups, providing product recommendations and generally understanding your client base better. Just another way Orange can be used as a business intelligence tool!

Univariate GSoC Success

The Google Summer of Code application period has come to an end. We’ve received 34 applications, some of which were of truly high quality. Now it’s up to us to select the top performing candidates, but before that we wanted to get an overview of the candidate pool. We gathered the data from our Google Form application and took a quick look at it in Orange.

First, we needed to preprocess the data a bit, since it came in a messy form of strings. Feature Constructor to the rescue! We wanted to extract the OS usage across users, so we first made three new variables named ‘uses linux’, ‘uses windows’ and ‘uses osx’ to represent our three new columns. For each column we searched through ‘OS_of_choice_and_why’: we looked up the value of the column, converted it to a string, put the string in lowercase and looked for mentions of ‘linux’, ‘windows’ or ‘osx’. And voilà: if a mention occurred in the string, we marked the column with 1, else with 0.

The expression is just a logical statement in Python and works with booleans (0 if False and 1 if True):

'linux' in str(OS_of_choice_and_why_.value).lower() or 'ubuntu' in str(OS_of_choice_and_why_.value).lower()
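Outside of Orange, the same check can be tried on plain Python strings. This is only a sketch of the logic: the helper name and the sample answers are invented, and inside Feature Constructor you would still write the one-line expression above.

```python
def mentions_any(answer, keywords):
    """Return 1 if any keyword occurs in the lowercased answer, else 0."""
    text = str(answer).lower()
    return int(any(keyword in text for keyword in keywords))


# Made-up answers standing in for the 'OS_of_choice_and_why' column.
answers = [
    "Ubuntu, because it just works",
    "Windows for gaming",
    "OSX",
    None,  # a missing answer becomes the string 'none' and matches nothing
]
uses_linux = [mentions_any(a, ("linux", "ubuntu")) for a in answers]
uses_windows = [mentions_any(a, ("windows",)) for a in answers]
print(uses_linux)
print(uses_windows)
```

Converting to `str` first is what keeps the expression from failing on missing or non-text values.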

 

Another thing we might want to do is create three discrete values for the “Dogs or cats” question. We want Orange to display ‘dogs’ for someone who replied ‘dogs’, ‘cats’ for someone who replied ‘cats’ and ‘?’ if the answer was blank or very creative (we had people who wanted to be elephants and butterflies 🙂 ).

To create three discrete values you would write:

0 if 'dogs' in str(Dogs_or_cats_.value).lower() else 1 if 'cats' in str(Dogs_or_cats_.value).lower() else 2

Since we have three values, we need to assign them the corresponding indexes. So if there is ‘dogs’ in the reply, we would get 0 (which we converted to ‘dogs’ in the Feature Constructor’s ‘Values’ box), 1 if there’s ‘cats’ in the reply and 2 if none of the above apply.
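Here is a small stand-alone sketch of how the indexes map to the displayed values. The `LABELS` list plays the role of the ‘Values’ box in Feature Constructor; the names and the sample replies are assumptions for illustration.

```python
# Position in this list = index returned by the expression.
LABELS = ["dogs", "cats", "?"]


def pet_index(answer):
    """Return 0 for 'dogs', 1 for 'cats' and 2 for anything else."""
    text = str(answer).lower()
    return 0 if "dogs" in text else 1 if "cats" in text else 2


# Made-up replies to the "Dogs or cats" question.
replies = ["Dogs!", "cats, obviously", "elephants", ""]
labels = [LABELS[pet_index(reply)] for reply in replies]
print(labels)
```

Because ‘dogs’ is checked first, a reply that mentions both animals ends up as ‘dogs’.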

Ok, the next step was to sift through a big pile of attributes. We removed personal information for privacy reasons and selected the attributes we cared about the most, for example programming skills, years of experience, contributions to OSS and of course whether someone is a dog or a cat person. 🙂 Select Columns takes care of this. Here you can download a mock-up workflow (same as above, but without sensitive data).

Now for some lovely charts. Enjoy!

Python is our lingua franca, experts wanted!

20 years of programming experience? Hello outlier!

OSS all the way!

Some people love dogs and some love cats. Others prefer elephants and butterflies.

Version 3.3.1 – Updates and Features

About a week ago we issued an updated stable release of Orange, version 3.3.1. We’ve introduced some new functionalities and improved a few old ones.

Here’s what’s new in this release:

1. New widgets: Distance Matrix for visualizing distance measures in a matrix, Distance Transformation for normalization and inversion of a distance matrix, Save Distance Matrix and Distance File for saving and loading distances. Last week we also mentioned a really amazing Silhouette Plot, which helps you visually assess cluster quality.

2. Orange can now read datetime variables in its Time Variable format.

3. Rank outputs scores for each scoring method.

4. A report function has been added to the Linear Regression, Univariate Regression, Stochastic Gradient Descent and Distance Transformation widgets.

5. The FCBF algorithm has been added to Rank for feature scoring, and ReliefF now supports missing target values.

6. Graphs in Classification Tree Viewer can be saved in .dot format.

 

You can view the entire changelog here. 🙂 Enjoy the improvements!