Rule Induction (Part I – Scripting)

This is a guest blog from the Google Summer of Code project.


We’ve all heard the saying, “Rules are meant to be broken.” Regardless of how you might feel about the idea, one thing is certain: rules must first be learnt. My 2016 Google Summer of Code project revolves around doing just that. I am developing classification rule induction techniques for Orange, and here I describe the code that is currently available in a pull request and will become part of the official distribution in the upcoming 3.3.8 release.

Rule induction from examples is recognised as a fundamental component of many machine learning systems. My goal was foremost to implement supervised rule induction algorithms and rule-based classification methods, but also to devise a more general framework of replaceable individual components that users can fine-tune to their needs. To this purpose, the separate-and-conquer strategy was applied. In essence, the learner finds the best rule, covers and removes the learning instances to which that rule applies, and repeats the process while examples remain in the learning set. To evaluate candidate hypotheses and to choose the best rule in each iteration, search heuristics are used (primarily, the class distribution of covered examples is the decisive determinant).

The created module is straightforward to use. New rule induction algorithms can easily be introduced by either utilising predefined components or developing new ones (these include various search algorithms, search strategies, evaluators, and others). Several well-known rule induction algorithms have already been included. Let’s see how they perform!

The classic CN2 inducer constructs a list of ordered rules (a decision list). Here, we load the titanic data set and create a simple classifier, which can already be used to predict data.

import Orange
data = Orange.data.Table('titanic')
learner = Orange.classification.CN2Learner()
classifier = learner(data)
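
The classifier can then be called like any other Orange model. A quick illustrative sketch; the three-row slice and the mapping from value indices back to class names are made up for demonstration:

# predict the first three examples; the model returns value indices
predictions = classifier(data[:3])
print([classifier.domain.class_var.values[int(p)] for p in predictions])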

Similarly, a set of unordered rules can be constructed using the unordered CN2 inducer. Rules are learnt for each class individually, with regard to the original learning data. To evaluate candidate hypotheses, the Laplace accuracy measure is used. Having first initialised the learner, we then control the algorithm by modifying its parameters. The underlying components are available to us by accessing the rule finder.

import Orange
data = Orange.data.Table('iris.tab')
learner = Orange.classification.CN2UnorderedLearner()

# consider up to 10 solution streams at one time
learner.rule_finder.search_algorithm.beam_width = 10

# continuous value space is constrained to reduce computation time
learner.rule_finder.search_strategy.bound_continuous = True

# found rules must cover at least 15 examples
learner.rule_finder.general_validator.min_covered_examples = 15

# found rules must combine at most 2 selectors (conditions)
learner.rule_finder.general_validator.max_rule_length = 2

classifier = learner(data)

Induced rules can be quickly reviewed and interpreted. They are each of the form “if cond then predict class”. That is, a conjunction of selectors followed by the predicted class.

for rule in classifier.rule_list:
    print(rule, rule.curr_class_dist.tolist())

IF petal length<=3.0 AND sepal width>=2.9 THEN iris=Iris-setosa [49, 0, 0]
IF petal length>=3.0 AND petal length<=4.8 THEN iris=Iris-versicolor [0, 46, 3]
IF petal width>=1.8 AND petal length>=4.9 THEN iris=Iris-virginica [0, 0, 43]
IF TRUE THEN iris=Iris-virginica [50, 50, 50]  # the default rule

If no other rule fires, the default rule (majority classification) is used. How the default rule is applied varies from one rule inducer to another.

Though rule learning is most frequently used in the context of predictive induction, it can be adapted to subgroup discovery. In contrast to predictive induction, subgroup discovery aims at learning individual patterns or interesting population subgroups, rather than at maximising classification accuracy. Induced rules prove very valuable in terms of their descriptive power. To this end, CN2-SD algorithms were also implemented, as sketched below.
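
A minimal sketch of the unordered CN2-SD variant, assuming the CN2SDUnorderedLearner class exported alongside the other inducers and its gamma parameter for weighted covering:

import Orange
data = Orange.data.Table('iris')
learner = Orange.classification.CN2SDUnorderedLearner()
# gamma controls how quickly the weights of already covered examples decay
learner.gamma = 0.7
classifier = learner(data)
for rule in classifier.rule_list:
    print(rule)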

Hopefully, this addition to the Orange software suite will benefit both novice and expert users looking to advance their knowledge in a particular area of study, through a better understanding of the given predictions and the underlying argumentation.

Association Rules in Orange

Orange is welcoming back one of its more exciting add-ons: Associate! Association rules can help the user quickly and simply discover the underlying relationships and connections between data instances. Yeah!

The add-on currently has two widgets: one for Association Rules and the other for Frequent Itemsets. With Frequent Itemsets we first check the frequency of items and itemsets in our transaction matrix. This tells us which items (products) and itemsets are the most frequent in our data, so it would make a lot of sense to focus on these products. Let’s use this widget on the real Foodmart 2000 data set.


First let’s check our data set. We have 62,560 instances with 102 features. That’s a whole lot of transactions. Now we connect Frequent Itemsets to our File widget and observe the results. We went with quite a low minimal support due to the large number of transactions.

Collapse All will display the most frequent items, so these will be our most important products (‘bestsellers’). Our clients seem to be buying a whole lot of fresh vegetables and fresh fruit. Call your marketing department – you could become the ultimate place to buy fruits and veggies from.


If there’s a little arrow on the left side of an item, you can expand it to see all the other items connected to the selected attribute. So if a person buys fresh vegetables, they are most likely to buy fresh fruit as an accompanying product group. Now you can explore frequent itemsets to understand what really sells in your store.
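
The same analysis can also be scripted. Here is a minimal sketch, assuming the frequent_itemsets function from the orange3-associate package; the toy transaction matrix and the 40% support threshold are made up for illustration:

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets

# toy boolean transaction matrix: rows are baskets, columns are products
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=bool)

# itemsets appearing in at least 40% of transactions,
# reported as (column indices, absolute support count)
for itemset, support in frequent_itemsets(X, 0.4):
    print(sorted(itemset), support)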


Ok. Now how about some transaction flows? We’re mostly interested in the action-consequence relationship here. In other words, if a person buys one item, what is the most likely second item she will buy? Association Rules will help us discover that.

Our parameters will again be adjusted for our data set. We probably want low support, since it will be hard to find a few prevailing rules for 62,000+ transactions. However, we want the discovered rules to be true most of the time, so we increase the confidence.
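
In a script, rules can be generated from frequent itemsets in the same way; a minimal sketch, again assuming the orange3-associate package and made-up data and thresholds:

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules

X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=bool)

# association_rules expects the itemsets as a dict {itemset: support}
itemsets = dict(frequent_itemsets(X, 0.4))

# keep only rules that hold at least 80% of the time
for antecedent, consequent, support, confidence in association_rules(itemsets, 0.8):
    print(sorted(antecedent), '->', sorted(consequent), support, confidence)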


The table on the right displays a list of rules with 6 different measures of association rule quality:

  • support: how often a rule is applicable to a given data set (rule/data)
  • confidence: how frequently items in the consequent Y appear in transactions containing the antecedent X, or in other words, how frequently the rule is true (support of the rule/support of the antecedent)
  • coverage: how often the antecedent items are found in the data set (support of the antecedent/data)
  • strength: (support of the consequent/support of the antecedent)
  • lift: how frequently a rule is true per consequent item (data × confidence/support of the consequent)
  • leverage: the difference between the two items appearing together in a transaction and the two items appearing independently ((support × data − antecedent support × consequent support)/data²)
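
These measures can also be computed in a script; a sketch assuming the add-on’s rules_stats helper, which takes the rules, the itemset dictionary, and the number of transactions:

import numpy as np
from orangecontrib.associate.fpgrowth import (
    frequent_itemsets, association_rules, rules_stats)

X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=bool)

itemsets = dict(frequent_itemsets(X, 0.4))
rules = association_rules(itemsets, 0.8)

# each row: antecedent, consequent, support, confidence,
# coverage, strength, lift, leverage
for row in rules_stats(rules, itemsets, X.shape[0]):
    print(row)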

Orange will rank the rules automatically. Now take a quick look at the rules. How about these two?

fresh vegetables, plastic utensils, deli meats, wine –> dried fruit

fresh vegetables, plastic utensils, bologna, soda –> chocolate candy

These seem to be picnickers – clients who don’t want to spend a whole lot of time preparing their food. The first group is probably more gourmet, while the second seems to enjoy sweets. A logical step would be to place dried fruit closer to the wine section and the candy bars closer to sodas. What do you say? This already happened in your local supermarket? Coincidence? I don’t think so. 🙂


Association rules are a powerful way to improve your business by organizing your physical or online store, adjusting marketing strategies to target suitable groups, providing product recommendations and generally understanding your client base better. Just another way Orange can be used as a business intelligence tool!