Pythagorean Trees and Forests

Classification Trees are great, but what happens when they outgrow even your 27” screen? Can we make the tree fit snugly onto the screen and still tell the whole story? Well, yes we can.

 

The Pythagorean Tree widget shows you the same information as Classification Tree, but far more concisely. Pythagorean Trees represent nodes with squares whose size is proportional to the number of covered training instances. Once the data is split into two subsets, the two new squares form a right triangle on top of the parent square, hence the name Pythagorean Tree. Every square has the color of the prevalent class, with opacity indicating the relative proportion of the majority class in the subset. Details are shown in hover balloons.
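For the curious, here is a back-of-the-envelope sketch (not the widget's actual code) of the geometry behind the name: if the square areas are kept proportional to the number of covered instances, the two child squares and their parent automatically satisfy a² + b² = c², so they fit onto a right triangle.

import math

def child_sides(parent_side, n_left, n_right):
    # square area is proportional to the number of covered instances,
    # so the child squares and the parent satisfy a^2 + b^2 = c^2
    n = n_left + n_right
    a = parent_side * math.sqrt(n_left / n)
    b = parent_side * math.sqrt(n_right / n)
    return a, b

# a root square of side 1.0 whose split covers 70 and 30 instances
print(child_sides(1.0, 70, 30))   # (~0.837, ~0.548)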

ClassificationTree
Classification Tree with titanic.tab data set.

 

PythagoreanTree
Pythagorean Tree with titanic.tab data set.

 

When you hover over a square in Pythagorean Tree, a whole line of parent and child squares/nodes is highlighted. Clicking on a square/node outputs the selected subset, just like in Classification Tree.

PythagoreanTree2
Hovering over a square in the tree highlights its lineage (parent and child nodes) and displays information on the subset represented by the square. The widget outputs the selected subset.

 

Another amazing addition to Orange’s Visualization set is Pythagorean Forest, a visualization of the Random Forest algorithm. Random Forest draws, for each tree, N samples with replacement from a data set of N instances and grows a tree on each sample, which alleviates Classification Tree’s tendency to overfit the data. Pythagorean Forest is a concise visualization of Random Forest, with the individual Pythagorean Trees plotted side by side.
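To make the sampling part concrete, here is a minimal sketch (plain NumPy, not Orange’s internals) of the bootstrap behind Random Forest: each tree gets its own sample of N instances drawn with replacement, so some instances repeat and some are left out.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # pretend we have 10 training instances
for tree in range(3):                    # grow three trees of the forest
    sample = rng.choice(n, size=n, replace=True)   # N draws with replacement
    print(f"tree {tree}: instances {sorted(sample.tolist())}")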

PythagoreanForest
Different trees are grown side by side. Parameters for the algorithm are set in Random Forest widget, then the whole forest is sent to Pythagorean Forest for visualization.

 

This makes Pythagorean Forest a great tool to explain how Random Forest works or to further explore each tree in Pythagorean Tree widget.

 

schema-pythagora

Network analysis with Orange

Visualizing relations between data instances can tell us a lot about our data. Let’s see how this works in Orange. We have a data set on machine learning and data mining conferences and journals, with the number of shared authors reported for each pair of publication venues. We can estimate the similarity between two conferences from their author profiles: two conferences are similar if they attract the same authors. The data set is already 9 years old, but obviously, it’s about the principle. 🙂 We’ve got two data files: one is a distance file with distance scores already calculated with the Jaccard index, and the other is a standard conferences.tab file.
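As a side note, the Jaccard-based distance between two venues is easy to compute by hand; here is a minimal sketch with made-up author sets (not the actual conferences.dst computation):

def jaccard_distance(authors_a, authors_b):
    # 1 - |intersection| / |union|: 0 for identical author sets, 1 for disjoint ones
    a, b = set(authors_a), set(authors_b)
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({"Smith", "Lee", "Kim"}, {"Lee", "Kim", "Novak"}))   # 0.5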

conferences
Conferences.tab data file with the type of the publication venue (conference or journal) and average number of authors and published papers.

 

We load the .tab file with the File widget (the data set already comes with Orange) and the .dst file with the Distance File widget (select ‘Browse documentation data sets’ and choose conferences.dst).

Distance File Widget
You can find conferences.dst in ‘Browse documentation data sets’.

 

Now we would like to create a graph from the distance file. Connect Distance File to Network from Distances. In the widget, we’ve selected a high distance threshold, because we would like to get more connections between nodes. We’ve also checked ‘Include also closest neighbors’ so that each node is connected to at least one other node.

network-from-distances
We’ve set a high distance threshold, since we wanted to display connections between most of our nodes.

 

We can visualize our graph in Network Explorer. What we get is a quite uninformative network of conferences with labelled nodes. Now for the fun part. Connect the File widget with Network Explorer and set the link type to ‘Node Data’. This will match the two domains and display additional labelling options in Network Explorer.

link-to-node-data
Remove the ‘Node Subset’ link and connect Data to Node Data. This will display other attributes in Network Explorer by which you can label and color your network nodes.

 

network-explorer-conferences
Nodes are colored by event type (conference or journal) and adjusted in size by the average number of authors per event (bigger nodes represent larger events).

 

We’ve colored the nodes by type and set the node size to the number of authors per conference/paper. Finally, we’ve set the node label to ‘name’. It seems the International Conference on AI and Law and the AI and Law journal are connected through shared authors. The same goes for the AI in Medicine in Europe conference and the AI and Medicine journal. The connections indeed make sense.

conference1
The entire workflow.

 

There are many other things you can do with the Networks add-on in Orange. You can color nodes by predictions, highlight misclassifications or output only nodes with certain network parameters. But for today, let this be it.

Rehaul of Text Mining add-on

Google Summer of Code is progressing nicely and some major improvements are already live! Our students have been working hard, and today we’re thanking Alexey for his work on the Text Mining add-on. The two major tasks before the midterms were to introduce the Twitter widget and to overhaul Preprocess Text. The Twitter widget was designed to be a part of our summer school program and it worked beautifully. We introduced youngsters to the world of data mining through social networks, and one of the most exciting things was to see whether we could predict the author from the tweet content.

The Twitter widget offers many functionalities. Since we wanted to get tweets from specific authors, we entered their Twitter handles as queries and set ‘Search by Author’. We only included Author, Content and Date in the query parameters, as we wanted to predict the author on the basis of text alone.

Twitter1-stamped

  1. Provide API keys.
  2. Insert queries separated by newline.
  3. Search by content, author or both.
  4. Set the date range (the tweepy module imposes a one-week limit).
  5. Select the language you want your tweets to be in.
  6. If ‘Max tweets’ is checked, you can set the maximum number of tweets you want to query. Otherwise the widget will provide all tweets matching the query.
  7. If ‘Accumulate results’ is checked, new queries will be appended to the old ones.
  8. Select what kind of data you want to retrieve.
  9. Tweet count.
  10. Press ‘Search’ to start your query.

We got 208 tweets on the output. Not bad. Before making any predictions, we need to preprocess the tweets. We transformed all words to lowercase and split (tokenized) the text into words. We didn’t use any normalization (it is turned on in the screenshot below just as an example) and applied simple stopword removal. A plain-Python sketch of the same steps follows the list below.

PreprocessText1-stamped

  1. Information on the input and output.
  2. Transformation applies basic modifications of text.
  3. Tokenization splits the corpus into tokens according to the selected method (regexp is set to extract only words by default).
  4. Normalization lemmatizes words (do, did, done –> do).
  5. Filtering extracts only desired tokens (without stopwords, including only specified words, or by frequency).
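Outside the widget, the same steps can be sketched in a few lines of plain Python (the stopword list here is a tiny made-up stand-in for the widget’s built-in one):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

def preprocess(tweet):
    tweet = tweet.lower()                     # transformation: lowercase
    tokens = re.findall(r"\w+", tweet)        # tokenization: words only
    return [t for t in tokens if t not in STOPWORDS]   # filtering: drop stopwords

tweets = ["Data mining is FUN!", "Orange makes data mining easy."]
tokens = [preprocess(t) for t in tweets]
print(tokens)
print([Counter(t) for t in tokens])           # a crude per-tweet bag of words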

Then we passed the tokens through a Bag of Words and observed the results in a Word Cloud.

wordcloud-twitter

Then we simply connected Bag of Words to Test & Score and used several classifiers to see which one works best. We used Classification Tree and Nearest Neighbors, since they are easy to explain even to teenagers. Classification Tree in particular offers a nice visualization in Classification Tree Viewer that makes the idea of the algorithm easy to understand. Moreover, we could observe the most distinctive words in the tree.

classtree1

Do these make any sense? You be the judge. 🙂

We checked the classification results in Test&Score, counted misclassifications in Confusion Matrix and finally observed them in Corpus Viewer. k-NN seems to perform moderately well, while Classification Tree fails miserably. Still, this was trained on barely 200 tweets. Perhaps accumulating results over time would give us much better results. You can certainly try it on your own! Update your Orange3-Text add-on or install it via ‘pip install Orange3-Text’!

schema-twitter-preprocess

Above is the final workflow: preprocessing on the left, testing and scoring at the bottom right, and construction of the classification tree at the top right.

Scripting with time variable

It’s always fun to play around with data. And since Orange can, as of a few months ago, read temporal data, we decided to parse some data we had and put it into Orange.

TimeVariable is an extension of the continuous variable and works with properly formatted ISO-standard datetimes (Y-M-D h:m:s). Oftentimes our original data is not in the right format and needs to be edited first so that Orange can read it. Python’s own datetime module is of great help: you can give it a date in any format and tell it how to interpret that format in the second argument.

import datetime

date = "13.03.2013 13:13:31"
new_date = str(datetime.datetime.strptime(date, "%d.%m.%Y %H:%M:%S"))
print(new_date)   # 2013-03-13 13:13:31

 

Do this for all your datetime attributes. This will transform them into strings that Orange’s TimeVariable can read. Then create a new data table:

import Orange
from Orange.data import Domain, Table, TimeVariable

# create the time variable and a one-column domain
time_var = TimeVariable.make("timestamp")
domain = Domain([time_var])
timestamps = ["2013-03-13 13:13:31", "2014-04-14 14:14:41", "2015-05-15 15:15:51"]
# parse() turns each ISO string into the float that TimeVariable stores internally;
# each value is wrapped in its own list so that we get a one-column table
time_data = Table(domain, [[time_var.parse(t)] for t in timestamps])

 

Now say you have some original data you want to append your new data to.

data = Orange.data.Table.concatenate([original_data, time_data])
data.save("data.tab")

 

But what if you want to select only a few attributes from the original data? It can be arranged.

original_data = Orange.data.Table("original_data.tab")
# attribute names are placeholders; with source= they are looked up in the original domain
new_domain = Orange.data.Domain(["attribute_1", "attribute_2"], source=original_data.domain)
new_data = Orange.data.Table(new_domain, original_data)

 

Then concatenate again:

data = Orange.data.Table.concatenate([new_data, time_data])
data.save("selected_data.tab")

 

Remember, if your data has string variables, they will always be in meta attributes.

domain = Orange.data.Domain(["some_attribute1", "other_attribute2"],
                            metas=["some_string_variable"], source=data.domain)

 

Have fun scripting!

Oasys: Orange Canvas applied to Optical Physics

This week we’re hosting experts in optical physics from Elettra Sincrotrone Trieste and the European Synchrotron Radiation Facility in our laboratory. For a long time they have been interested in developing a user interface that integrates different simulation tools and data analysis software within one environment. It all came true with Orange Canvas and the OASYS system. We already wrote about this two years ago, when the idea first came up. Now the actual software is ready and is being used by researchers for everyday analysis and prototyping.

 

OASYS is basically pure Orange Canvas (Orange, but without the data mining widgets) reconfigured for the needs of optical physicists. What our partners from Italy did (with the help of our lab) was bring the optics simulation software used in synchrotron facilities into a single graphical user interface. What is especially incredible is that they managed to transform Orange into a simulation platform for building synchrotron beamlines.

 

In essence, researchers at synchrotrons experiment with actual physical objects, such as mirrors and crystals of different shapes and sizes, to transmit photons from the synchrotron’s light sources to the experimental endstations. They measure a broad array of material properties through the interaction with the synchrotron light, and they try to simulate different experimental settings before actually building a real-life experiment in the synchrotron. And this is where OASYS truly shines.

beamline in synchrotron


Widgets in this case become parts of the simulation pipeline. Each widget has an input and output beam of light, just like a real-life device, and the parameters within the widget are the physical properties of a particular experimental object. Scientists can thus model the experiment in advance and do it much more quickly and easily than before.

visualization of light properties

Furthermore, Orange and OASYS provide a user-friendly GUI that domain experts can quickly get used to. There is anecdotal evidence of renowned physicists who preferred to do their analysis with outdated simulation tools; after using OASYS for just a few days, they were already completely comfortable with it and could reproduce previously calculated results without any problem. Moreover, they did it within days instead of weeks.

 

This is the power of visual programming – providing a user-friendly interface for automating complicated calculations and quick prototyping.

Association Rules in Orange

Orange is welcoming back one of its more exciting add-ons: Associate! Association rules can help the user quickly and simply discover the underlying relationships and connections between data instances. Yeah!

 

The add-on currently has two widgets: one for Association Rules and the other for Frequent Itemsets. With Frequent Itemsets we first check the frequency of items and itemsets in our transaction matrix. This tells us which items (products) and itemsets are the most frequent in our data, so it would make a lot of sense to focus on these products. Let’s use this widget on the real Foodmart 2000 data set.

blog5

 

First, let’s check our data set. We have 62,560 instances with 102 features. That’s a whole lot of transactions. Now we connect Frequent Itemsets to our File widget and observe the results. We went with quite a low minimal support due to the large number of transactions.

Collapse All will display the most frequent items, so these will be our most important products (‘bestsellers’). Our clients seem to be buying a whole lot of fresh vegetables and fresh fruit. Call your marketing department – you could become the ultimate place to buy fruits and veggies from.

blog2

 

If there’s a little arrow on the left side of an item, you can expand it to see all the other items connected to the selected attribute. So if a person buys fresh vegetables, they are most likely to buy fresh fruit as an accompanying product group. Now you can explore frequent itemsets to understand what really sells in your store.

blog3

 

Ok. Now how about some transaction flows? We’re mostly interested in the action-consequence relationship here. In other words, if a person buys one item, what is the most likely second item she will buy? Association Rules will help us discover that.

 

Our parameters will again be adjusted for our data set. We probably want low support, since it will be hard to find a few prevailing rules for 62,000+ transactions. However, we want the discovered rules to be true most of the time, so we increase the confidence.

blog1

 

The table on the right displays a list of rules with six different measures of association rule quality (a small numeric sketch follows the list):

  • support: how often a rule is applicable to a given data set (rule/data)
  • confidence: how frequently items in Y appear in transactions that contain X; in other words, how frequently the rule is true (support of the rule/support of the antecedent)
  • coverage: how often the antecedent is found in the data set (support of antecedent/data)
  • strength: (support of consequent/support of antecedent)
  • lift: how frequently a rule is true per consequent item (data * confidence/support of consequent)
  • leverage: the difference between how often the two items appear in a transaction together and how often they would appear independently ((support * data − antecedent support * consequent support)/data²)
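Here is the promised numeric sketch of these measures, computed from made-up absolute counts for a single rule X –> Y (the counts are hypothetical; the formulas are the ones listed above):

n_data = 62560   # all transactions
n_x = 5000       # transactions containing the antecedent X
n_y = 8000       # transactions containing the consequent Y
n_xy = 1200      # transactions containing both X and Y (the rule fires)

support = n_xy / n_data
confidence = n_xy / n_x
coverage = n_x / n_data
strength = n_y / n_x
lift = confidence / (n_y / n_data)                      # = data * confidence / support of consequent
leverage = support - (n_x / n_data) * (n_y / n_data)

for name, value in [("support", support), ("confidence", confidence),
                    ("coverage", coverage), ("strength", strength),
                    ("lift", lift), ("leverage", leverage)]:
    print(f"{name:10s} {value:.4f}")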

 

Orange ranks the rules automatically. Now take a quick look at the rules. How about these two?

fresh vegetables, plastic utensils, deli meats, wine –> dried fruit

fresh vegetables, plastic utensils, bologna, soda –> chocolate candy

These seem to be picnickers, clients who don’t want to spend a lot of time preparing their food. The first group is probably more gourmet, while the second seems to enjoy sweets. A logical step would be to place dried fruit closer to the wine section and the candy bars closer to the sodas. What do you say? This already happened in your local supermarket? Coincidence? I don’t think so. 🙂

blog6

 

Association rules are a powerful way to improve your business by organizing your actual or online store, adjusting marketing strategies to target suitable groups, providing product recommendations and generally understanding your client base better. Just another way Orange can be used as a business intelligence tool!

 

 

Univariate GSoC Success

The Google Summer of Code application period has come to an end. We received 34 applications, some of which were of truly high quality. Now it’s up to us to select the top-performing candidates, but before that we wanted to get an overview of the candidate pool. We gathered the data from our Google Form application and gave it a quick look in Orange.

First, we needed to preprocess the data a bit, since it came in a messy form of strings. Feature Constructor to the rescue! We wanted to extract OS usage across users, so we first made three new variables named ‘uses linux’, ‘uses windows’ and ‘uses osx’ to represent our three new columns. For each column we searched through ‘OS_of_choice_and_why’, looked up the value of the column, converted it to a string, put the string in lowercase and checked for mentions of ‘linux’, ‘windows’ or ‘osx’. Voilà: if a mention occurred in the string, we marked the column with 1, otherwise with 0.

 

blog10

The expression is just a logical statement in Python and works with booleans (0 if False and 1 if True):

'linux' in str(OS_of_choice_and_why_.value).lower() or 'ubuntu' in str(OS_of_choice_and_why_.value).lower()

 

Another thing we might want to do is create three discrete values for the ‘Dogs or cats’ question. We want Orange to display ‘dogs’ for someone who replied ‘dogs’, ‘cats’ for someone who replied ‘cats’ and ‘?’ if the answer was blank or very creative (we had people who wanted to be elephants and butterflies 🙂 ).

To create three discrete values you would write:

0 if 'dogs' in str(Dogs_or_cats_.value).lower() else 1 if  'cats' in str(Dogs_or_cats_.value).lower() else 2

Since we have three values, we need to assign them the corresponding indexes. So if there is ‘dogs’ in the reply, we would get 0 (which we converted to ‘dogs’ in the Feature Constructor’s ‘Values’ box), 1 if there’s ‘cats’ in the reply and 2 if none of the above apply.

blog9

OK, the next step was to sift through a big pile of attributes. We removed personal information due to privacy concerns and selected the attributes we cared about the most: for example, programming skills, years of experience, contributions to OSS and, of course, whether someone is a dog or a cat person. 🙂 Select Columns sorts out the problem. Here you can download a mock-up workflow (same as above, but without sensitive data).

Now for some lovely charts. Enjoy!

blog5
Python is our lingua franca, experts wanted!

 

blog8
20 years of programming experience? Hello outlier!

 

blog2
OSS all the way!

 

blog3
Some people love dogs and some love cats. Others prefer elephants and butterflies.

 

 

Version 3.3.1 – Updates and Features

About a week ago we issued an updated stable release of Orange, version 3.3.1. We’ve introduced some new functionalities and improved a few old ones.

Here’s what’s new in this release:

1. New widgets: Distance Matrix for visualizing distance measures in a matrix, Distance Transformation for normalization and inversion of distance matrices, and Save Distance Matrix and Distance File for saving and loading distances. Last week we also mentioned the really amazing Silhouette Plot, which helps you visually assess cluster quality.

blog11

 

2. Orange can now read datetime variables in its Time Variable format.

blog12

 

3. Rank outputs scores for each scoring method.

blog13

 

4. A Report function has been added to the Linear Regression, Univariate Regression, Stochastic Gradient Descent and Distance Transformation widgets.

blog14

 

5. FCBF algorithm has been added to Rank for feature scoring and ReliefF now supports missing target values.

6. Graphs in Classification Tree Viewer can be saved in .dot format.

 

You can view the entire changelog here. 🙂 Enjoy the improvements!

All I see is Silhouette

Silhouette plot is such a nice method for visually assessing cluster quality and the degree of cluster membership that we simply couldn’t wait to get it into Orange3. And now we have.

For each data instance, the visualization compares the average distance to the instances within its own cluster with the average distance to the instances in the nearest other cluster. A silhouette close to 1 indicates that the data instance is close to the center of its cluster, while instances with silhouette scores close to 0 lie on the border between two clusters. The overall quality of the clustering can be assessed by the average silhouette score of the data instances. But here, we are more interested in the individual silhouettes and their visualization in the silhouette plot.
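For readers who prefer scripting, the same per-instance scores can be computed outside Orange with a few lines of scikit-learn (a minimal sketch, assuming scikit-learn is installed):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = silhouette_samples(X, labels)                 # one score per data instance
print("average silhouette:", silhouette_score(X, labels))
print("borderline instances:", (scores < 0.1).sum())   # scores close to 0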

Using the good old iris data set, we are going to assess the silhouettes for each of the data instances. In k-Means we set the number of clusters to 3 and send the data to Silhouette Plot. Good clusters should include instances with high silhouette scores, but we’re doing the opposite: in Orange, we select instances with scores close to 0 in the silhouette plot and pass them to other widgets for exploration. No surprise, they lie at the periphery of two clusters, which the scatter plot demonstrates perfectly.

silhouette4

Let’s do something wild now. We’ll use the silhouette on a class attribute of Iris (no clustering here, just using the original class values from the data set). Here is our hypothesis: the data instances with low silhouette values are also those that will be misclassified by some learning algorithm. Say, by a random forest.

silhouette1

We will use ten-fold cross-validation in Test&Score, send the evaluation results to Confusion Matrix and select the misclassified instances in that widget. Then we will check, in a Venn Diagram, how many of these misclassifications fall within the set of low-silhouette instances. The agreement (i.e. the intersection in the Venn diagram) between the two techniques is quite high.

silhouette3

Finally, we can observe these instances in the Scatter Plot. Classifiers indeed have problems with borderline data instances. Our hypothesis was correct.

silhouette4

Silhouette plot is yet another one of the great visualizations that can help you with data analysis or with understanding certain machine learning concepts. What did we say? Fruitful and fun!

 

 

Overfitting and Regularization

A week ago I used Orange to explain the effects of regularization. This was the second lecture in the Data Mining class; the first one was on linear regression. My introduction to the benefits of regularization used a simple data set with a single input attribute and a continuous class. I drew a data set in Orange and then used the Polynomial Regression widget (from the Prototypes add-on) to plot the linear fit. This widget can also expand the data set by adding columns with powers of the original attribute x, thereby augmenting the training set with x^p, where p is an integer going from 2 to K. The polynomial expansion of the data set allows the linear regression model to fit the data nicely, and with higher K to overfit it to an extreme, especially when the number of data points in the training set is low.

poly-overfit

We already blogged about this experiment a while ago, showing that linear regression coefficients blow out of proportion with increasing K. This leads to the idea that linear regression should not only minimize the squared error when predicting the value of the dependent variable on the training set, but also keep the model coefficients low, or better, penalize any high coefficient values. This procedure is called regularization. Based on the type of penalty (sum of squared coefficients or sum of their absolute values), the regularization is referred to as L2 or L1, also known as ridge or lasso regression, respectively.

It is quite easy to play with regularized models in Orange by attaching a Linear Regression widget to Polynomial Regression, thereby substituting the default model used in Polynomial Regression with the one designed in the Linear Regression widget. This makes different kinds of regularization available. The workflow can be used to show that regularized models overfit the data less, and that the degree of overfitting depends on the regularization coefficient, which governs the penalty stemming from the values of the coefficients of the linear model.

poly-l2

I also use this workflow to show the difference between L1 and L2 regularization. The change of regularization type is most pronounced in the table of coefficients (Data Table widget), where it is clear that L1 regularization drives many of the coefficients to 0. Try this with a high degree of polynomial expansion and a data set with about 10 data points. Also, try changing the regularization strength (Linear Regression widget).
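The same contrast can be reproduced in a few lines of scripting (scikit-learn here, not the Orange widgets; the data is synthetic): expand a single attribute to powers up to degree 9 and count how many coefficients each model pushes to zero.

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10)).reshape(-1, 1)           # about 10 data points
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 10)
X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

for name, model in [("no regularization", LinearRegression()),
                    ("ridge (L2)", Ridge(alpha=0.01)),
                    ("lasso (L1)", Lasso(alpha=0.01, max_iter=100000))]:
    coef = model.fit(X, y).coef_
    print(f"{name:18s} zero coefficients: {np.sum(np.isclose(coef, 0))}")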

poly-l1

While the effects of overfitting and regularization are nicely visible in the plot in the Polynomial Regression widget, machine learning models are really about predictions, and the quality of predictions should be estimated on an independent test set. So at this stage of the lecture I needed to introduce model scoring, that is, a measure that tells me how well a model inferred from the training set performs on the test set. For simplicity, I chose the root mean squared error (RMSE) and then crafted the following workflow.

poly-evaluate

Here, I drew the data set (Paint Data, about 20 data instances), assigned y as the target variable (Select Columns), split the data into training and test sets of approximately equal sizes (Data Sampler), and passed the training and test data and the linear model to the Test & Score widget. I can then use linear regression with no regularization and observe how RMSE changes with the degree of the polynomial, alternating between Test on train data and Test on test data (Test & Score widget). In class I used the blackboard to record this dependency. For the data from the figure, I got the following table:

Poly K    RMSE Train    RMSE Test
0         0.147         0.138
1         0.155         0.192
2         0.049         0.063
3         0.049         0.063
4         0.049         0.067
5         0.040         0.408
6         0.040         0.574
7         0.033         2.681
8         0.001         5.734
9         0.000         4.776
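For those who would rather script this experiment than paint it, here is a rough equivalent with scikit-learn (synthetic data, so the numbers will differ from the table above, but the pattern of a vanishing train error and an exploding test error is the same):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20).reshape(-1, 1)                    # about 20 painted points
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)

for degree in range(1, 10):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    rmse_train = mean_squared_error(y_train, model.predict(poly.transform(x_train))) ** 0.5
    rmse_test = mean_squared_error(y_test, model.predict(poly.transform(x_test))) ** 0.5
    print(f"degree {degree}: train RMSE {rmse_train:.3f}   test RMSE {rmse_test:.3f}")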

That’s it. For a class of computer scientists, one could do all this in scripting, but for any other audience, or for any introductory lesson, explaining regularization with Orange widgets is a lot of fun.