Ever had a hard time telling the difference between Claude Monet and Édouard Manet? Orange can help you cluster these two authors and even more, discover which of Monet’s masterpiece is indeed very similar to Manet’s! Use Image Analytics add-on and play with it. Here’s how:
We’ve been having a blast with recent Orange workshops. While Blaž was getting tanned in India, Anže and I went to the charming Liverpool to hold a session for business school professors on how to teach business with Orange.
Obviously, when we say teach business, we mean how to do data mining for business, say predict churn or employee attrition, segment customers, find which items to recommend in an online store and track brand sentiment with text analysis.
For this purpose, we have made some updates to our Associate add-on and added a new data set to Data Sets widget which can be used for customer segmentation and discovering which item groups are frequently bought together. Like this:
We load the Online Retail data set.
Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute distances between items (rows). We could also simply ask Distances widget to compute distances between columns instead of rows. Then we send the transposed data table to Distances and compute cosine distance between items (cosine distance will only tell us, which items are purchased together, disregarding the amount of items purchased).
Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration. 🙂
The second workshop was our standard Introduction to Data Mining for Ministry of Public Affairs.
This group, similar to the one from India, was a pack of curious individuals who asked many interesting questions and were not shy to challenge us. How does a Tree know which attribute to split by? Is Tree better than Naive Bayes? Or is perhaps Logistic Regression better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!
Workshops are always fun, when you have a curious set of individuals who demand answers! 🙂
Our streak of workshops continues. This time we taught professionals from public administration how they can leverage data analytics and machine learning to retrieve interesting information from surveys. Thanks to the Ministry of Public Administration, this is only the first in a line of workshops on data science we are preparing for public sector employees.
For this purpose, we have designed EnKlik Anketa widget, which you can find in Prototypes add-on. The widget reads data from a Slovenian online survey service OneClick Survey and imports the results directly into Orange.
We have prepared a test survey, which you can import by entering a public link to data into the widget. Here’s the link: https://www.1ka.si/podatki/141025/72F5B3CC/ . Copy it into the Public link URL line in the widget. Once you press Enter, the widget loads the data and displays retrieved features, just like the File widget.
The survey is in Slovenian, but we can use Edit Domain to turn feature names into English equivalent.
As always, we can check the data in a Data Table. We have 41 respondents and 7 questions. Each respondent chose a nickname, which makes it easier to browse the data.
Now we can perform familiar clustering to uncover interesting groups in our data. Connect Distances to Edit Domain and Hierarchical Clustering to Distances.
We have two outliers, Pipi and Chad. One is an excessive sportsman (100 h of sport per week) and the other terminally ill (general health -1). Or perhaps they both simply didn’t fill out the survey correctly. If we use the Data Table to filter out Pipi and Chad, we get a fairly good clustering.
We can use Box Plot, to observe what makes each cluster special. Connect Box Plot to Hierarchical Clustering (with the two groups selected), select grouping by Cluster and tick Order by relevance.
Seems like our second cluster (C2) is the sporty one. If we are serving in the public administration, perhaps we can design initiatives targeting cluster C1 to do more sports. It is so easy to analyze the data in Orange!
Data does not always come in a nice tabular form. It can also be a collection of text, audio recordings, video materials or even images. However, computers can only work with numbers, so for any data mining, we need to transform such unstructured data into a vector representation.
For retrieving numbers from unstructured data, Orange can use deep network embedders. We have just started to include various embedders in Orange, and for now, they are available for text and images.
Here, we give an example of image embedding and show how easy is to use it in Orange. Technically, Orange would send the image to the server, where the server would push an image through a pre-trained deep neural network, like Google’s Inception v3. Deep networks were most often trained with some special purpose in mind. Inception v3, for instance, can classify images into any of 1000 image classes. We can disregard the classification, consider instead the penultimate layer of the network with 2048 nodes (numbers) and use that for image’s vector-based representation.
Let’s see this on an example.
Here we have 19 images of domestic animals. First, download the images and unzip them. Then use Import Images widget from Orange’s Image Analytics add-on and open the directory containing the images.
We can visualize images in Image Viewer widget. Here is our workflow so far, with images shown in Image Viewer:
But what do we see in a data table? Only some useless description of images (file name, the location of the file, its size, and the image width and height).
This cannot help us with machine learning. As I said before, we need numbers. To acquire numerical representation of these images, we will send the images to Image Embedding widget.
Great! Now we have the numbers we wanted. There are 2048 of them (columns n0 to n2047). From now on, we can apply all the standard machine learning techniques, say, clustering.
Let us measure the distance between these images and see which are the most similar. We used Distances widget to measure the distance. Normally, cosine distance works best for images, but you can experiment on your own. Then we passed the distance matrix to Hierarchical Clustering to visualize similar pairs in a dendrogram.
This looks very promising! All the right animals are grouped together. But I can’t see the results so well in the dendrogram. I want to see the images – with Image Viewer!
So cool! All the cow family is grouped together! Now we can click on different branches of the dendrogram and observe which animals belong to which group.
But I know what you are going to say. You are going to say I am cheating. That I intentionally selected similar images to trick you.
I will prove you wrong. I will take a new cow, say, the most famous cow in Europe – Milka cow.
This image is quite different from the other images – it doesn’t have a white background, it’s a real (yet photoshopped) photo and the cow is facing us. Will the Image Embedding find the right numerical representation for this cow?
Indeed it has. Milka is nicely put together with all the other cows.
Image analytics is such an exciting field in machine learning and now Orange is a part of it too! You need to install the Image Analytics add on and you are all set for your research!
k-Means is one of the most popular unsupervised learning algorithms for finding interesting groups in our data. It can be useful in customer segmentation, finding gene families, determining document types, improving human resource management and so on.
But… have you ever wondered how k-means works? In the following three videos we explain how to construct a data analysis workflow using k-means, how k-means works, how to find a good k value and how silhouette score can help us find the inliers and the outliers.
#1 Constructing workflow with k-means
#2 How k-means works [interactive visualization]
#3 How silhouette score works and why it is useful
Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for the students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was this year one of them.
The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…
Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.
To get the first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction and simple visual exploration on any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with ImageAnalytics add-on and observed whether we can detect wrongly labelled microscopy images with machine learning.
These workshops are not only fun, but an amazing learning opportunity for us as well, as they show how our users think and how to even further improve Orange.
One of the key techniques of exploratory data mining is clustering – separating instances into distinct groups based on some measure of similarity. We can estimate the similarity between two data instances through euclidean (pythagorean), manhattan (sum of absolute differences between coordinates) and mahalanobis distance (distance from the mean by standard deviation), or, say, through Pearson correlation or Spearman correlation.
Our main goal when clustering data is to get groups of data instances where:
- each group (Ci) is a a subset of the training data (U): Ci ⊂ U
- an intersection of all the sets is an empty set: Ci ∩ Cj = 0
- a union of all groups equals the train data: Ci ∪ Cj = U
This would be ideal. But we rarely get the data, where separation is so clear. One of the easiest techniques to cluster the data is hierarchical clustering. First, we take an instance from, say, 2D plot. Now we want to find its nearest neighbor. Nearest neighbor of course depends on the measure of distance we choose, but let’s go with euclidean for now as it is the easiest to visualize.
Euclidean distance is calculated as:
Naturally, the shorter the distance the more similar the two instances are. In the beginning, all instances are in their own particular clusters. Then we seek for the closest instances of every instance in the plot. We pin down the closest instance and make a cluster of the original and the closest instance. Now we repeat the process again. What is the closest instances to our new cluster –> add it to the cluster –> find the closest instance. We repeat this procedure until all the instances are grouped in one single cluster.
We can write this down also in a form of a pseudocode:
every instance is in its own cluster repeat until instances are all in one group: find the closest instances to the group (distance has to be minimum) join closest instances with the group
Visualization of this procedure is called a dendrogram, which is what Hierarchical clustering widget displays in Orange.
Another thing to consider is the distance between instances when we have already two or more instances in a cluster. Do we go with the closest instance in a cluster or to the furthest one?
- Picture A shows the distances to the closest instance – single linkage.
- Picture B shows the distance to the furthest instance – complete linkage.
- Picture C shows the average of all distances in a cluster to the instance – average linkage.
The downside of single linkage is, even by intuition, creating elongated, stretched clusters. Instances at the top part of the red C are in fact quite different from the lower part of the red C. Complete linkage does much better here as it centers clustering nicely. However, the downside of complete linkage is taking outliers too much into consideration. Naturally, each approach has its own pros and cons and it’s good to know how they work in order to use them correctly. One extra hint: single linkage works great for image recognition, exactly because it can follow the curve.
There’s a lot more we could say about hierarchical clustering, but to sum it up, let’s state pros and cons of this method:
- pros: sums up the data, good for small data sets
- cons: computationally demanding, fails on larger sets
Paint Data widget might initially look like a kids’ game, but in combination with other Orange widgets it becomes a very simple and useful tool for conveying statistical concepts, such as k-means, hierarchical clustering and prediction models (like SVM, logistical regression, etc.).
The widget enables you to draw your data on a 2-D plane. You can name the x and y axes, select the number of classes (which are represented by different colors) and then position the points on a graph.
The data will be represented in a data table with two attributes, where their instances correspond to coordinates in the system. Such data set is great for demonstrating k-means and hierarchical clustering methods. Just like we do below. In the screenshot we see that k-means, with our particular settings, recognizes clusters way better than hierarchical clustering. It returns a score rank, where the best score (the one with the highest value) means the most likely number of clusters. Hierarchical clustering, however, doesn’t even group the right classes together.
Another way to use Paint Data is to observe the performance of classification methods, where we can alter the graph to demonstrate improvement or deterioration of prediction models. By painting the data points we can try to construct the data set, which would be difficult for one but easy for another classifier. Say, why does linear SVM fail on the data set below?
I am lately having fun with Image Viewer. The widget has been recently updated and can display images stored locally or on the internet. But wait, what images? How on earth can Orange now display images if it can handle mere tabular or basket-based data?
Here’s an example. I have considered a subset of animals from the zoo.csv data set (comes with Orange installation), and for demonstration purposes selected only a handful of attributes. I have added a new string attribute (“images”) and declared that this is a meta attribute of the type “image”. The values of this attribute are links to images on the web:
Here is the resulting data set, zoo-with-images.csv. I have used this data set in a schema with hierarchical clustering, where upon selection of the part of the clustering tree I can display the associated images:
Typically and just like above, you would use a string meta attribute to store the link to images. Images can be referred to using a HTTP address, or, if stored locally, using a relative path from the data file location to the image files.
Here is another example, where all the images were local and we have associated them with a famous digits data set ( digits.zip is a data set in the Orange format with the image files). The task for this data set is to classify handwritten digits based on their bitmap representation. In the schema below we wanted to find out which are the most frequent errors some classification algorithm would make, and how do the images of the misclassified digits look like. Turns out that SVM with RBF kernel most often misclassify the digit 9 and confuses it with a digit 3: