Last week Blaž, Marko and I held a week-long introductory Data Mining and Machine Learning course at the Ljubljana Doctoral Summer School 2018. We got a room full of dedicated students and embarked on a journey through standard and advanced machine learning techniques, all presented, of course, in Orange. We covered a wide array of topics, from different clustering techniques (hierarchical clustering, k-means) to predictive models (logistic regression, naive Bayes, decision trees, random forests), regression and regularization, projections, text mining and image analytics.
Definitely the biggest crowd-pleaser was the Geo add-on in combination with the HDI data set. First, we got the HDI data from Datasets. A quick glimpse into a data table to check the output. We have information on some key performance indicators gathered by the United Nations for 188 countries. Now we would like to know which countries are similar based on the reported indicators. We will use Distances with Euclidean distance and use Ward linkage in Hierarchical Clustering.
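For readers who like to see the idea behind the widgets, here is a minimal pure-Python sketch of agglomerative clustering with the Ward criterion on Euclidean distances. The toy points and all function names are made up for illustration, and the naive merge loop is an assumption chosen for readability, not a description of how Orange implements it.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))  # squared Euclidean
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering: merge the cheapest pair until done."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical two-indicator data for six "countries" (made-up values).
toy = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.8), (0.88, 0.85), (0.5, 0.1), (0.52, 0.12)]
groups = ward_clustering(toy, 3)
```

Stopping at a fixed number of clusters plays the role of the cutoff line in the dendrogram.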
We got our results in a dendrogram. Interestingly, the United States seems similar to Cuba. Let us select this cluster and inspect what the most significant feature for this cluster is. We will use the Data output of Hierarchical Clustering, which appends a column indicating whether each data instance was selected or not. Then we will use Box Plot, group by Selected and check Order by relevance. It seems like these countries have the longest life expectancy at age 59. Go ahead and inspect other clusters by yourself!
Of course, when we are talking about countries one naturally wants to see them on a map! That is easy. We will use the Geo add-on. First, we need to convert all the country names to geographical coordinates. We will do this with Geocoding, where we will encode column Country to latitude and longitude. Remember to use the same output as before, that is Data to Data.
Now, let us display these countries on a map with the Choropleth widget. Beautiful. It is so easy to explore country data when you see it on a map. You can also try coloring by HDI or any other feature.
The final workflow:
We always try to keep our workshops fresh and interesting and visualizations are the best way to achieve this. Till the next workshop!
This week we held our first Girls Go Data Mining workshop. The workshop brought together curious women and intuitively introduced them to essential data mining and machine learning concepts. Of course, we used Orange to explore visualizations, build predictive models, perform clustering and dive into text analysis. The workshop was supported by NumFocus through their small development grant initiative and we hope to repeat it next year with even more ladies attending!
In two days, we covered many topics. On day one, we got to know Orange and the concept of visual programming, where the user constructs analytical workflows by stacking visual components. Then we got to know several useful visualizations, such as box plot, scatter plot, distributions, and mosaic display, which give us an initial overview of the data and the potentially interesting patterns. Finally, we got our hands dirty with predictive modeling. We learnt about decision trees, logistic regression, and naive Bayes classifiers, and observed the models in tree viewer and nomogram. It is great having interpretable models and we had great fun exploring what is in the model!
On the second day, we tried to uncover groups in our data with clustering. First, we tried hierarchical clustering and explored the discovered clusters with box plot. Then we also tried k-means and learnt why this method is better than hierarchical clustering. In the final part, we talked about the methods for text mining: how to do preprocessing, construct a bag of words and perform machine learning on corpora. We used both clustering and classification and tried to find interesting information about Grimm tales.
One thing that always comes up as really useful in our workshops is Orange’s ability to output different types of data. For example, in Hierarchical Clustering, we can select the similarity cutoff at the top and output clusters. Our data table will have an additional column Cluster, with cluster labels for each data instance.
We can explore clusters by connecting a Box Plot to Hierarchical Clustering, selecting Cluster in Subgroups and using Order by relevance option. This sorts the variables in Box Plot by how well they separate between clusters or, in other words, what is typical of each cluster.
We used zoo.tab and made the cutoff at three clusters. It looks like the first cluster gives milk. Could this be a cluster of mammals?
Indeed it is!
Another option is to select a specific cluster in the dendrogram. Then, we have to rewire the connection between Hierarchical Clustering and Box Plot by setting it to Data. The Data option outputs the entire data set, with an extra column showing whether the data instance was selected or not. In our case, there would be a Yes if the instance is in the selected cluster and No if it is not.
Then we can use Box Plot to observe what is particular for our selected cluster.
It looks like animals from our selected cluster have feathers. Probably, this is a cluster of birds. We can check this with the same procedure as above.
In summary, most Orange visualizations have two outputs – Selected Data and Data. Selected Data will output a subset of data instances selected in the visualization (or selected clusters in the case of hierarchical clustering), while Data will output the entire data table with a column defining whether a data instance was selected or not. This is very useful if we want to inspect what is typical of an interesting group in our data, inspect clusters or even manually define groups.
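The difference between the two outputs can be sketched in a few lines of plain Python. The function names and the toy rows below are hypothetical and only mimic the behaviour described above; they are not Orange's API.

```python
def annotate_selection(rows, selected_indices):
    """Return copies of all rows with an extra 'Selected' column (Yes/No)."""
    chosen = set(selected_indices)
    return [
        {**row, "Selected": "Yes" if i in chosen else "No"}
        for i, row in enumerate(rows)
    ]


def selected_only(rows, selected_indices):
    """Return just the chosen rows, like a 'Selected Data' output."""
    chosen = set(selected_indices)
    return [row for i, row in enumerate(rows) if i in chosen]


# Made-up rows loosely modelled on zoo.tab.
animals = [
    {"name": "bear", "milk": 1, "feathers": 0},
    {"name": "dove", "milk": 0, "feathers": 1},
    {"name": "carp", "milk": 0, "feathers": 0},
]
annotated = annotate_selection(animals, [1])  # whole table plus Selected
subset = selected_only(animals, [1])          # only the selected row
```

Keeping the whole table with a Yes/No column is what makes it possible to compare the selected group against everything else in Box Plot.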
Overall, this was another interesting workshop and we hope to continue our fruitful partnership with NumFocus and keep offering free educational events for beginners and experts alike!
Today we have finished a series of workshops for the Ministry of Public Affairs. This was a year-long cooperation and we had many students asking many different questions. There was however one that we talked about a lot. If I have a survey, how do I get it into Orange?
We are using the EnKlik Anketa service, a great Slovenian product offering a wide array of options for creating surveys. We have created one simple survey to use as a test. I am now inside the EnKlik Anketa online service and I can see my survey has been successfully filled out.
Now I have to create a public link to my survey in order to access the data in Orange. I have to click on an icon in the top right part and select ‘Public link’.
A new window opens, where I select ‘Add new public link’. This will generate a public connection to my survey results. But be careful, the type of the connection needs to be Data, not Analysis! Orange can’t read already analyzed data; it needs raw data from the Data pane.
Now, all I have to do is open Orange, place the EnKlik Anketa widget from the Prototypes add-on onto the canvas, enter the public link into the ‘Public link URL’ field and press Enter. If your data has loaded successfully, the widget will display available variables and information in the Info pane.
From here on you can continue your analysis just like you would with any other data source!
Last week Marko and I visited the land of the midnight sun – Norway! We held a two-day workshop on spectroscopy data analysis in Orange at the Norwegian University of Life Sciences. The students from BioSpec lab were yet again incredible and we really dug deep into Orange.
One thing we did was see how to join data from two different sources. It would often happen that you have measurements in one file and the labels in the other. Or in our case, we wanted to add images to our zoo.tab data. First, find the zoo.tab in the File widget under Browse documentation datasets. Observe the data in the Data Table.
This data contains 101 animals described with 16 different features (hair, aquatic, eggs, etc.), a name and a type. Now we will manually create the second table in Excel. The first column will contain the names of the animals as they appear in the original file. The second column will contain links to images of animals. Open your favorite browser and find a couple of images corresponding to selected animals. Then add links to images below the image column. Just like that:
Remember, you need a three-row header to define the column that contains images. In the image column, put string in the second row and type=image in the third row. This will tell Orange where to look for images. Now, we can check our animals in Image Viewer.
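If you would rather generate such a file than type it in Excel, a minimal sketch could look like the one below. It assumes a tab-separated file, the image links are placeholders, and the values for the name column in the second and third header rows (string and an empty cell) are assumptions; only the string and type=image entries for the image column come from the description above.

```python
import csv
import io

# Three header rows: column names, column types, and extra flags.
header = [
    ["name", "image"],
    ["string", "string"],
    ["", "type=image"],
]
# Placeholder links; replace with real image URLs.
rows = [
    ["bear", "https://example.com/bear.jpg"],
    ["dove", "https://example.com/dove.jpg"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(header + rows)
tab_text = buffer.getvalue()  # write this text to a .tab file to load it
```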
Finally, it is time to bring in the images to the existing zoo data set. Connect the original File to Merge Data. Then add the second file with animal images to Merge Data. The default merging method will take the first data input as original data and the second data as extra data. The column to match by is defined in the widget. In our case, it is the name column. This means Orange will look at the first name column and find matching instances in the second name column.
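Conceptually, this kind of merge keeps every row of the original data and appends the matching columns from the extra data. Here is a minimal sketch in plain Python, with a hypothetical merge_by_key helper and made-up rows; it illustrates the idea and is not the widget's implementation.

```python
def merge_by_key(data, extra, key):
    """Keep every row of `data` and append matching columns of `extra`."""
    lookup = {row[key]: row for row in extra}
    # Columns contributed by the extra table, in first-seen order.
    extra_columns = list(
        dict.fromkeys(c for row in extra for c in row if c != key)
    )
    merged = []
    for row in data:
        match = lookup.get(row[key], {})
        out = dict(row)
        for col in extra_columns:
            out[col] = match.get(col)  # None when there is no match
        merged.append(out)
    return merged


# Made-up rows; the image link is a placeholder.
zoo = [{"name": "bear", "hair": 1}, {"name": "carp", "hair": 0}]
images = [{"name": "bear", "image": "https://example.com/bear.jpg"}]
merged = merge_by_key(zoo, images, "name")
```

Animals without an image simply get a missing value in the new column, so no rows of the original data are lost.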
A quick look at the merged data shows us an additional image column that we appended to the original file.
This is the final workflow. Merge Data now contains a single data table on the output and you can continue your analysis from there.
Find out more about spectroscopy for Orange on our YouTube channel or contribute to the project on Github.
Python Script is this mysterious widget most people don’t know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer. And it’s time we unveil some of its functionalities with a simple example.
Example: Batch Transform the Data
There might be a time when you need to apply a function to all your attributes. Say you wish to log-transform their values, as is common in gene expression data. In theory, you could do this with Feature Constructor, where you would log-transform every attribute individually. Sounds laborious? It’s because it is. Why else do we have computers if not to reduce manual labor for certain tasks? Let’s do it the fast way – with Python Script.
First, open File widget and load geo-gds360.tab from Browse documentation data sets. This data set has 9485 features, so imagine having to transform each feature individually.
Instead, we will connect Python Script to File and use a simple script to apply the same transformation to all attributes.
import numpy as np
from Orange.data import Table
# in_data is the table on the widget's input; X holds the feature values
new_X = np.log(in_data.X)
# build a new table with the same domain, class values and metas
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)
This is really simple. Use in_data.X, which accesses all features in the data set, to transform the data with np.log (or any other numpy function). Set out_data to a new table built from new_X and, voila, the transformed data is on the output. In a few lines we have instantly handled all 9485 features.
You can inspect the data before and after transformation in a Data Table widget.
This is it. Now we can do our standard analysis on the transformed data. Even better! We can save our script and use it in Python Script widget any time we want.
For your convenience I have already added the Log Attributes Script, so you can download and use it instantly!
Have a more interesting example with Python Script? We’d love to hear about it!
Ever had a hard time telling the difference between Claude Monet and Édouard Manet? Orange can help you cluster these two authors and, even more, discover which of Monet’s masterpieces is indeed very similar to Manet’s! Use the Image Analytics add-on and play with it. Here’s how:
Janez and I have recently returned from a two-week stay in Moscow, Russian Federation, where we were teaching data mining to MA students of Applied Statistics. This is a new Master’s course that attracts the best students from different backgrounds and teaches them statistical methods for work in the industry.
It was a real pleasure working at HSE. The students were proactive by asking questions and really challenged us to do our best.
One of the things we did was compute the minimum cost of misclassifications. The story goes like this. Sara is a doctor and has data on 303 patients with heart disease (Orange’s heart-disease.tab data set). She used some classifiers and now has to decide how many patients to send for further tests. The Naive Bayes classifier, for example, returned probabilities of a patient being sick (column Naive Bayes 1). For each threshold in probabilities, she will compute how many false positives (patients declared sick when healthy) and how many false negatives (patients declared healthy when sick) a classifier returns. Each mistake is associated with a cost. Now she wants to find out how many patients to send for tests (what probability threshold to choose) so that her cost is the lowest.
First, import all the libraries we will need:
import matplotlib.pyplot as plt
import numpy as np
from Orange.data import Table
from Orange.classification import NaiveBayesLearner, TreeLearner
from Orange.evaluation import CrossValidation
Then load heart disease data (and print a sample).
heart = Table("heart_disease")
Now, train classifiers and select probabilities of Naive Bayes for a patient being sick.
scores = CrossValidation(heart, [NaiveBayesLearner(), TreeLearner()])
#take probabilities of class 1 (sick) of NaiveBayesLearner (the first learner)
p1 = scores.probabilities[0][:, 1]
#take actual class values
y = scores.actual
#cost of false positive (patient classified as sick when healthy)
fp_cost = 500
#cost of false negative (patient classified as healthy when sick)
fn_cost = 800
Set counts, where we declare 0 patients being sick (threshold >1).
fp = 0
#start with threshold above 1 (no one is sick)
fn = np.sum(y)
For each threshold, compute the cost associated with each type of mistake.
ps = []
costs = []
#compute costs of classifying i patients as sick
for i in np.argsort(p1)[::-1]:
    if y[i] == 0:
        #a healthy patient is sent for tests
        fp += 1
    else:
        #a sick patient is no longer missed
        fn -= 1
    ps.append(p1[i])
    costs.append(fp * fp_cost + fn * fn_cost)
In the end, we get a list of probability thresholds and associated costs. Now let us find the minimum cost and its probability of a patient being sick.
costs = np.array(costs)
#find probability of a patient being sick at lowest cost
print(ps[costs.argmin()])
This means the threshold that minimizes our cost for a given classifier is 0.620655. Sara would send all the patients with a probability of being sick greater than or equal to 0.620655 for further tests.
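To try the same idea without Orange installed, here is a self-contained sketch of the threshold search in plain Python. The probabilities and labels are made up (they are not the heart disease data), and the min_cost_threshold helper is a hypothetical name; only the two costs are taken from the example above.

```python
def min_cost_threshold(probs, actual, fp_cost, fn_cost):
    """Return (threshold, cost) minimising total misclassification cost.

    Patients are declared sick in order of decreasing probability.
    """
    fp = 0
    fn = sum(actual)  # threshold above 1: every sick patient is missed
    best_threshold, best_cost = None, fn * fn_cost
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    for i in order:
        if actual[i] == 0:
            fp += 1  # healthy patient sent for tests
        else:
            fn -= 1  # sick patient correctly sent for tests
        cost = fp * fp_cost + fn * fn_cost
        if cost < best_cost:
            best_threshold, best_cost = probs[i], cost
    return best_threshold, best_cost


# Made-up probabilities and labels (1 = sick).
probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
actual = [1, 1, 0, 1, 0, 0]
threshold, cost = min_cost_threshold(probs, actual, fp_cost=500, fn_cost=800)
```

On this toy data it pays to accept one false positive in order to catch the last sick patient, because a false negative costs more than a false positive.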
At the end, we can also plot the cost to patients sent curve.
Some of you might have an issue installing add-ons with the following issue popping up:
xmlrpc.client.Fault: <Fault -32601: 'server error; requested method not found'>
This is the result of the migration to a new infrastructure at PyPI, which provides the installation of add-ons. Our team has rallied to adjust the add-on installer so it works with the new and improved service.
In order to make the add-on installer work (again), please download the latest version of Orange (3.13.0).
We apologize for any inconvenience and wish you a fruitful data analysis in the future.
Have you ever tried Orange with data big enough that some widgets ran for more than a second? Then you have seen it: Orange froze. While the widget was processing, the interface would not respond to any inputs, and there was no way to stop that widget.
Not all the widgets freeze, though! Some widgets, like Test & Score, k-Means, or Image Embedding, do not block. While they are working, we are free to build other parts of the workflow, and these widgets also show their progress. Some, like Image Embedding, which work with lots of images, even allow interruptions.
Why does Orange freeze? Most widgets process users’ actions directly: after an event (click, pressed key, new input data) some code starts running, and until it finishes, the interface cannot respond to any new events. This is a reasonable approach for short tasks, such as making a selection in a Scatter Plot. But with longer tasks, such as building a support vector machine model on big data, Orange becomes unresponsive.
To make Orange responsive while it is processing, we need to start the task in a new thread. As programmers we have to consider the following:
1. Starting the task. We have to make sure that other (older) tasks are not running.
2. Showing results when the task has finished.
3. Periodic communication between the task and the interface for status reports (progress bars) and task stopping.
Starting the task and showing the results are straightforward and well documented in a tutorial for writing widgets. Periodic communication with stopping is harder: it is completely task-dependent and can be either trivial, hard, or even impossible. Periodic communication is, in principle, not essential for responsiveness, but if we do not implement it, we will be unable to stop the running task and progress bars will not work either.
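The three points can be sketched with nothing but the Python standard library. This is a simplified illustration, not widget code: a real widget would hook into the interface's event loop instead of calling join, and long_task and run_in_thread are hypothetical names.

```python
import queue
import threading


def long_task(n_steps, progress, stop_requested):
    """Do n_steps units of work, reporting progress and honouring stop."""
    done = 0
    for step in range(n_steps):
        if stop_requested.is_set():  # periodic check lets us interrupt
            break
        done += 1  # stand-in for one unit of real work
        progress.put(done / n_steps)  # status report for a progress bar
    return done


def run_in_thread(n_steps):
    """Start long_task in a worker thread; return handles to talk to it."""
    progress = queue.Queue()
    stop_requested = threading.Event()
    result = {}

    def worker():
        result["done"] = long_task(n_steps, progress, stop_requested)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread, progress, stop_requested, result


thread, progress, stop_requested, result = run_in_thread(5)
thread.join()  # a real interface would keep handling events instead
reports = []
while not progress.empty():
    reports.append(progress.get())
```

Calling stop_requested.set() from the interface would make the task exit at its next check, which is exactly the periodic communication that is hard to add when the loop lives inside someone else's library.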
Taking care of periodic communication was the hardest part of making the Neural Network widget responsive. It would have been easy, had we implemented neural networks ourselves. But we use the scikit-learn implementation, which does not expose an option to make additional function calls while fitting the network (we need to run code that communicates with the interface). We had to resort to a trick: we modified fitting so that a change to an attribute called n_iter_ calls a function (see pull request). Not the cleanest solution, but it seems to work.
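The trick can be illustrated with a Python property whose setter calls a function. The sketch below uses a toy estimator in place of scikit-learn, and the class names and the callback attribute are hypothetical; it shows the general technique and is not a copy of the code in the pull request.

```python
class IterationCallbackMixin:
    """Turn n_iter_ into a property whose setter calls a callback."""

    callback = None  # hypothetical hook; set it to a function taking an int

    @property
    def n_iter_(self):
        return self._n_iter

    @n_iter_.setter
    def n_iter_(self, value):
        self._n_iter = value
        if self.callback is not None:
            self.callback(value)  # e.g. update a progress bar


class ToyEstimator:
    """Stand-in for a third-party model that updates n_iter_ while fitting."""

    def fit(self, max_iter):
        self.n_iter_ = 0
        for _ in range(max_iter):
            self.n_iter_ += 1  # we cannot add code here, but we can watch it
        return self


class WatchedEstimator(IterationCallbackMixin, ToyEstimator):
    """Toy estimator whose iteration counter reports every change."""


seen = []
model = WatchedEstimator()
model.callback = seen.append
model.fit(3)
```

Because the mixin comes first in the class hierarchy, every assignment to the counter inside fit goes through the setter, so the fitting code itself stays untouched.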
For now, only a few widgets work so that the interface remains responsive. We are still searching for the best way to make existing widgets behave nicely, but responsiveness is now one of our priorities.
We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. Spectroscopy add-on is intended for the analysis of spectral data and it is just as fun as our other add-ons (if not more!).
We will prove it with a simple classification workflow. First, install Spectroscopy add-on from Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!
Use the Datasets widget and load the Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of a cell. A quick glance in a Data Table will give us an idea of what the data looks like. Seems like a very standard spectral data set.
Now we want to determine whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing. Spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying the Cut (keep) filter and retaining only the wave numbers between 1500 and 1800. When we connect it to Test & Score, we need to remember to connect the Preprocessor output of Preprocess Spectra.
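What the Cut (keep) step does to the data can be sketched in plain Python: it keeps only the columns whose wave number falls inside the chosen range. The cut_keep helper, the toy spectra and the choice of inclusive bounds are assumptions made for illustration, not the add-on's implementation.

```python
def cut_keep(wavenumbers, spectra, low, high):
    """Keep only the columns whose wavenumber lies in [low, high]."""
    keep = [i for i, w in enumerate(wavenumbers) if low <= w <= high]
    kept_wavenumbers = [wavenumbers[i] for i in keep]
    kept_spectra = [[row[i] for i in keep] for row in spectra]
    return kept_wavenumbers, kept_spectra


# Made-up wavenumbers and absorbances for two spectra.
wavenumbers = [1400.0, 1500.0, 1650.0, 1800.0, 1900.0]
spectra = [
    [0.10, 0.20, 0.90, 0.30, 0.05],
    [0.12, 0.25, 0.70, 0.35, 0.04],
]
kept_w, kept_x = cut_keep(wavenumbers, spectra, 1500, 1800)
```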
Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?
Confusion Matrix gives us a detailed picture. Our model fails almost exclusively on DNA cell type. Interesting.
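A confusion matrix is just a count of (actual, predicted) pairs, which is why it can reveal a weak class that a single overall score hides. Here is a small sketch with made-up labels; the class name other and both helper functions are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def per_class_errors(actual, predicted):
    """Number of misclassified instances for each actual class."""
    errors = Counter()
    for a, p in zip(actual, predicted):
        if a != p:
            errors[a] += 1
    return errors


# Made-up labels: the model is right on "other" but misses most "DNA".
actual = ["DNA", "DNA", "DNA", "other", "other", "other"]
predicted = ["DNA", "other", "other", "other", "other", "other"]
matrix = confusion_matrix(actual, predicted)
errors = per_class_errors(actual, predicted)
```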
We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?
This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, learner (LR) as learner and preprocessor as preprocessor directly to Test & Score to avoid overfitting. Play around with Spectroscopy add-on and let us know what you think! 🙂