Some of you might have an issue installing add-ons with the following issue popping up:
xmlrpc.client.Fault: <Fault -32601: 'server error; requested method not found'>
This is the result of the migration to a new infrastructure at PyPi, which provides the installation of add-ons. Our team has rallied to adjust the add-on installer so it works with the new and improved service.
In order to make the add-on installer work (again), please download the latest version of Orange (3.13.0).
We apologize for any inconvenience and wish you a fruitful data analysis in the future.
We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. Spectroscopy add-on is intended for the analysis of spectral data and it is just as fun as our other add-ons (if not more!).
We will prove it with a simple classification workflow. First, install Spectroscopy add-on from Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!
Use Datasets widget and load Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of a cell. A quick glance in a Data Table will give us an idea how the data looks like. Seems like a very standard spectral data set.
Now we want to determine, whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing. Spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying Cut (keep) filter and retaining only the wave numbers between 1500 and 1800. When we connect it to Test & Score, we need to keep in mind to connect the Preprocessor output of Preprocess Spectra.
Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?
Confusion Matrix gives us a detailed picture. Our model fails almost exclusively on DNA cell type. Interesting.
We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?
This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, learner (LR) as learner and preprocessor as preprocessor directly to Test & Score to avoid overfitting. Play around with Spectroscopy add-on and let us know what you think! 🙂
A lot of you have been interested in enabling SQL widget in Orange, especially regarding the installation of a psycopg backend that makes the widget actually work. This post will be slightly more technical, but I will try to keep it to a minimum. Scroll to the bottom for installation instructions.
The main issue for some people was that despite having installed the psycopg module in their console, the SQL widget still didn’t work. This is because Orange uses a separate virtual environment and most of you installed psycopg in the default (system) Python environment. For psycopg to be recognized in Orange, it needs to be installed in the same virtual environment, which is normally located in C:\Users\<usr>\Anaconda3\envs\orange3 (on Windows). For the installation to work, you’d have to run it with the proper pip, namely:
But there is a much easier way to do it. Head over to psycopg’s pip website and download the latest wheel for your platform. Py version has to be cp34 or higher (latest Orange from Anaconda comes with Python 3.6, so look for cp36).
This week, Primož and I flew to the south of Italy to hold a workshop on Image Analytics through Data Mining at AIUCD 2018 conference. The workshop was intended to familiarize digital humanities researchers with options that visual programming environments offer for image analysis.
In about 5 hours we discussed image embedding, clustering, finding closest neighbors and classification of images. While it is often a challenge to explain complex concepts in such a short time, it is much easier when working with Orange.
One of the workflows we learned at the workshop was the one for finding the most similar image in a set of images. This is better explained with an example.
We had 15 paintings from different authors. Two of them were painted by Claude Monet, a famous French impressionist painter. Our task was, given a reference image of Monet, to find his other painting in a collection.
Once we have described our paintings with vectors (embeddings), we can compare them by similarity. To find the second Monet in a data set, we will have to compute the similarity of paintings and find the one most similar one to our reference painting.
Let us connect Image Embedding to Neighbors from Prototypes add-on. Neighbors widget is specifically intended to find a number of closest neighbors given a reference data point.
We will need to adjust the widget a bit. First, we will need cosine distance, since we will be comparing images by the content, not the magnitude of features. Next, we will tick off Exclude reference, in order to receive the reference image on the output. We do this just for visualization purposes. Finally, we set the number of neighbors to 2. Again, this is just for a nicer visualization, since we know there are only two Monet’s paintings in the data set.
Then we need to give Neighbors a reference image, for which we want to retrieve the neighbors. We do this by adding Data Table to Image Embedding, selecting one of Monet’s paintings in the spreadsheet and then connecting the Data Table to Neighbors. The widget will automatically consider the second input as a reference.
Now, all we need to do is to visualize the output. Connect Image Viewer to Neighbors and open it.
Voila! The widget has indeed found the second Monet’s painting. So useful when you have thousands of images in your archive!
Scatter plots are great! But sometimes, we need to plot more than two variables to truly understand the data. How can we achieve this, knowing humans can only grasp up to three dimensions? With an optimization of linear projection, of course!
Orange recently re-introduced FreeViz, an interactive visualization for plotting multiple variables on a 2-D plane.
Let’s load zoo.tab data with File widget and connect FreeViz to it. Zoo data has 16 features describing animals of different types – mammals, amphibians, insects and so on. We would like to use FreeViz to show us informative features and create a visualization that separates well between animal types.
We start with un-optimized projection, where data points are scattered around features axes. Once we click Optimize, we can observe optimization process in real-time and at the end see the optimized projection.
This projection is much more informative. Mammals are nicely grouped together within a pink cluster that is characterized by hair, milk, and toothed features. Conversely, birds are charaterized by eggs, feathers and airborne, while fish are aquatic. Results are as expected, which means optimization indeed found informative features for each class value.
Since we are working with categorical class values, we can tick Show class density to color the plot by majority class values. We can also move anchors around to see how data points change in relation to a selected anchor.
Finally, as in most Orange visualizations, we can select a subset of data points and explore them further. For example, let us observe which amphibians are characterized by being aquatic in a Data Table. A newt, a toad and two types of frogs, one venomous and one not.
Data exploration is always much easier with clever visualizations!
We all know that sometimes many is better than few. Therefore we are happy to introduce the Stack widget. It is available in Prototypes add-on for now.
Stacking enables you to combine several trained models into one meta model and use it in Test&Score just like any other model. This comes in handy with complex problems, where one classifier might fail, but many could come up with something that works. Let’s see an example.
We start with something as complex as this. We used Paint Data to create a complex data set, where classes somewhat overlap. This is naturally an artificial example, but you can try the same on your own, real life data.
Then we add several kNN models with different parameters, say 5, 10 and 15 neighbors. We connect them to Test&Score and use cross validation to evaluate their performance. Not bad, but can we do even better?
Let us try stacking. We will connect all three classifiers to the Stacking widget and use Logistic Regression as an aggregate, a method that aggregates the three models into a single meta model. Then we connect connect the stacked model into Test&Score and see whether our scores improved.
And indeed they have. It might not be anything dramatic, but in real life, say medical context, even small improvements count. Now go and try the procedure on your own data. In Orange, this requires only a couple of minutes.
On Monday we finished the second part of the workshop for the Statistical Office of Republic of Slovenia. The crowd was tough – these guys knew their numbers and asked many challenging questions. And we loved it!
One thing we discussed was how to properly test your model. Ok, we know never to test on the same data you’ve built your model with, but even training and testing on separate data is sometimes not enough. Say I’ve tested Naive Bayes, Logistic Regression and Tree. Sure, I can select the one that gives the best performance, but we could potentially (over)fit our model, too.
To account for this, we would normally split the data to 3 parts:
training data for building a model
validation data for testing which parameters and which model to use
test data for estmating the accurracy of the model
Let us try this in Orange. Load heart-disease.tab data set from Browse documentation data sets in File widget. We have 303 patients diagnosed with blood vessel narrowing (1) or diagnosed as healthy (0).
Now, we will split the data into two parts, 85% of data for training and 15% for testing. We will send the first 85% onwards to build a model.
We will use Naive Bayes, Logistic Regression and Tree, but you can try other models, too. This is also a place and time to try different parameters. Now we will send the models to Test & Score. We used cross-validation and discovered Logistic Regression scores the highest AUC. Say this is the model and parameters we want to go with.
Now it is time to bring in our test data (the remaining 15%) for testing. Connect Data Sampler to Test & Score once again and set the connection Remaining Data – Test Data.
Test & Score will warn us we have test data present, but unused. Select Test on test data option and observe the results. These are now the proper scores for our models.
Seems like LogReg still performs well. Such procedure would normally be useful when testing a lot of models with different parameters (say +100), which you would not normally do in Orange. But it’s good to know how to do the scoring properly. Now we’re off to report on the results in Nature… 😉
We’ve been having a blast with recent Orange workshops. While Blaž was getting tanned in India, Anže and I went to the charming Liverpool to hold a session for business school professors on how to teach business with Orange.
Obviously, when we say teach business, we mean how to do data mining for business, say predict churn or employee attrition, segment customers, find which items to recommend in an online store and track brand sentiment with text analysis.
For this purpose, we have made some updates to our Associate add-on and added a new data set to Data Sets widget which can be used for customer segmentation and discovering which item groups are frequently bought together. Like this:
We load the Online Retail data set.
Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute distances between items (rows). We could also simply ask Distances widget to compute distances between columns instead of rows. Then we send the transposed data table to Distances and compute cosine distance between items (cosine distance will only tell us, which items are purchased together, disregarding the amount of items purchased).
Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration. 🙂
The second workshop was our standard Introduction to Data Mining for Ministry of Public Affairs.
This group, similar to the one from India, was a pack of curious individuals who asked many interesting questions and were not shy to challenge us. How does a Tree know which attribute to split by? Is Tree better than Naive Bayes? Or is perhaps Logistic Regression better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!
Workshops are always fun, when you have a curious set of individuals who demand answers! 🙂
We know you’ve missed it. We’ve been getting many requests to bring back Neural Network widget, but we also had many reservations about it.
Neural networks are powerful and great, but to do them right is not straight-forward. And to do them right in the context of a GUI-based visual programming tool like Orange is a twisted double helix of a roller coaster.
Do we make each layer a widget and then stack them? Do we use parallel processing or try to do something server-side? Theano or Keras? Tensorflow perhaps?
We were so determined to do things properly, that after the n-th iteration we still had no clue what to actually do.
Then one day a silly novice programmer (a.k.a. me) had enough and just threw scikit-learn’s Multi-layer Perceptron model into a widget and called it a day. There you go. A Neural Network widget just like it was in Orange2 – a wrapper for a scikit’s function that works out-of-the-box. Nothing fancy, nothing powerful, but it does its job. It models things and it predicts things.
Our streak of workshops continues. This time we taught professionals from public administration how they can leverage data analytics and machine learning to retrieve interesting information from surveys. Thanks to the Ministry of Public Administration, this is only the first in a line of workshops on data science we are preparing for public sector employees.
For this purpose, we have designed EnKlik Anketa widget, which you can find in Prototypes add-on. The widget reads data from a Slovenian online survey service OneClick Survey and imports the results directly into Orange.
We have prepared a test survey, which you can import by entering a public link to data into the widget. Here’s the link: https://www.1ka.si/podatki/141025/72F5B3CC/ . Copy it into the Public link URL line in the widget. Once you press Enter, the widget loads the data and displays retrieved features, just like the File widget.
The survey is in Slovenian, but we can use Edit Domain to turn feature names into English equivalent.
As always, we can check the data in a Data Table. We have 41 respondents and 7 questions. Each respondent chose a nickname, which makes it easier to browse the data.
Now we can perform familiar clustering to uncover interesting groups in our data. Connect Distances to Edit Domain and Hierarchical Clustering to Distances.
We have two outliers, Pipi and Chad. One is an excessive sportsman (100 h of sport per week) and the other terminally ill (general health -1). Or perhaps they both simply didn’t fill out the survey correctly. If we use the Data Table to filter out Pipi and Chad, we get a fairly good clustering.
We can use Box Plot, to observe what makes each cluster special. Connect Box Plot to Hierarchical Clustering (with the two groups selected), select grouping by Cluster and tick Order by relevance.
Seems like our second cluster (C2) is the sporty one. If we are serving in the public administration, perhaps we can design initiatives targeting cluster C1 to do more sports. It is so easy to analyze the data in Orange!