Miniconda Installer

Orange has a new friend! It’s Miniconda, Anaconda’s little sister.

 

For a long time, the idea was to utilize the friendly nature of Miniconda to install Orange dependencies, which often misbehaved on some platforms. Miniconda provides Orange with Python 3.6 and conda installer, which is then used to handle everything Orange needs for proper functioning. So sssssss-mooth!

Miniconda Installer

Please know that our Miniconda installer is in a beta state, but we are inviting adventurous testers to try it and report any bugs they find to our issue tracker [there won’t be any of course! ūüėČ ].

 

Happy testing! ūüźć|ūüćä

 

 

Text Preprocessing

In data mining, preprocessing is key. And in text mining, it is the key and the door. In other words, it’s the most vital step in the analysis.

Related: Text Mining add-on

So what does preprocessing do? Let’s have a look at an example. Place Corpus widget from Text add-on on the canvas. Open it and load Grimm-tales-selected. As always, first have a quick glance of the data in Corpus Viewer. This data set contains 44 selected Grimms’ tales.

Now, let us see the most frequent words of this corpus in a Word Cloud.

Ugh, what a mess! The most frequent words in these texts are conjunctions (‘and’, ‘or’) and prepositions (‘in’, ‘of’), but so they are in almost every English text in the world. We need to remove these frequent and uninteresting words to get to the interesting part. We remove the punctuation by defining our tokens. Regexp \w+ will keep full words and omit everything else. Next, we filter out the uninteresting words with a list of stopwords. The list is pre-set by nltk package and contains frequently occurring conjunctions, prepositions, pronouns, adverbs and so on.

Ok, we did some essential preprocessing. Now let us observe the results.

This does look much better than before! Still, we could be a bit more precise. How about removing the words could, would, should and perhaps even said, since it doesn’t say much about the content of the tale? A custom list of stopwords would come in handy!

Open a plain text editor, such as Notepad++ or Sublime, and place each word you wish to filter on a separate line.

Save the file and load it next to the pre-set stopword list.

One final check in the Word Cloud should reveal we did a nice job preparing our data. We can now see the tales talk about kings, mothers, fathers, foxes and something that is little. Much more informative!

Related: Workshop: Text Analysis for Social Scientists

Workshop: Text Analysis for Social Scientists

Yesterday was no ordinary day at the Faculty of Computer and Information Science, University of Ljubljana – there was an unusually high proportion of Social Sciences students, researchers and other professionals in our classrooms. It was all because of a Text Analysis for Social Scientists workshop.

Related: Data Mining for Political Scientists

Text mining is becoming a popular method across sciences and it was time to showcase what it (and Orange) can do. In this 5-hour hands-on workshop we explained text preprocessing, clustering, and predictive models, and applied them in the analysis of selected Grimm’s Tales. We discovered that predictive models can nicely distinguish between animal tales and tales of magic and that foxes and kings play a particularly important role in separating between the two types.

Nomogram displays 6 most important words (attributes) as defined by Logistic Regression. Seems like the occurrence of the word ‘fox’ can tell us a lot about whether the text is an animal tale or a tale of magic.

Related: Nomogram

The second part of the workshop was dedicated to the analysis of tweets – we learned how to work with thousands of tweets on a personal computer, we plotted them on a map by geolocation, and used Instagram images for image clustering.

Related: Image Analytics: Clustering

Five hours was very little time to cover all the interesting topics in text analytics. But Orange came to the rescue once again. Interactive visualization and the possibility of close reading in Corpus Viewer were such a great help! Instead of reading 6400 tweets ‘by hand’, now the workshop participants can cluster them in interesting groups, find important words in each cluster and plot them in a 2D visualization.

Participants at work.

Here, we’d like to thank NumFocus for providing financial support for the course. This enabled us to bring in students from a wide variety of fields (linguists, geographers, marketers) and prove (once again) that you don’t have to be a computer scientists to do machine learning!

 

Nomogram

One more exciting visualization has been introduced to Orange Рa Nomogram. In general, nomograms are graphical devices that can approximate the calculation of some function. A Nomogram widget in Orange visualizes Logistic Regression and Naive Bayes classification models, and compute the class probabilities given a set of attributes values. In the nomogram, we can check how changing of the attribute values affect the class probabilities, and since the widget (like widgets in Orange) is interactive, we can do this on the fly.

So, how does it work? First, feed the Nomogram a classification model, say, Logistic Regression. We will use the Titanic survival data that comes with Orange for this example (in File widget, choose “Browse documentation datasets”).

In the nomogram, we see the top ranked attributes and how much they contribute to the target class. Seems like a male third class adult had a much lower survival rate than did female first class child.

The first box show the target class, in our case survived=no. The second box shows the most important attribute, sex, and its contribution to the probability of the target class (more for male, almost 0 for female). The final box shows the total probability of the target class for the selected values of attributes (blue dots).

The most important attribute, however, seems to be ‘sex’, where the chance for survival (target class = no) is lower for males than it is for females. How do I know? Grab the blue dot over the attribute and drag it from ‘male’ to ‘female’. The total probability for dying on Titanic (survived=no) drops from 89% to 43%.

The same goes for all the other attributes – you can interactively explore how much a certain value contributes to the probability of a selected target class.

But it gets even better! Instead of dragging the blue dots in the nomogram, you can feed it the data. In the workflow below, we pass the data through the Data Table widget and then feed the selected data instance to the Nomogram. The Nomogram would then show what is the probability of the target class for this particular instance, and it would “explain” what are the magnitudes of contributions of individual attribute values.

This makes Nomogram a great widget for understanding the model and for interactive data exploration.

Outliers in Traffic Signs

Say I am given a collection of images of traffic signs, and would like to find which signs stick out. That is, which traffic signs look substantially different from the others. I would assume that the traffic signs are not equally important and that some were designed to be noted before the others.

I have assembled a small set of regulatory and warning traffic signs and stored the references to their images in a traffic-signs-w.tab data set.

Related: Viewing images

Related: Video on image clustering

Related: Video on image classification

The easiest way to display the images is by loading this data file with File widget and then passing the data to the Image Viewer,

Opening the Image Viewer allows me to see the images:

Note that initially the data table we have loaded contains no valuable features on which we can do any machine learning. It includes just a category of traffic sign, its name, and the link to its image.

We will use deep-network embedding to turn these images into numbers to describe them with 2048 real-valued features. Then, we will use Silhouette Plot to find which traffic signs are outliers in their own group. We would like to select these and visualize them in the Image Viewer.

Related: All I see is silhouette

Our final workflow, with selection of three biggest outliers (we used shift-click to select its corresponding silhouettes in the Silhouette Plot), is:

Isn’t this great? Turns out that traffic signs were carefully designed, such that the three outliers are indeed the signs we should never miss. It is great that we can now reconfirm this design choice by deep learning-based embedding and by using some straightforward machine learning tricks such as Silhouette Plot.

Model replaces Classify and Regression

Did you recently wonder where did Classification Tree go? Or what happened to Majority?

Orange 3.4.0 introduced a new widget category, Model, which now contains all supervised learning algorithms in one place and replaces the separate Classify and Regression categories.

    

This, however, was not a mere cosmetic change to the widget hierarchy. We wanted to simplify the interface for new users and make finding an appropriate learning algorithm easier. Moreover, now you can reuse some workflows on different data sets, say housing.tab and iris.tab!

Leading up to this change, many algorithms were refactored so that regression and classification versions of the same method were merged into a single widget (and class in the underlying python API). For example, Classification Tree and Regression Tree have become simply Tree, which is capable of modelling categorical or numeric target variables. And similarly for SVM, kNN, Random Forest, …

Have you ever searched for a widget by typing its name and were confused by multiple options appearing in the search box? Now you do not need to decide if you need Classification SVM or Regression SVM, you can just select SVM and enjoy the rest of the time doing actual data analysis!

 

Here is a quick wrap-up:

  • Majority and Mean became Constant.
  • Classification Tree and Regression Tree became Tree. In the same manner, Random Forest and Regression Forest became Random Forest.
  • SVM, SGD, AdaBoost and kNN now work for both classification and regression tasks.
  • Linear Regression only works for regression.
  • Logistic Regression, Naive Bayes and CN2 Rule Induction only work for classification.

Sorry about the last part, we really couldn‚Äôt do anything about the very nature of these algorithms! ūüôā

 
      

Image Analytics: Clustering

Data does not always come in a nice tabular form. It can also be a collection of text, audio recordings, video materials or even images. However, computers can only work with numbers, so for any data mining, we need to transform such unstructured data into a vector representation.

For retrieving numbers from unstructured data, Orange can use deep network embedders. We have just started to include various embedders in Orange, and for now, they are available for text and images.

Related: Video on image clustering

Here, we give an example of image embedding and show how easy is to use it in Orange. Technically, Orange would¬†send the image to the server, where the server would push an image through a pre-trained deep neural network, like Google’s Inception v3. Deep networks were most often trained with some special purpose in mind. Inception v3, for instance, can classify images into any of 1000 image classes. We can disregard the classification, consider instead the penultimate layer of the network with 2048 nodes (numbers) and use that for image’s vector-based representation.

Let’s see this on an example.

Here we have 19 images of domestic animals. First, download the images and unzip them. Then use Import Images widget from Orange’s Image Analytics add-on and open the directory containing the images.

We can visualize images in Image Viewer widget. Here is our workflow so far, with images shown in Image Viewer:

But what do we see in a data table? Only some useless description of images (file name, the location of the file, its size, and the image width and height).

This cannot help us with machine learning. As I said before, we need numbers. To acquire numerical representation of these images, we will send the images to Image Embedding widget.

Great! Now we have the numbers we wanted. There are 2048 of them (columns n0 to n2047). From now on, we can apply all the standard machine learning techniques, say, clustering.

Let us measure the distance between these images and see which are the most similar. We used Distances widget to measure the distance. Normally, cosine distance works best for images, but you can experiment on your own. Then we passed the distance matrix to Hierarchical Clustering to visualize similar pairs in a dendrogram.

This looks very promising! All the right animals are grouped together. But I can’t see the results so well in the dendrogram. I want to see the images – with Image Viewer!

So cool! All the cow family is grouped together! Now we can click on different branches of the dendrogram and observe which animals belong to which group.

But I know what you are going to say. You are going to say I am cheating. That I intentionally selected similar images to trick you.

I will prove you wrong. I will take a new cow, say, the most famous cow in Europe – Milka cow.

This image is quite different from the other images – it doesn’t have a white background, it’s a real (yet photoshopped) photo and the cow is facing us. Will the Image Embedding find the right numerical representation for this cow?

Indeed it has. Milka is nicely put together with all the other cows.

Image analytics is such an exciting field in machine learning and now Orange is a part of it too! You need to install the Image Analytics add on and you are all set for your research!

k-Means & Silhouette Score

k-Means is one of the most popular unsupervised learning algorithms for finding interesting groups in our data. It can be useful in customer segmentation, finding gene families, determining document types, improving human resource management and so on.

But… have you ever wondered how k-means works? In the following three videos we explain how to construct a data analysis workflow using k-means, how k-means works, how to find a good k value and how silhouette score can help us find the inliers and the outliers.

 

#1 Constructing workflow with k-means

#2 How k-means works [interactive visualization]

#3 How silhouette score works and why it is useful

Why Orange?

Why is Orange so great? Because it helps people solve problems quickly and efficiently.

SaŇ°o Jakljevińć, a former student of the Faculty of Computer and Information Science at University of Ljubljana, created the following motivational videos for his graduation thesis. He used two belowed datasets, iris and zoo, to showcase how to tackle real-life problems with Orange.

Workshop on InfraOrange

Thanks to the collaboration with synchrotrons Elettra (Trieste) and Soleil (Paris), Orange is getting an add-on InfraOrange, with widgets for analysis of infrared spectra. Its primary users obviously come from these two institutions, hence we organized the first workshop for InfraOrange at one of them.

Some 20 participants spent the first day of the workshop in Trieste learning the basics of Orange and its use for data mining. With Janez at the helm and Marko assisting in the back, we traversed the standard list of visual and statistical techniques and a bit of unsupervised and supervised learning. The workshop was perhaps a bit unusual as most attendees were already quite familiar with these methods, but most haven’t yet used them in such an interactive fashion.

Marko explaining how to analyze spectral data with Orange.

 

On the second day Marko and Andrej took over and focused on the analysis of spectral data. We demonstrated the use of widgets specifically developed for infrared data and used them with data mining techniques we covered on the first day. After lunch the attendees tried to work on their own data sets, which was a real test for InfraOrange.

Orange for spectral data.

 

Group photo!

 

We now have a lot of realistic feedback on what to improve. There is a lot of work to do still, but a week after the workshop the most often occurring bugs have already been fixed.

The future of InfraOrange looks bright and…. khm… well, colorful! ūüôā