Workshop: Text Analysis for Social Scientists

Yesterday was no ordinary day at the Faculty of Computer and Information Science, University of Ljubljana – there was an unusually high proportion of Social Sciences students, researchers and other professionals in our classrooms. It was all because of a Text Analysis for Social Scientists workshop.

Related: Data Mining for Political Scientists

Text mining is becoming a popular method across the sciences, and it was time to showcase what it (and Orange) can do. In this 5-hour hands-on workshop we explained text preprocessing, clustering, and predictive models, and applied them to the analysis of selected Grimm’s tales. We discovered that predictive models can nicely distinguish between animal tales and tales of magic, and that foxes and kings play a particularly important role in separating the two types.

The Nomogram displays the six most important words (attributes) as ranked by Logistic Regression. It seems the occurrence of the word ‘fox’ can tell us a lot about whether a text is an animal tale or a tale of magic.

Related: Nomogram

The second part of the workshop was dedicated to the analysis of tweets – we learned how to work with thousands of tweets on a personal computer, we plotted them on a map by geolocation, and used Instagram images for image clustering.

Related: Image Analytics: Clustering

Five hours was very little time to cover all the interesting topics in text analytics. But Orange came to the rescue once again. Interactive visualization and the possibility of close reading in the Corpus Viewer were a great help! Instead of reading 6400 tweets ‘by hand’, the workshop participants can now cluster them into interesting groups, find important words in each cluster, and plot them in a 2D visualization.
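For those who want to replicate the tweet analysis in a script rather than in Orange, here is a minimal sketch of the same idea – clustering short texts and listing the most important words per cluster. It uses scikit-learn instead of Orange’s Text add-on, and the four example tweets are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up stand-ins for the 6400 workshop tweets
tweets = [
    "flooded roads and traffic chaos after the storm",
    "storm warning issued, traffic jams everywhere",
    "sunny beach holiday with the family",
    "enjoying the sun and the sea on holiday",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)   # bag-of-words with TF-IDF weights
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# the highest-weighted centroid terms serve as each cluster's "important words"
terms = np.array(vectorizer.get_feature_names_out())
for c in range(kmeans.n_clusters):
    top = np.argsort(kmeans.cluster_centers_[c])[::-1][:3]
    print(c, terms[top])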

Participants at work.

Here, we’d like to thank NumFOCUS for providing financial support for the course. This enabled us to bring in students from a wide variety of fields (linguists, geographers, marketers) and prove (once again) that you don’t have to be a computer scientist to do machine learning!


Nomogram

One more exciting visualization has been introduced to Orange – a Nomogram. In general, nomograms are graphical devices that approximate the calculation of some function. The Nomogram widget in Orange visualizes Logistic Regression and Naive Bayes classification models and computes the class probabilities for a given set of attribute values. In the nomogram, we can check how changing the attribute values affects the class probabilities, and since the widget (like all widgets in Orange) is interactive, we can do this on the fly.

So, how does it work? First, feed the Nomogram a classification model, say, Logistic Regression. We will use the Titanic survival data that comes with Orange for this example (in the File widget, choose “Browse documentation datasets”).

In the nomogram, we see the top-ranked attributes and how much they contribute to the target class. It seems that an adult male traveling third class had a much lower chance of survival than a female first-class child.

The first box shows the target class, in our case survived=no. The second box shows the most important attribute, sex, and its contribution to the probability of the target class (high for male, almost 0 for female). The final box shows the total probability of the target class for the selected attribute values (blue dots).

The most important attribute seems to be ‘sex’: the chance of survival is much lower for males than for females. How do I know? Grab the blue dot over the attribute and drag it from ‘male’ to ‘female’. The total probability of dying on the Titanic (survived=no) drops from 89% to 43%.

The same goes for all the other attributes – you can interactively explore how much a certain value contributes to the probability of a selected target class.
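Under the hood, the arithmetic is simple: for logistic regression, the nomogram adds up the log-odds contributions of the selected attribute values and pushes the sum through the sigmoid. Here is a minimal sketch of that calculation – the intercept and coefficients below are made up for illustration, not the actual Titanic model:

import numpy as np

def class_probability(intercept, contributions):
    # probability of the target class = sigmoid of the summed log-odds
    z = intercept + sum(contributions)
    return 1 / (1 + np.exp(-z))

# hypothetical log-odds contributions toward survived=no
male = {"sex=male": 2.0, "status=third": 0.8, "age=adult": 0.5}
female = {"sex=female": -0.6, "status=third": 0.8, "age=adult": 0.5}

print(class_probability(-1.0, male.values()))    # high chance of not surviving
print(class_probability(-1.0, female.values()))  # considerably lower

Dragging a blue dot in the widget corresponds to swapping one contribution in this sum.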

But it gets even better! Instead of dragging the blue dots in the nomogram, you can feed it the data. In the workflow below, we pass the data through the Data Table widget and feed a selected data instance to the Nomogram. The Nomogram then shows the probability of the target class for that particular instance and “explains” the magnitude of each attribute value’s contribution.

This makes Nomogram a great widget for understanding the model and for interactive data exploration.

Outliers in Traffic Signs

Say I am given a collection of images of traffic signs and would like to find which signs stick out – that is, which traffic signs look substantially different from the others. I would assume that traffic signs are not all equally important and that some were designed to be noticed before others.

I have assembled a small set of regulatory and warning traffic signs and stored the references to their images in a traffic-signs-w.tab data set.

Related: Viewing images

Related: Video on image clustering

Related: Video on image classification

The easiest way to display the images is to load this data file with the File widget and pass the data to the Image Viewer.

Opening the Image Viewer allows me to see the images:

Note that the data table we have loaded initially contains no features on which we could do any machine learning. It includes just the category of each traffic sign, its name, and a link to its image.

We will use a deep-network embedding to turn these images into numbers, describing each image with 2048 real-valued features. Then we will use the Silhouette Plot to find which traffic signs are outliers within their own group. We would like to select these and visualize them in the Image Viewer.

Related: All I see is silhouette
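The same per-image silhouette idea can be sketched in a few lines of scikit-learn. The random matrix below merely stands in for the 2048-dimensional embeddings that the Image Embedding step would produce:

import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
# stand-in for the embedded traffic signs: two groups of ten images each
embeddings = np.vstack([rng.normal(0, 1, (10, 2048)),
                        rng.normal(3, 1, (10, 2048))])
labels = np.array(["regulatory"] * 10 + ["warning"] * 10)

# a low silhouette means the image fits its own group poorly, i.e. an outlier
scores = silhouette_samples(embeddings, labels, metric="cosine")
print(labels[np.argsort(scores)[:3]], np.sort(scores)[:3])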

Our final workflow, with the three biggest outliers selected (we shift-clicked their corresponding silhouettes in the Silhouette Plot), is:

Isn’t this great? It turns out that traffic signs were carefully designed, such that the three outliers are indeed the signs we should never miss. It is great that we can now reconfirm this design choice with deep learning-based embeddings and some straightforward machine learning tricks such as the Silhouette Plot.

Model replaces Classify and Regression

Have you recently wondered where Classification Tree went? Or what happened to Majority?

Orange 3.4.0 introduced a new widget category, Model, which now contains all supervised learning algorithms in one place and replaces the separate Classify and Regression categories.


This, however, was not a mere cosmetic change to the widget hierarchy. We wanted to simplify the interface for new users and make finding an appropriate learning algorithm easier. Moreover, now you can reuse some workflows on different data sets, say housing.tab and iris.tab!

Leading up to this change, many algorithms were refactored so that the regression and classification versions of the same method were merged into a single widget (and a single class in the underlying Python API). For example, Classification Tree and Regression Tree have become simply Tree, which can model categorical or numeric target variables. Similarly for SVM, kNN, Random Forest, …
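The merged classes can be used in scripts as well. Here is a minimal sketch, assuming Orange 3.4 or later, where the single TreeLearner from the Orange.modelling module fits both a categorical and a numeric target:

import Orange
from Orange.modelling import TreeLearner

iris = Orange.data.Table("iris")        # categorical class: classification
housing = Orange.data.Table("housing")  # numeric class: regression

tree = TreeLearner()
print(tree(iris))     # trains a classification tree
print(tree(housing))  # trains a regression tree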

Have you ever searched for a widget by typing its name, only to be confused by the multiple options appearing in the search box? Now you do not need to decide between Classification SVM and Regression SVM – you can just select SVM and spend the rest of your time doing actual data analysis!


Here is a quick wrap-up:

  • Majority and Mean became Constant.
  • Classification Tree and Regression Tree became Tree. In the same manner, Random Forest and Regression Forest became Random Forest.
  • SVM, SGD, AdaBoost and kNN now work for both classification and regression tasks.
  • Linear Regression only works for regression.
  • Logistic Regression, Naive Bayes and CN2 Rule Induction only work for classification.

Sorry about the last part, we really couldn’t do anything about the very nature of these algorithms! 🙂


Image Analytics: Clustering

Data does not always come in a nice tabular form. It can also be a collection of text, audio recordings, video materials or even images. However, computers can only work with numbers, so for any data mining, we need to transform such unstructured data into a vector representation.

For retrieving numbers from unstructured data, Orange can use deep network embedders. We have just started to include various embedders in Orange, and for now, they are available for text and images.

Related: Video on image clustering

Here, we give an example of image embedding and show how easy it is to use in Orange. Technically, Orange sends the image to a server, where the server pushes the image through a pre-trained deep neural network, such as Google’s Inception v3. Deep networks are most often trained for some special purpose; Inception v3, for instance, can classify images into any of 1000 image classes. We can disregard the classification, consider instead the penultimate layer of the network with 2048 nodes (numbers), and use that as the image’s vector-based representation.
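The essence of this trick can be sketched with a pre-trained Inception v3 from Keras: cut off the classification head and keep the 2048-dimensional pooled layer. This is only an illustration of the idea, not Orange’s actual server code, and cow.jpg is a placeholder file name:

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False drops the 1000-class softmax; global average pooling
# leaves the 2048-number penultimate representation
model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("cow.jpg", target_size=(299, 299))  # placeholder image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
embedding = model.predict(x)  # shape (1, 2048)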

Let’s see this on an example.

Here we have 19 images of domestic animals. First, download the images and unzip them. Then use the Import Images widget from Orange’s Image Analytics add-on and open the directory containing the images.

We can visualize the images in the Image Viewer widget. Here is our workflow so far, with the images shown in Image Viewer:

But what do we see in the data table? Only some descriptions of the images that are useless for mining (file name, the location of the file, its size, and the image width and height).

This cannot help us with machine learning. As I said before, we need numbers. To acquire a numerical representation of these images, we will send them to the Image Embedding widget.

Great! Now we have the numbers we wanted. There are 2048 of them (columns n0 to n2047). From now on, we can apply all the standard machine learning techniques, say, clustering.

Let us measure the distance between these images and see which are the most similar. We used the Distances widget to compute the distances. Cosine distance normally works best for images, but you can experiment on your own. Then we passed the distance matrix to Hierarchical Clustering to visualize similar pairs in a dendrogram.
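In script form, the same pipeline is a couple of SciPy calls. The random matrix stands in for the real embeddings, and average linkage is just one reasonable choice:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(19, 2048))  # stand-in for the 19 animal embeddings

distances = pdist(embeddings, metric="cosine")  # cosine distance, as in the Distances widget
Z = linkage(distances, method="average")
dendrogram(Z)
plt.show()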

This looks very promising! All the right animals are grouped together. But I can’t see the results so well in the dendrogram. I want to see the images – with Image Viewer!

So cool! The entire cow family is grouped together! Now we can click on different branches of the dendrogram and observe which animals belong to which group.

But I know what you are going to say. You are going to say I am cheating. That I intentionally selected similar images to trick you.

I will prove you wrong. I will take a new cow – say, the most famous cow in Europe, the Milka cow.

This image is quite different from the other images – it doesn’t have a white background, it’s a real (yet photoshopped) photo and the cow is facing us. Will the Image Embedding find the right numerical representation for this cow?

Indeed it does. Milka is nicely grouped with all the other cows.

Image analytics is such an exciting field in machine learning and now Orange is a part of it too! You just need to install the Image Analytics add-on and you are all set for your research!

k-Means & Silhouette Score

k-Means is one of the most popular unsupervised learning algorithms for finding interesting groups in our data. It can be useful in customer segmentation, finding gene families, determining document types, improving human resource management and so on.

But… have you ever wondered how k-means works? In the following three videos we explain how to construct a data analysis workflow with k-means, how k-means works, how to find a good value of k, and how the silhouette score can help us find the inliers and the outliers (a short script sketch of the k-selection idea follows the videos).


#1 Constructing workflow with k-means

#2 How k-means works [interactive visualization]

#3 How silhouette score works and why it is useful
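As promised, here is a minimal sketch of choosing k with the silhouette score, using scikit-learn and the iris data: cluster with several values of k and keep the one with the highest average silhouette.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# try several k values; a higher average silhouette means better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))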

Why Orange?

Why is Orange so great? Because it helps people solve problems quickly and efficiently.

Sašo Jakljevič, a former student of the Faculty of Computer and Information Science at the University of Ljubljana, created the following motivational videos for his graduation thesis. He used two beloved datasets, iris and zoo, to showcase how to tackle real-life problems with Orange.

Workshop on InfraOrange

Thanks to our collaboration with the synchrotrons Elettra (Trieste) and Soleil (Paris), Orange is getting an add-on called InfraOrange, with widgets for the analysis of infrared spectra. Its primary users naturally come from these two institutions, so we organized the first InfraOrange workshop at one of them.

Some 20 participants spent the first day of the workshop in Trieste learning the basics of Orange and its use for data mining. With Janez at the helm and Marko assisting in the back, we traversed the standard list of visual and statistical techniques and a bit of unsupervised and supervised learning. The workshop was perhaps a bit unusual, as most attendees were already quite familiar with these methods but had not yet used them in such an interactive fashion.

Marko explaining how to analyze spectral data with Orange.


On the second day, Marko and Andrej took over and focused on the analysis of spectral data. We demonstrated the widgets developed specifically for infrared data and combined them with the data mining techniques covered on the first day. After lunch the attendees tried the tools on their own data sets, which was a real test for InfraOrange.

Orange for spectral data.


Group photo!


We now have a lot of realistic feedback on what to improve. There is still a lot of work to do, but a week after the workshop, the most frequently occurring bugs have already been fixed.

The future of InfraOrange looks bright and… ahem… well, colorful! 🙂

Orange Workshops: Luxembourg, Pavia, Ljubljana

February was a month of Orange workshops.

Ljubljana: Biologists

We (Tomaž, Martin and I) started in Ljubljana with a hands-on course for the COST Action FA1405 Systems Biology Training School. This was a four-hour workshop with an introduction to classification and clustering, followed by the application of machine learning to the analysis of gene expression data from the plant Arabidopsis. Organizing this course even inspired the creation of a new widget, GOMapMan Ontology, which was added to the Bioinformatics add-on. We also experimented with workflows that combine gene expressions and images of mutants. The idea was to find genes with similar expression profiles and then show images of the plants in which these genes stood out.

Luxembourg: Statisticians

This workshop took place at STATEC, Luxembourg’s National Institute of Statistics and Economic Studies. We (Anže and I) were invited by Nico Weydert, STATEC’s deputy director, and gave a two-day lecture on machine learning and data mining to a room full of experienced statisticians. While the purpose was to showcase Orange as a tool for machine learning, we learned a lot from the participants: the focus of machine learning is still different from that of classical statistics.

Statisticians at STATEC, like all other statisticians, I guess, value understanding of the data above all: a model’s accuracy does not count if it cannot be explained. Machine learning often sacrifices understanding for accuracy. With its focus on data and model visualization, Orange positions itself somewhere in between, but after our Luxembourg visit we are already planning new widgets for explaining predictions.

Pavia: Engineers

About fifty engineers of all kinds joined this two-day course at the University of Pavia: a few undergrads, mostly graduate students, some postdocs and even quite a few of the faculty staff. It was a bit lighter than the one in Luxembourg, but it also covered the essentials of machine learning: data management, visualization and classification, with quite some emphasis on overfitting, on the first day, and clustering and data projection on the second. We finished with a showcase on image embedding and analysis. I particularly enjoyed this last part of the workshop, where attendees were asked to grab a set of images and use Orange to see whether they could cluster or classify them correctly. They gathered all kinds of images – flowers, racing cars, guitars, photos from nature, you name it – and it was great to find that deep learning networks can be such good embedders: most students found that machine learning on their image sets works surprisingly well.

Related: BDTN 2016 Workshop on introduction to data science

Related: Data mining course at Baylor College of Medicine

We thank Riccardo Bellazzi, the organizer of the Pavia course, for inviting us. Oh, and the pizza at Rossopommodoro was great as always, though Michella’s pasta al pesto e piselli back at Riccardo’s home was even better.

My First Orange Widget

Recently, I took on a daunting task – programming my first widget. I’m not a programmer or a computer science grad, but I’ve been looking at Orange code for almost two years now and I thought I could handle it.

I set out to create a simple Concordance widget that displays word contexts in a corpus (the widget will be available in a future release). The widget turned out to be a little more complicated than I originally anticipated, but it was a great exercise in programming.

Today, I’ll explain how I got started with widget development. We will create a very basic Word Finder widget that just goes through the corpus (data) and tells you whether a word occurs in it or not. This particular widget is meant to be part of the Orange3-Text add-on (so you need the add-on installed to try it), but the basic structure is the same for all widgets.


First, I have to set up the basic widget class.

# imports for the whole widget (gui, Corpus and QLabel are used in later snippets)
from AnyQt.QtWidgets import QLabel
from Orange.data import Table
from Orange.widgets import gui
from Orange.widgets.widget import OWWidget
from orangecontrib.text.corpus import Corpus


class OWWordFinder(OWWidget):
    name = "Word Finder"
    description = "Display whether a word is in a text or not."
    icon = "icons/WordFinder.svg"
    priority = 1

    inputs = [('Corpus', Table, 'set_data')]
    # This widget will have no output, but in case you want one, you define it as below.
    # outputs = [('Output Name', output_type, 'output_method')]

    want_control_area = False


This sets up the widget’s description, icon, inputs and so on. With want_control_area we say that we only want the main window. Both areas are shown by default in Orange, and setting this to False simply hides the empty control area on the widget’s left side. If your widget has any parameters and controls, leave the control area on and place the controls there.


In __init__ we define the widget’s properties (such as the data and the queried word) and set up the view. I decided to go with a very simple design – I just put everything in the mainArea. For such a basic widget this might be OK, but otherwise you might want to dig deeper into models and use QTableView, QGraphicsScene or something similar. Here we will build just the bare bones of a functioning widget.

def __init__(self):
    super().__init__()

    self.corpus = None    # input data
    self.word = ""        # queried word

    # setting the gui
    gui.widgetBox(self.mainArea, orientation="vertical")
    # the empty value string means no attribute binding;
    # search() reads the text from the line edit itself
    self.input = gui.lineEdit(self.mainArea, self, '',
                              orientation="horizontal",
                              label='Query:')
    self.input.setFocus()
    # run method self.search on every text change
    self.input.textChanged.connect(self.search)

    # place a text label in the mainArea
    self.view = QLabel()
    self.mainArea.layout().addWidget(self.view)

OK, this completes the __init__: what the widget remembers and what it looks like. With our controls in place, the widget needs some methods, too.


The first method updates the self.corpus attribute when the widget receives an input.

def set_data(self, data=None):
    if data is not None and not isinstance(data, Corpus):
        # convert a plain Table input to a Corpus
        data = Corpus.from_table(data.domain, data)
    self.corpus = data
    self.search()

At the end we call the self.search() method, which we already met in __init__ above. This method is the key to our widget, as it runs the search every time the word changes. It also reruns the search with the same query word when the widget receives a new data set, which is why we call it in set_data() as well.


Ok, let’s finally write this method.

def search(self):
    self.word = self.input.text()
    if self.corpus is None:
        # no data on the input yet
        self.view.setText("")
        return
    # self.corpus.tokens will run a default tokenizer, if no tokens are provided on the input
    result = any(self.word in doc for doc in self.corpus.tokens)
    self.view.setText(str(result))
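To try the widget outside of the Orange canvas, a common pattern is a small __main__ block at the bottom of the file. This is only a sketch; book-excerpts is a sample corpus that, at the time of writing, ships with the Text add-on.

if __name__ == "__main__":
    from AnyQt.QtWidgets import QApplication

    app = QApplication([])
    widget = OWWordFinder()
    widget.set_data(Corpus.from_file("book-excerpts"))  # sample corpus
    widget.show()
    app.exec_()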


This is it. This is our widget. Good job. Creating a new widget can indeed be a lot of fun – you can go from a quite basic widget to a very intricate one, depending on your sense of adventure.

Finally, you can get the entire widget code in a gist.

Happy programming, everyone! 🙂