k-Means & Silhouette Score

k-Means is one of the most popular unsupervised learning algorithms for finding interesting groups in our data. It can be useful in customer segmentation, finding gene families, determining document types, improving human resource management and so on.

But… have you ever wondered how k-means actually works? In the following three videos we explain how to construct a data analysis workflow with k-means, how the algorithm works, how to find a good value of k, and how the silhouette score can help us identify inliers and outliers.

 

#1 Constructing workflow with k-means

#2 How k-means works [interactive visualization]

#3 How silhouette score works and why it is useful
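
If you would rather read code than watch, the gist of videos #2 and #3 can be sketched in a few lines of Python. The sketch below is ours rather than taken from the videos; it uses scikit-learn and the iris data to scan several values of k and score each clustering with the average silhouette:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# try several cluster counts and report the average silhouette for each
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print("k=%d  silhouette=%.3f" % (k, silhouette_score(X, labels)))

The k with the highest average silhouette is a reasonable candidate for the number of clusters, and the per-point values (sklearn.metrics.silhouette_samples) point to the inliers and outliers the third video discusses.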

Why Orange?

Why is Orange so great? Because it helps people solve problems quickly and efficiently.

Sašo Jakljevič, a former student of the Faculty of Computer and Information Science at the University of Ljubljana, created the following motivational videos for his graduation thesis. He used two beloved datasets, iris and zoo, to showcase how to tackle real-life problems with Orange.

Workshop on InfraOrange

Thanks to a collaboration with the synchrotrons Elettra (Trieste) and Soleil (Paris), Orange is getting an add-on, InfraOrange, with widgets for the analysis of infrared spectra. Its primary users naturally come from these two institutions, so we organized the first InfraOrange workshop at one of them.

Some 20 participants spent the first day of the workshop in Trieste learning the basics of Orange and its use for data mining. With Janez at the helm and Marko assisting in the back, we traversed the standard list of visual and statistical techniques and a bit of unsupervised and supervised learning. The workshop was perhaps a bit unusual, as most attendees were already quite familiar with these methods, but most had not yet used them in such an interactive fashion.

Marko explaining how to analyze spectral data with Orange.

 

On the second day Marko and Andrej took over and focused on the analysis of spectral data. We demonstrated the use of widgets developed specifically for infrared data and combined them with the data mining techniques covered on the first day. After lunch the attendees tried to work on their own data sets, which was a real test for InfraOrange.

Orange for spectral data.

 

Group photo!

 

We now have a lot of realistic feedback on what to improve. There is still a lot of work to do, but a week after the workshop the most frequently occurring bugs had already been fixed.

The future of InfraOrange looks bright and… ahem… well, colorful! 🙂

Orange Workshops: Luxembourg, Pavia, Ljubljana

February was a month of Orange workshops.

Ljubljana: Biologists

We (Tomaž, Martin and I) started in Ljubljana with a hands-on course for the COST Action FA1405 Systems Biology Training School. This was a four-hour workshop with an introduction to classification and clustering, followed by an application of machine learning to the analysis of gene expression data from the plant Arabidopsis. Organizing this course even inspired us to create a new widget, GOMapMan Ontology, which was added to the Bioinformatics add-on. We also experimented with workflows that combine gene expressions and images of mutants. The idea was to find genes with similar expression profiles and then show images of the plants in which these genes stood out.

Luxembourg: Statisticians

This workshop took place at STATEC, Luxembourg's National Institute of Statistics and Economic Studies. We (Anže and I) were invited by Nico Weydert, STATEC's deputy director, and gave a two-day lecture on machine learning and data mining to a room full of experienced statisticians. While the purpose was to showcase Orange as a tool for machine learning, we learned a lot from the participants as well: the focus of machine learning is still different from that of classical statistics.

Statisticians at STATEC, like, I suppose, statisticians everywhere, value understanding of the data above all: the accuracy of a model does not count if the model cannot be explained. Machine learning often sacrifices understanding for accuracy. With its focus on data and model visualization, Orange positions itself somewhere in between, and after our Luxembourg visit we are already planning new widgets for explaining predictions.

Pavia: Engineers

About fifty engineers of all kinds at the University of Pavia: a few undergrads, mostly graduate students, some postdocs and even quite a few of the faculty joined this two-day course. It was a bit lighter than the one in Luxembourg, but it also covered the essentials of machine learning: data management, visualization and classification, with quite some emphasis on overfitting, on the first day, and clustering and data projection on the second. We finished with a showcase on image embedding and analysis. I particularly enjoyed this last part of the workshop, where attendees were asked to grab a set of images and use Orange to find out whether they could cluster or classify them correctly. They gathered all kinds of images (flowers, racing cars, guitars, photos from nature, you name it), and it was great to see that deep learning networks can be such good embedders: most students found that machine learning on their image sets worked surprisingly well.

We thank Riccardo Bellazzi, the organizer of the Pavia course, for inviting us. Oh, and the pizza at Rossopommodoro was great as always, though Michella's pasta al pesto e piselli back at Riccardo's home was even better.

My First Orange Widget

Recently, I took on a daunting task – programming my first widget. I’m not a programmer or a computer science grad, but I’ve been looking at Orange code for almost two years now and I thought I could handle it.

I set out to create a simple Concordance widget that displays word contexts in a corpus (the widget will be available in a future release). The widget turned out to be a little more complicated than I had originally anticipated, but it was a great exercise in programming.

Today, I'll explain how I got started with widget development. We will create a very basic Word Finder widget that simply goes through the corpus (data) and tells you whether a word occurs in it or not. This particular widget is meant to be a part of the Orange3-Text add-on (so you need the add-on installed to try it), but the basic structure is the same for all widgets.

 

First, I have to set up the basic widget class.

# Imports assumed for Orange 3 with the Orange3-Text add-on installed;
# the Qt import in particular may differ between versions.
from AnyQt.QtWidgets import QLabel

from Orange.data import Table
from Orange.widgets import gui
from Orange.widgets.widget import OWWidget
from orangecontrib.text.corpus import Corpus


class OWWordFinder(OWWidget):
    name = "Word Finder"
    description = "Display whether a word is in a text or not."
    icon = "icons/WordFinder.svg"
    priority = 1

    inputs = [('Corpus', Table, 'set_data')]
    # This widget will have no output, but in case you want one,
    # you define it as below.
    # outputs = [('Output Name', output_type, 'output_method')]

    want_control_area = False

 

This sets up the widget's description, icon, inputs and so on. With want_control_area we say we only want the main window. Both areas are on by default in Orange, so this simply hides the otherwise empty control area on the widget's left side. If your widget has any parameters and controls, leave the control area on and place the controls there.

 

In __init__ we define the widget's properties (such as the data and the queried word) and set up the view. I decided on a very simple design and just put everything in the mainArea. For such a basic widget this might be ok, but otherwise you will want to dig deeper into models and use QTableView, QGraphicsScene or something similar. Here we build just the bare bones of a functioning widget.

def __init__(self):
    super().__init__()

    self.corpus = None    # input data
    self.word = ""        # queried word

    # setting the gui
    gui.widgetBox(self.mainArea, orientation="vertical")
    self.input = gui.lineEdit(self.mainArea, self, '',
                              orientation="horizontal",
                              label='Query:')
    self.input.setFocus()
    # run the method self.search on every text change
    self.input.textChanged.connect(self.search)

    # place a text label in the mainArea
    self.view = QLabel()
    self.mainArea.layout().addWidget(self.view)

Ok, this completes __init__: what the widget remembers and how it looks. With the GUI in place, the widget needs some methods, too.

 

The first method updates the self.corpus attribute when the widget receives an input.

def set_data(self, data=None):
    # if we receive a plain Table, convert it to a Corpus first
    if data is not None and not isinstance(data, Corpus):
        data = Corpus.from_table(data.domain, data)
    self.corpus = data
    self.search()

At the end we call the self.search() method, which we already met in __init__ above. This method is the key to our widget, as it runs the search every time the query word changes. Moreover, it reruns the search on the same query word whenever the widget receives a new data set, which is why we also call it in set_data().

 

Ok, let’s finally write this method.

def search(self):
    self.word = self.input.text()
    if self.corpus is None:   # no data on the input yet
        self.view.setText("")
        return
    # self.corpus.tokens will run a default tokenizer,
    # if no tokens are provided on the input
    result = any(self.word in doc for doc in self.corpus.tokens)
    self.view.setText(str(result))

 

This is it. This is our widget. Good job. Creating a new widget can indeed be a lot of fun: you can go from a quite basic widget to a very intricate one, depending on your sense of adventure.

Finally, you can get the entire widget code in a gist.

Happy programming, everyone! 🙂

 

For When You Want to Transpose a Data Table…

Sometimes, you need something more. Something different. Something that helps you look at the world from a different perspective. Sometimes, you simply need to transpose your data.

Since version 3.3.9, Orange has a Transpose widget that flips your data table around: columns become rows and rows become columns. This is often useful if you have, say, biological data.

Related: Datasets in Orange Bioinformatics

Today we will play around with brown-selected.tab, a data set on gene expression levels for 79 experiments. Our data table has genes in rows and experiments in columns, with gene expression levels recorded as values.

This representation focuses on the genes: we can plot them or build models that predict their function. But what if I want to explore the experimental conditions and see how different external conditions influence the yeast cells? For this, it would be better to have experiments in rows and genes in columns. We can do this with Transpose.
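
If you prefer scripting to widgets, here is a minimal sketch of the same idea, transposing the raw numpy matrix behind an Orange table and projecting the experiments to 2-D with scikit-learn's PCA (which we will also meet below), rather than the exact widget pipeline:

from sklearn.decomposition import PCA
from Orange.data import Table

# brown-selected ships with Orange; rows are genes, columns are experiments
data = Table("brown-selected")

# transpose the raw numpy matrix so that each row is one experiment
# (impute first if your copy of the data contains missing values)
experiments = data.X.T

# project the many-dimensional experiment vectors to two dimensions
coords = PCA(n_components=2).fit_transform(experiments)
print(coords.shape)  # one 2-D point per experiment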

The Transpose widget took our gene meta attribute and used it for the new column names (YGR270W, YIL075C, etc.). It also appended class values to the columns (Proteas). The former column names (alpha 0, alpha 7, etc.) became a new meta attribute called Feature name.

Ok, we have a transposed data table. Now we ask ourselves: “Do similar experiment types (e.g. heat, cold, alpha, …) behave similarly?”

Let's use PCA to transform these many-dimensional experiment vectors into a 2-D representation, and Scatter Plot to observe the experiments (not the genes) in 2-D space. We expect experiments of the same type to lie closer together than unrelated ones. The scatter plot after the PCA transformation looks like this:

Spo5 11 lies quite far from the rest, so it could be an experiment to look out for. If we zoom in on the big cluster, we see that similar experiments indeed lie closer together.

Now, if you are reproducing the result, you probably won’t see these nice colors for class.

This is because we used the Create Class widget to help us create the new class values. Create Class is already available in the Orange3-Prototypes add-on and will be included in a future Orange release. You can learn more about it soon… 🙂

 

Preparing Scraped Data

One of the key questions in every data analysis is how to get the data and put it in the right form(at). In this post I'll show you how to easily get data from the web and transfer it to a file Orange can read.

Related: Creating a new data table in Orange through Python

 

First, we'll have to do some scripting. We'll use a couple of Python libraries: urllib.request for fetching the data, BeautifulSoup for parsing it, csv for writing it, and regular expressions for extracting the right parts.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

Ok, we've imported all the libraries we'll need. Now we will scrape the data from our own blog to see how many posts we've written over the years.

html = urlopen('http://blog.biolab.si')
soup = BeautifulSoup(html.read(), 'lxml')

The first line opens the address of the site we want to scrape; in our case, this is our blog. The second line reads the response and parses the raw HTML with BeautifulSoup. The part we are interested in looks like this:

<aside id="archives-2" class="widget widget_archive"><h3 class="widget-title">Archives</h3>
<ul>
   <li><a href='http://blog.biolab.si/2017/01/'>January 2017</a>&nbsp;(1)</li>
   <li><a href='http://blog.biolab.si/2016/12/'>December 2016</a>&nbsp;(3)</li>
   <li><a href='http://blog.biolab.si/2016/11/'>November 2016</a>&nbsp;(4)</li>
   <li><a href='http://blog.biolab.si/2016/10/'>October 2016</a>&nbsp;(3)</li>
   <li><a href='http://blog.biolab.si/2016/09/'>September 2016</a>&nbsp;(2)</li>
   <li><a href='http://blog.biolab.si/2016/08/'>August 2016</a>&nbsp;(5)</li>
   <li><a href='http://blog.biolab.si/2016/07/'>July 2016</a>&nbsp;(3)</li>....

Ok, HTML is nice, but we can't really do data analysis with this. We will have to transform this output into something sensible. How about .csv, a simple comma-delimited format Orange can recognize?

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')

We created a new file called 'scraped.csv' to which we will write our content (the 'w' parameter means write, and newline='' is recommended by the csv module to avoid spurious blank lines on some platforms). Then we defined the writer and set the delimiter to a comma.

Now we need to add the header row, so Orange will know what the column names are. We add this right after defining the csvwriter variable.

csvwriter.writerow(["Date", "No_of_Blogs"])

Now we have two columns, one named 'Date' and the other 'No_of_Blogs'. The final step is to extract the data. We have a bunch of lines of HTML, but the one we're interested in is in an 'aside' section with the id 'archives-2'. We first extract only this section (.find(id='archives-2')) and then get all the lines of the archive, which carry the tag 'li' (.find_all('li')):

for item in soup.find(id="archives-2").find_all('li'):

This is the result of print(item).

<li><a href="http://blog.biolab.si/2017/01/">January 2017</a> (1)</li>

Now we need to get the actual content from each line. The first part we need is the date of the archived content. Orange can read dates, but they need to come in the right format. We extract the date from the href part with item.a.get('href'). Then we need to keep only the digits, as we're not interested in the rest of the link. We do this with a regular expression that finds digits:

date = re.findall(r'\d+', item.a.get('href'))

Regex's findall function returns a list, in our case containing two items: the year and the month of the archived content. The second part of our data is the number of blog posts archived in a particular month. We again extract this with a regex digit search, but this time from the actual content, item.contents[1]:

digits = re.findall(r'\d+', item.contents[1])

Finally, we need to write each line to a .csv file we created above.

csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])

Here, we formatted the date into an ISO-standard format that Orange recognizes as a time variable ("%s-%s-01" % (date[0], date[1])), while the second part is simply the count of our blog posts.

This is the entire code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

html = urlopen('http://blog.biolab.si')
soup = BeautifulSoup(html.read(), 'lxml')

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerow(["Date", "No_of_Blogs"])
    for item in soup.find(id="archives-2").find_all('li'):
        date = re.findall(r'\d+', item.a.get('href'))
        digits = re.findall(r'\d+', item.contents[1])
        csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])

Related: Scripting with Time Variable

 

Now let's load this in Orange. The File widget can easily read the .csv format, and it also correctly identifies the two column types, datetime and numeric. A quick glance into the Data Table…

Everything looks ok. We can use the Timeseries add-on to inspect how many blog posts we've written each month since 2010. Connect the As Timeseries widget to File; Orange will automatically suggest using Date as the time variable. Finally, we plot the data with Line Chart. This is the curve of our blogging activity.

The example is extremely simple. A somewhat proficient user can extract much more interesting data than a simple blog count, but one always needs to keep in mind the legal aspects of web scraping. Nevertheless, this is a popular and fruitful way to extract and explore the data!

Data Preparation for Machine Learning

We've said it numerous times and we're going to say it again: data preparation is crucial for any data analysis. If your data is messy, there's no way you can make sense of it, let alone a computer. Computers are great at handling large, even enormous data sets, speedy computing and recognizing patterns. But they fail miserably if you give them the wrong input. Moreover, some classification methods work better with binary values, others with continuous ones, so it is important to know how to treat your data properly.

Orange is well equipped for such tasks.

 

Widget no. 1: Preprocess

Preprocess is there to handle a big share of your preprocessing tasks.

 

Original data.

 

  • It can normalize numerical variables. Say we have a fictional data set of people employed in your company. We want to know which employees are more likely to go on holiday, based on their yearly income, years employed in your company and total years of experience in the industry. If you plot this in a heat map, you will see a bold yellow line at 'yearly income'. This happens because yearly income has much higher values than years of experience or years employed. You would naturally like the wage not to outweigh the rest of the feature set, so normalization is the way to go. Normalization transforms your values to relative terms, for instance (depending on the type of normalization) to a scale from 0 to 1. Now Heat Map neatly shows that people who have been employed longer and have a higher wage go on holiday more often. (Yes, this is a totally fictional data set, but you see the point.)

No normalization.

Normalized data.

 

  • It can impute missing values. Imputing the average or the most frequent value might seem overly simple, but it actually works most of the time. Also, all the learners that require imputation do it implicitly, so the user doesn't have to configure yet another widget for it.
  • If you want to compare your results against a randomly shuffled data set, select 'Randomize'; and if you want to select relevant features, this is the widget for that as well.
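
The normalization and imputation described above are also available in scripting. Here is a minimal sketch, assuming Orange's Orange.preprocess module (the exact defaults and names may differ between versions):

from Orange.data import Table
from Orange.preprocess import Impute, Normalize

data = Table("heart_disease")  # a data set bundled with Orange

# scale numeric variables to a common span so that no single
# variable (such as a wage) dominates the others
normalized = Normalize(norm_type=Normalize.NormalizeBySpan)(data)

# fill in missing values (by default with the average / most frequent value)
imputed = Impute()(normalized)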

Preprocessing needs to be used with caution and with an understanding of your data, to avoid losing important information or, worse, overfitting the model. A good example is the case of paramedics, who usually don't record the pulse if it is normal. Missing values here cannot be imputed with an average or a random value; they should be treated as a distinct value (normal pulse). Domain knowledge is always crucial for data preparation.

 

Widget no. 2: Discretize

For certain tasks you might want to resort to binning, which is what Discretize does. It distributes your continuous values into a selected number of bins, thus making the variable discrete-like. You can either discretize all your variables at once with a selected discretization type, or choose a particular discretization method for each attribute. The cool thing is that the transformation is already displayed in the widget, so you instantly know what you are getting in the end. A good example of discretization is a data set of your customers with their age recorded. It would make little sense to segment customers by each particular age, so binning them into, say, four age groups (young, young adult, middle-aged, senior) is a better solution. Some visualizations also require discrete features: Sieve Diagram is currently one such widget, while Mosaic Display already performs the transformation internally.
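
In scripting, the same binning can be sketched roughly as follows, assuming Orange.preprocess.Discretize with the EqualFreq method (verify the names against your version's documentation):

from Orange.data import Table
from Orange.preprocess import Discretize
from Orange.preprocess.discretize import EqualFreq

data = Table("iris")

# split every numeric variable into four bins holding roughly
# the same number of instances
discretized = Discretize(method=EqualFreq(n=4))(data)
print(discretized.domain)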

 

Original data.

Discretized data, with 'years employed' split into lower than 8 vs. higher than or equal to 8 (the same for 'yearly income' and 'experience in the industry').

 

Widget no. 3: Continuize

This widget essentially creates new attributes out of your discrete ones. If you have, for example, an attribute with people's eye color, where the values can be blue, brown or green, you would probably want three separate attributes, 'blue', 'green' and 'brown', each set to 0 or 1 depending on whether a person has that eye color. Some learners perform much better if data is transformed in this way. You can also treat one value as the base state and record only deviations from it, either the first or target value ('target or first value as base') or the most common value ('most frequent value as base'). The Continuize widget offers you a lot of room to play. The best way to learn it is to take a small data set with discrete values, connect it to Continuize and then further to a Data Table, and change the parameters while observing the transformations in real time. It is useful, for instance, for projecting discrete data points in Linear Projection.
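
A scripting sketch of the same transformation, assuming Orange.preprocess.Continuize and its Indicators treatment (again, verify against your version):

from Orange.data import Table
from Orange.preprocess import Continuize

data = Table("titanic")  # a small bundled data set with discrete attributes

# turn each discrete value into its own 0/1 indicator column
continuized = Continuize(multinomial_treatment=Continuize.Indicators)(data)
print(continuized.domain)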

 

Original data.

Continuized data with two new columns: the attribute 'position' was replaced by the attributes 'position=office worker' and 'position=technical staff' (the same for 'gender').

 

Widget no. 4: Purge Domain

Get a broom and sort your data! That's what Purge Domain does. If all the values of an attribute are constant, it removes that attribute. If you have unused (empty) attributes in your data, it removes them too. Effectively, you get a clean and compact data set in the end.
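
Conceptually, the constant-column part of this cleanup boils down to something like the following numpy sketch (the widget itself does more, such as removing unused values of discrete variables):

import numpy as np

def purge_constant_columns(X):
    """Keep only columns with at least two distinct non-missing values."""
    keep = [j for j in range(X.shape[1])
            if len(np.unique(X[~np.isnan(X[:, j]), j])) > 1]
    return X[:, keep]

X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, np.nan],
              [3.0, 5.0, 0.0]])
print(purge_constant_columns(X))  # only the first column survives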

Original data.

Empty columns and columns with the same (constant) value were removed.

 

Of course, don’t forget to include all these procedures into your report with the ‘Report’ button! 🙂

The Beauty of Random Forest

It is the time of the year when we adore Christmas trees. But these are not the only trees we at the Orange team think about. In fact, through the almost lifelong professional deformation of being a data scientist, when I think about trees I often think about classification and regression trees. And they can be beautiful as well. Not only for their elegance in explaining hidden patterns, but aesthetically, when rendered in Orange. And even more beautiful than a single tree is Orange's rendering of a forest, that is, a random forest.

Related: Pythagorean Trees and Forests

Here are six trees in the random forest constructed on the housing data set:

The random forest for the annealing data set includes a set of smaller trees:

A Christmas-lit random forest inferred from the pen digits data set looks somewhat messy in trying to categorize into ten different classes:

The power of beauty! No wonder random forests are among the best machine learning tools. Orange renders them according to the idea of Fabian Beck and colleagues, who proposed Pythagoras trees for the visualization of hierarchies. The actual implementation for classification and regression trees in Orange was created by Pavlin Policar.

BDTN 2016 Workshop: Introduction to Data Science

Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was one of them this year.

Related: Intro to Data Mining for Life Scientists

The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…

Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.

To get a first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction of a workflow and the simple visual exploration of any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with the ImageAnalytics add-on and checked whether we could detect wrongly labelled microscopy images with machine learning.

Related: Data Mining Course in Houston #2

These workshops are not only fun, but also an amazing learning opportunity for us, as they show how our users think and how we can further improve Orange.