Workshop: Text Analysis for Social Scientists

Yesterday was no ordinary day at the Faculty of Computer and Information Science, University of Ljubljana – there was an unusually high proportion of Social Sciences students, researchers and other professionals in our classrooms. It was all because of a Text Analysis for Social Scientists workshop.

Related: Data Mining for Political Scientists

Text mining is becoming a popular method across the sciences, and it was time to showcase what it (and Orange) can do. In this 5-hour hands-on workshop we explained text preprocessing, clustering, and predictive models, and applied them to the analysis of selected Grimm’s Tales. We discovered that predictive models can nicely distinguish between animal tales and tales of magic, and that foxes and kings play a particularly important role in separating the two types.

The nomogram displays the six most important words (attributes) as defined by Logistic Regression. It seems that the occurrence of the word ‘fox’ can tell us a lot about whether the text is an animal tale or a tale of magic.
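For readers curious about what sits behind a picture like this, here is a minimal Python sketch of the same idea outside Orange: a bag-of-words representation followed by logistic regression, with words ranked by the size of their weights. The six one-line "tales", the stopword list and the training settings are all made up for illustration; they are not the Grimm corpus or the Orange widgets we used at the workshop.

```python
import math
import re
from collections import Counter

# Made-up one-line "tales" for illustration only (not the workshop corpus).
DOCS = [
    ("the fox tricked the wolf and the goose", "animal"),
    ("a fox and a cat walked through the forest", "animal"),
    ("the wolf chased the fox over the hill", "animal"),
    ("the king gave the princess a golden ring", "magic"),
    ("a witch cursed the king and his castle", "magic"),
    ("the prince found a golden key in the castle", "magic"),
]
# Hypothetical stopword list, just large enough for the toy texts.
STOPWORDS = {"the", "a", "and", "his", "in", "over", "through"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


# Bag of words: one column per vocabulary word, one row of counts per text.
vocab = sorted({w for text, _ in DOCS for w in tokenize(text)})
X = []
for text, _ in DOCS:
    counts = Counter(tokenize(text))
    X.append([counts[w] for w in vocab])
y = [1 if label == "animal" else 0 for _, label in DOCS]


def train_logistic(X, y, lr=0.5, epochs=300, l2=0.01):
    """Stochastic gradient descent on the log loss with a small L2 penalty."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus target
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


weights, bias = train_logistic(X, y)
word_weight = dict(zip(vocab, weights))

# Positive weights point towards "animal", negative ones towards "magic".
for word, wt in sorted(word_weight.items(), key=lambda kv: -abs(kv[1]))[:6]:
    print(f"{word:10s} {wt:+.2f}")
```

On such a tiny, hand-written corpus the ranking means little. The point is only that words appearing in one class and not the other end up with large weights of opposite sign, which is the kind of information a nomogram visualizes.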

Related: Nomogram

The second part of the workshop was dedicated to the analysis of tweets – we learned how to work with thousands of tweets on a personal computer, we plotted them on a map by geolocation, and used Instagram images for image clustering.

Related: Image Analytics: Clustering

Five hours was very little time to cover all the interesting topics in text analytics. But Orange came to the rescue once again. Interactive visualization and the possibility of close reading in Corpus Viewer were such a great help! Instead of reading 6400 tweets ‘by hand’, workshop participants can now cluster them into interesting groups, find important words in each cluster and plot them in a 2D visualization.

Participants at work.

Here, we’d like to thank NumFocus for providing financial support for the course. This enabled us to bring in students from a wide variety of fields (linguists, geographers, marketers) and prove (once again) that you don’t have to be a computer scientist to do machine learning!


Orange Workshops: Luxembourg, Pavia, Ljubljana

February was a month of Orange workshops.

Ljubljana: Biologists

We (Tomaž, Martin and I) started in Ljubljana with a hands-on course for the COST Action FA1405 Systems Biology Training School. This was a four-hour workshop with an introduction to classification and clustering, followed by an application of machine learning to the analysis of gene expression data from a plant called Arabidopsis. Organizing this course even inspired us to create a new widget, GOMapMan Ontology, which was added to the Bioinformatics add-on. We also experimented with workflows that combine gene expressions and images of mutants. The idea was to find genes with similar expression profiles, and then show images of the plants for which these genes stood out.

Luxembourg: Statisticians

This workshop took place at STATEC, Luxembourg’s National Institute of Statistics and Economic Studies. We (Anže and I) were invited by Nico Weydert, STATEC’s deputy director, and gave a two-day lecture on machine learning and data mining to a room full of experienced statisticians. While the purpose was to showcase Orange as a tool for machine learning, we learned a lot from the participants of the course: the focus of machine learning is still different from that of classical statistics.

Statisticians at STATEC, like all other statisticians, I guess, value understanding of the data above all: the accuracy of a model does not count if the model cannot be explained. Machine learning often sacrifices understanding for accuracy. With its focus on data and model visualization, Orange positions itself somewhere in between, but after our Luxembourg visit we are already planning new widgets for explaining predictions.

Pavia: Engineers

About fifty engineers of all kinds joined this two-day course at the University of Pavia: a few undergrads, mostly graduate students, some postdocs and even quite a few faculty members. It was a bit lighter than the one in Luxembourg, but it also covered the essentials of machine learning: data management, visualization and classification, with quite some emphasis on overfitting, on the first day, and then clustering and data projection on the second day. We finished with a showcase on image embedding and analysis. I particularly enjoyed this last part of the workshop, where attendees were asked to grab a set of images and use Orange to find out whether they could cluster or classify them correctly. They gathered all kinds of images, like flowers, racing cars, guitars, photos from nature, you name it, and it was great to find that deep learning networks can be such good embedders: most students found that machine learning on their image sets worked surprisingly well.

Related: BDTN 2016 Workshop on introduction to data science

Related: Data mining course at Baylor College of Medicine

We thank Riccardo Bellazzi, an organizer of the Pavia course, for inviting us. Oh, yeah, the pizza at Rossopomodoro was great as always, though Michella’s pasta al pesto e piselli back at Riccardo’s home was even better.

BDTN 2016 Workshop: Introduction to Data Science

Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was one of them this year.

Related: Intro to Data Mining for Life Scientists

The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…

Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.

To get a first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction and simple visual exploration of any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with the Image Analytics add-on and checked whether we could detect wrongly labelled microscopy images with machine learning.

Related: Data Mining Course in Houston #2

These workshops are not only fun, but also an amazing learning opportunity for us, as they show how our users think and how we can improve Orange even further.

Intro to Data Mining for Life Scientists

RNA Club Munich organized the Molecular Life of Stem Cells Conference in Ljubljana this past Thursday, Friday and Saturday. They asked us to organize a four-hour workshop on data mining. And here we were: four of us, Ajda, Anze, Marko and myself (Blaz), ran a workshop for 25 students with a molecular biology and biochemistry background.


We covered some basic data visualization, modeling (classification) and model scoring, hierarchical clustering and data projection, and finished with a touch of deep learning by diving into image analysis with deep learning-based embedding.

Related: Data Mining Course at Baylor College of Medicine in Houston

It’s not easy to pack so many new things on data analytics into four hours, but working with Orange helps. This was a hands-on workshop. Students brought their own laptops with Orange and several of its add-ons for bioinformatics and image analytics. We also showed how to prepare one’s own data using Google Forms: we designed a questionnaire, augmented it in class, ran it with the students and then analyzed the answers with Orange.




The hard part of any short course that includes machine learning is how to explain overfitting. The concept is not trivial for data science newcomers, but it is so important that it simply cannot be left out. Luckily, Orange has some cool widgets to help us understand overfitting. Below is the workflow we used. We read some data (this time it was a yeast gene expression data set called brown-selected that comes with Orange), “destroyed the data” by randomly permuting the column with class values, trained a classification tree, and observed near-perfect results when the model was checked on the training data.


Sure this works, you are probably saying. The models should have been scored on a separate test set! Exactly, and this is what we did next with Data Sampler, which led us to cross-validation and the Test & Score widget.
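The same experiment can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the data are synthetic two-class Gaussian points rather than brown-selected, the tree is a small home-made unpruned classification tree (Gini splits, grown until leaves are pure) rather than Orange's widget, and the names `grow`, `predict` and `cross_val_accuracy` are our own.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for t in values[:-1]:  # rows with r[f] <= t go left; both sides stay non-empty
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best_score is None or score < best_score:
                best, best_score = (f, t), score
    return best


def grow(rows, labels):
    """Grow an unpruned tree: keep splitting until every leaf is pure."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # identical rows with different labels: majority vote
        return max(set(labels), key=labels.count)
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow([rows[i] for i in left], [labels[i] for i in left]),
            grow([rows[i] for i in right], [labels[i] for i in right]))


def predict(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def cross_val_accuracy(rows, labels, k=5):
    """k-fold cross-validation: each row is scored by a tree that never saw it."""
    hits = 0
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        tree = grow([rows[i] for i in train], [labels[i] for i in train])
        hits += sum(predict(tree, rows[i]) == labels[i] for i in test)
    return hits / len(rows)


# Synthetic stand-in data: two classes whose means differ in every feature.
rng = random.Random(0)
labels = [i % 2 for i in range(120)]
rows = [[rng.gauss(2.0 * y, 1.0) for _ in range(4)] for y in labels]

# "Destroy the data": permute the class column so it carries no information.
permuted = labels[:]
rng.shuffle(permuted)

train_acc_permuted = accuracy(grow(rows, permuted), rows, permuted)
cv_acc_permuted = cross_val_accuracy(rows, permuted)
cv_acc_real = cross_val_accuracy(rows, labels)

print("permuted labels, scored on training data:", train_acc_permuted)
print("permuted labels, 5-fold cross-validation:", cv_acc_permuted)
print("real labels, 5-fold cross-validation:    ", cv_acc_real)
```

Scored on its own training data, the unpruned tree simply memorizes the permuted labels. Cross-validation scores each row with a tree that has not seen it, so on permuted labels the accuracy should fall back to roughly what guessing would give.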

This was a great and interesting short course and we were happy to contribute to the success of the student-run MLSC-2016 conference.

Orange workshops around the world

Even though summer is nigh, we are hardly going to catch a summer break this year. The Orange team is busy holding workshops around the world to present the latest widgets and data mining tools to the public. Last week we had a very successful tutorial at [BC]2 in Basel, Switzerland, where Marinka and Blaž presented data fusion. Part of the tutorial was a hands-on workshop with Orange’s new add-on for data fusion. Marinka also got an award for the poster, in which data fusion was used to hunt for Dictyostelium bacterial-response genes. This week we are in Pavia, Italy, for the Matrix Computations in Biomedical Informatics Workshop at AIME 2015, a Conference on Artificial Intelligence in Medicine. During the workshop, we are giving an invited talk on learning latent factor models by data fusion, and we’ll also show Orange’s data fusion add-on. Thanks to the workshop organizers, Riccardo Bellazzi, Jimeng Sun and Ping Zhang, the workshop program looks great.


Blaž with Riccardo and John in Pavia, Italy

Workshops at Baylor College of Medicine

On May 22nd and May 23rd, we (Blaz Zupan and Janez Demsar, assisted by Marinka Zitnik and Balaji Santhanam) gave two hands-on workshops called Data Mining without Programming at Baylor College of Medicine in Houston, Texas.

Actually, there was a lot of programming, but no Python or the like. The workshop was designed for biomedical students and Baylor’s faculty members. We presented a visual programming approach to developing data mining workflows for interactive data exploration. A three-hour workshop consisted of 15 data mining lessons on visual data exploration, classification, clustering, network analysis, and gene expression analytics. Each lesson focused on a particular data analysis task that the attendees solved with Orange.

The two workshops were organized by Baylor’s Computational and Integrative Biomedical Research Center. Over the two days, the event drew 120 attendees.