Orange at GIS Ostrava

Ostrava is a city in the north-east of the Czech Republic and the capital of the Moravian-Silesian Region. GIS Ostrava is a yearly conference organized by Jiří Horák and his team at the Technical University of Ostrava. The university has a nice campus with a number of new developments. I have learned that this is the largest university campus in central and eastern Europe, as most universities, like mine, are city universities with buildings dispersed around the city.

During the conference, I gave an invited talk on “Data Science for Everyone” and showed how Orange can be used to teach basic data science concepts in a few hours, so that trainees gain some intuition about what data science is and then, preferably, use the software on their own data. To prove this concept, I gave a hands-on workshop on the next day of the conference. The workshop was also attended by several teachers who are considering incorporating Orange into their data science curricula.

Admittedly, there was not much GIS in my presentations, as I, as planned, focused more on data science. But I did include an example of how to project data onto geographical maps in Orange. The example involved the analysis of Human Development Index data and clustering. When projected onto the map, the results of clustering can be unexpected if we select only the features that address quality of life: check out the map below and try to figure out what is wrong.

Here, I would like to thank Igor Ivan and Jiří Horák for the invitation, and their group, specifically Michal Kacmarik, for the hospitality.

The Changing Status Bar

Every Friday, when the core team of Orange developers meets, we design new improvements to Orange’s graphical interface. This time, it was the status bar. Well, actually, the status bar came up quite a while ago and required changes to the core widget library, but the work is materializing these days and you will see the changes in the next release.

Consider the Neighbors widget. The widget takes input data and reference data items, and outputs the instances from the input data that are most similar to the references. For example, if a dolphin is the reference, we would like to know which three animals are most similar to it. But this is not what I wanted to write about. I would only like to say that we are making a slight change in the user interface. Below is the Neighbors widget in the current release of Orange, and in the upcoming one.

See the difference? We are getting rid of the infobox at the top of the control area and moving it to the status bar. In the infobox, widgets typically display what is on their input and what is on their output after the data has been processed. Moving this information to the status bar will make widgets more compact and less cluttered. We will change the infoboxes in this way in all of the widgets.
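Conceptually, what the Neighbors widget computes can be sketched in a few lines of Python. The animal feature vectors below are a hypothetical stand-in, and scikit-learn's NearestNeighbors stands in for the widget's own distance computation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical animal feature vectors (e.g., hair, aquatic, legs, fins)
animals = {
    "dolphin":  [0, 1, 0, 1],
    "porpoise": [0, 1, 0, 1],
    "seal":     [1, 1, 0, 1],
    "dog":      [1, 0, 4, 0],
    "sparrow":  [0, 0, 2, 0],
}
names = list(animals)
X = np.array([animals[n] for n in names], dtype=float)

# Fit on the input data, then query with the reference (the dolphin)
nn = NearestNeighbors(n_neighbors=4).fit(X)
dist, idx = nn.kneighbors([animals["dolphin"]])

# Skip the reference itself and keep the three most similar animals
closest = [names[i] for i in idx[0] if names[i] != "dolphin"][:3]
print(closest)
```

With these made-up vectors, the porpoise (identical features) comes out first, followed by the seal and the sparrow.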

Single-Cell Data Science for Everyone

In the past twenty years, molecular biologists have invented technologies that can collect abundant experimental data. One such technique is single-cell RNA-seq, which, greatly simplified, can measure the activity of genes in potentially large collections of cells. The interpretation of such data can tell us about the heterogeneity of cells and cell types, or provide information on their development.

Typical analysis toolboxes for single-cell data are available in R and Python and, most notably, include Seurat and scanpy, but they lack the interactive visualizations and simplicity of Orange. Since the fall of 2017, we have been developing an extension of Orange, which is now (almost) ready. It has even been packaged into its own installer. The first real test of the software came in early 2018 through a one-day workshop at Janelia Research Campus. On March 6, with a much more refined version of the software, we repeated the hands-on workshop at the University of Pavia.

The five-hour workshop covered both the essentials of data analysis and single-cell analytics. The topics included data preprocessing, clustering, and two-dimensional embedding, as well as working with marker genes, differential expression analysis, and interpretation of clusters through gene ontology analysis.

I want to thank Prof. Dr. Riccardo Bellazzi and his team for the organization, and the Erasmus program for financial support. I have been a frequent guest in Pavia, and I learn something new every time I am there. Besides meeting the new students and colleagues who attended the workshop and hearing about their data analysis challenges, this time I also learned about a dish I had never had before in all my Italian travels. For one of the dinners (thank you, Michela) we had pizzoccheri. Simply great!

Single cell analytics workshop at HHMI | Janelia

HHMI | Janelia is one of the prettiest research campuses I have ever visited. Located in Ashburn, VA, about 20 minutes from Washington Dulles airport, it is conveniently placed yet, in a way, secluded from the buzz of the capital. We adored the guest house with a view of the lake, the tasty Janelia-style breakfast (hash browns with two eggs and sausage, plus a bagel with cream cheese) in the on-campus pub, the beautifully designed interiors that foster collaboration and interaction, and the late-evening discussions in the in-house pub.

All this thanks to an invitation from Andrew Lemire, a manager of a shared high-throughput genomics resource, and Dr. Vilas Menon, a mathematician specializing in quantitative genomics. With Andy and Vilas, we have been collaborating over the past few months on devising a simple and intuitive tool for the analysis of single-cell gene expression data. Single-cell high-throughput technology is one of the latest approaches that lets us see what is happening within a single cell, and it does so by simultaneously scanning through potentially thousands of cells. That generates loads of data, and we have been adapting Orange for the task of single-cell data analysis.

Namely, in the past half a year, we have been perfecting an add-on for Orange with components for single-cell analysis. This endeavor became so substantial that we have even designed a new Orange installer, called scOrange. With everything still at the prototype stage, we had enough courage to present the tool at Janelia, first through a seminar, and the next day within a five-hour lecture that I gave together with Martin Strazar, a PhD student and bioinformatics expert from my lab. Many labs at Janelia are embarking on single-cell technology, and judging by the crowd that gathered at both events, it looked like everyone was there.

Orange, or rather, scOrange, worked as expected, and the hands-on workshop went smoothly, despite testing the software on some rather large data sets. Our Orange add-on for single-cell analytics is still at an early stage of development, but it already has some advanced features, like biomarker discovery and tools for the characterization of cell clusters, that may help reveal hidden relations between genes and phenotypes. Thanks to Andy and Vilas, Janelia proved an excellent proving ground for scOrange, and we are looking forward to our next hands-on single-cell analytics workshop in Houston.

Related: Hands-On Data Mining Course in Houston

Orange in Kolkata, India

We have just completed the hands-on course on data science at one of the most famous Indian educational institutions, the Indian Statistical Institute. The one-week course came at the invitation of the Institute’s director, Prof. Dr. Sanghamitra Bandyopadhyay, and was financially supported by funding from India’s Global Initiative of Academic Networks.

The Indian Statistical Institute lies in the heart of old Kolkata, its picturesque campus a peaceful oasis of mango orchards and water-lily lakes. It was founded by Prof. Prasanta Chandra Mahalanobis, one of the giants of statistics. Today, the Institute researches statistics and computational approaches to data analysis and runs a grad school, where a rather small number of students are hand-picked from tens of thousands of applicants.

The course was hands-on. The number of participants was limited to forty, a limit set by the number of computers in the Institute’s largest computer lab. Half of the students came from the Institute’s grad school, and the other half from other universities around Kolkata or even other schools around India, including a few participants from another famous institution, the Indian Institutes of Technology. While the lectures included some writing on the whiteboard to explain machine learning, the majority of the course was about exploring example data sets, building workflows for data analysis, and using Orange on practical cases.

The course was not one of the lightest for the lecturer (Blaž Zupan): about five full hours each day for five days in a row, extremely motivated students whose questions filled all of the coffee breaks, the need for a deeper dive into some of the methods after questions in the classroom, and much improvisation to adapt our standard data science course to possibly the brightest pack of data science students we have seen so far. We covered almost the full spectrum of data science topics: from data visualization to supervised learning (classification and regression, regularization), model exploration, and estimation of model quality, plus computation of distances, unsupervised learning, outlier detection, data projection, and methods for parameter estimation. We applied these to data from health care, business (which proposal on Kickstarter will succeed?), and images. Again, just like in our other data science courses, the use of Orange’s educational widgets, such as Paint Data, Interactive k-Means, and Polynomial Regression, helped us build an intuitive understanding of the machine learning techniques.

The course was beautifully organized by Prof. Dr. Saurabh Das with the help of Prof. Dr. Shubhra Sankar Ray, and we would like to thank them for their devotion and excellent organizational skills. And of course, many thanks to the participating students: for an educator, it is always a great pleasure to lecture and work with highly motivated and curious colleagues, who made our trip to Kolkata fruitful and fun.

Orange at Station Houston

With over 262 member companies, Station Houston is the largest hub for tech startups in Houston.

One of its members is also Genialis, a life science data exploration company that emerged from our lab and is now delivering pipelines and user-friendly apps for analytics in systems biology.

Thanks to an invitation by the director of operations, Alex de la Fuente, we gave a seminar on Data Science for Everyone. We spoke about how Orange can help anyone learn about data science and then use machine learning on their own data.

We pushed on this last point: say you walk through downtown Houston, pick the first three passersby, take them to a workshop and train them in machine learning, to the point where they could walk out of the training and use some machine learning at home. Say, cluster their family photos, or figure out which features of a Kickstarter project to optimize to get the funding.

How long would such a workshop take? Our informed guess: three hours. And of course, we illustrated this point to the seminar attendees by giving a demo of image clustering in Orange and showcasing Kickstarter data analysis.

Related: Image Analytics: Clustering

Seminars at Station Houston need to finish with homework. So we delivered some. Here it is:

  1. Open your browser.
  2. Find some images of your interest (mountains, cities, cars, fish, dogs, faces, whatever).
  3. Place images in a folder (Mac: just drag the thumbnails, Win: right click and Save Image).
  4. Download & install Orange. From Orange, install Image Analytics add-on (Options, Add-Ons).
  5. Use Orange to cluster images. Does clustering make sense?
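Behind the scenes, Orange’s image clustering embeds each image into a long feature vector and then clusters those vectors hierarchically. A minimal sketch of that second step, with random vectors standing in for real deep-network embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Mock embeddings: two groups of five "images", each described by a
# 2048-dimensional vector, as a deep-network embedder would produce
group_a = rng.normal(0.0, 0.1, size=(5, 2048))
group_b = rng.normal(1.0, 0.1, size=(5, 2048))
X = np.vstack([group_a, group_b])

# Hierarchical clustering on the embedding vectors; cutting the
# dendrogram into two clusters should recover the two groups
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

If the clusters match your mental grouping of the images, the clustering "makes sense" in the homework's terms.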

Data science and startups aside: there are some beautiful views from Station Houston. From the kitchen, there is a straight line of sight to Houston’s medical center, looming about 4 miles away.

And on the other side, there is a great view of the downtown.

It’s Sailing Time (Again)

Every fall I teach a course on Introduction to Data Mining. And while the course is really on statistical learning and its applications, I also venture into classification trees. For several reasons. First, I can introduce information gain and, with it, feature scoring and ranking. Second, classification trees were among the first machine learning approaches, co-invented by engineers (Ross Quinlan) and statisticians (Leo Breiman, Jerome Friedman, Charles J. Stone, Richard A. Olshen). And finally, because they form the basis of random forests, one of the most accurate machine learning models for small and mid-size data sets.

Related: Introduction to Data Mining Course in Houston

A lecture on classification trees has to start with the data. Years back, I crafted a data set on sailing. Every data set has to have a story. Here is one:

Sara likes weekend sailing. Though, not under any conditions. For the
past twenty Wednesdays I have asked her whether she would have any
company, what kind of boat she could rent, and I have checked the
weather forecast. Then, on Saturday, I wrote down whether she actually went to sea.

Data on Sara’s sailing contains three attributes (Outlook, Company, Sailboat) and a class (Sail).

The data comes with Orange, and you can get it from the Data Sets widget (currently in the Prototypes add-on, but soon to be moved to core Orange). It takes time, usually two lecture hours, to go through probabilities, entropy and information gain, but at the end, the data analysis workflow we develop with the students looks something like this:
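The entropy and information-gain computation we work through on the board can be sketched in a few lines of Python. The rows below are a hypothetical stand-in in the style of the sailing data, not the actual twenty Wednesdays:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Sail"):
    """Class entropy minus expected entropy after splitting on attribute."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical rows in the style of the sailing data set
data = [
    {"Outlook": "sunny", "Company": "big", "Sailboat": "small", "Sail": "yes"},
    {"Outlook": "sunny", "Company": "med", "Sailboat": "small", "Sail": "yes"},
    {"Outlook": "rainy", "Company": "med", "Sailboat": "small", "Sail": "no"},
    {"Outlook": "rainy", "Company": "big", "Sailboat": "big",   "Sail": "yes"},
    {"Outlook": "rainy", "Company": "no",  "Sailboat": "big",   "Sail": "no"},
    {"Outlook": "sunny", "Company": "no",  "Sailboat": "big",   "Sail": "yes"},
]

for attr in ("Outlook", "Company", "Sailboat"):
    print(attr, round(information_gain(data, attr), 3))
```

The attribute with the highest gain becomes the root of the tree; on these made-up rows, that is Outlook.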

And here is the classification tree:

It turns out that Sara is a social person. When the company is big, she goes sailing no matter what. When the company is smaller, she does not go sailing if the weather is bad. But when it is sunny, sailing is fun, even when alone.

Related: Pythagorean Trees and Forests

Classification trees are not very stable classifiers: even small changes in the data can change the trees substantially. This is an important concept that leads to the use of ensembles like random forests. It is also here, during my lecture, that I need to demonstrate this instability. I use Data Sampler and show a classification tree under the current sampling. The tree changes with every press of the Sample Data button. The workflow I use is below, but if you really want to see this in action, well, try it in Orange.
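The same instability can be sketched outside Orange: fit a tree on repeated bootstrap samples and watch which feature ends up in the root split. This is a minimal stand-in for the Data Sampler demo, using scikit-learn on the iris data rather than the sailing set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

root_features = []
for _ in range(20):
    # Bootstrap sample, playing the role of the Data Sampler widget
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    # Index of the feature used in this tree's root split
    root_features.append(tree.tree_.feature[0])

# Trees fit on different samples need not agree even on the root split
print(set(root_features))
```

Deeper in the trees, the disagreement between samples is even larger, which is exactly what makes averaging over a forest worthwhile.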

Outliers in Traffic Signs

Say I am given a collection of images of traffic signs and would like to find which signs stick out; that is, which traffic signs look substantially different from the others. I would assume that traffic signs are not all equally important and that some were designed to be noticed before the others.

I have assembled a small set of regulatory and warning traffic signs and stored the references to their images in a traffic-signs-w.tab data set.

Related: Viewing images

Related: Video on image clustering

Related: Video on image classification

The easiest way to display the images is to load this data file with the File widget and then pass the data to the Image Viewer.

Opening the Image Viewer allows me to see the images:

Note that the data table we have loaded initially contains no features on which we could do any machine learning. It includes just the category of the traffic sign, its name, and a link to its image.

We will use deep-network embedding to turn these images into numbers, describing each image with 2048 real-valued features. Then, we will use the Silhouette Plot to find which traffic signs are outliers within their own group. We would like to select these and visualize them in the Image Viewer.
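The silhouette computation behind this step can be sketched with scikit-learn, here on mock vectors standing in for the actual 2048-dimensional image embeddings. An instance with a low silhouette score fits its own group poorly, which is exactly our notion of an outlier:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(1)
# Mock embeddings for two sign categories, ten signs each
warning = rng.normal(0.0, 0.3, size=(10, 16))
regulatory = rng.normal(3.0, 0.3, size=(10, 16))
X = np.vstack([warning, regulatory])
labels = np.array([0] * 10 + [1] * 10)

# Plant one "odd" warning sign, far from the rest of its group
X[0] = rng.normal(1.5, 0.3, size=16)

# Per-instance silhouette scores; low scores flag outlier candidates
scores = silhouette_samples(X, labels)
outliers = np.argsort(scores)[:3]
print(outliers, scores[outliers])
```

In the workflow, the Silhouette Plot widget does this per-instance scoring and lets us shift-click the lowest silhouettes.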

Related: All I see is silhouette

Our final workflow, with the selection of the three biggest outliers (we used shift-click to select their corresponding silhouettes in the Silhouette Plot), is:

Isn’t this great? It turns out that traffic signs were carefully designed, such that the three outliers are indeed the signs we should never miss. It is great that we can now reconfirm this design choice with deep learning-based embedding and some straightforward machine learning tricks such as the Silhouette Plot.

Orange Workshops: Luxembourg, Pavia, Ljubljana

February was a month of Orange workshops.

Ljubljana: Biologists

We (Tomaž, Martin and I) started in Ljubljana with a hands-on course for the COST Action FA1405 Systems Biology Training School. This was a four-hour workshop with an introduction to classification and clustering, followed by the application of machine learning to the analysis of gene expression data from the plant Arabidopsis. The organization of this course even inspired us to create a new widget, GOMapMan Ontology, which was added to the Bioinformatics add-on. We also experimented with workflows that combine gene expressions and images of mutants. The idea was to find genes with similar expression profiles, and then show images of the plants in which these genes stood out.

Luxembourg: Statisticians

This workshop took place at STATEC, Luxembourg’s National Institute of Statistics and Economic Studies. We (Anže and I) were invited by Nico Weydert, STATEC’s deputy director, and gave a two-day lecture on machine learning and data mining to a room full of experienced statisticians. While the purpose was to showcase Orange as a tool for machine learning, we learned a lot from the participants of the course: the focus of machine learning is still different from that of classical statistics.

Statisticians at STATEC, like all other statisticians, I guess, value, above all, an understanding of the data: the accuracy of a model does not count if the model cannot be explained. Machine learning often sacrifices understanding for accuracy. With its focus on data and model visualization, Orange positions itself somewhere in between, but after our Luxembourg visit we are already planning new widgets for the explanation of predictions.

Pavia: Engineers

About fifty engineers of all kinds gathered at the University of Pavia: a few undergrads, mostly graduate students, some postdocs, and even quite a few faculty. This two-day course was a bit lighter than the one in Luxembourg, but it also covered the essentials of machine learning: data management, visualization and classification, with quite some emphasis on overfitting, on the first day, and clustering and data projection on the second day. We finished with a showcase on image embedding and analysis. I particularly enjoyed this last part of the workshop, where attendees were asked to grab a set of images and use Orange to find out whether they could cluster or classify them correctly. They gathered all kinds of images, like flowers, racing cars, guitars, photos from nature, you name it, and it was great to find that deep networks can be such good embedders: most students found that machine learning on their image sets worked surprisingly well.

Related: BDTN 2016 Workshop on introduction to data science

Related: Data mining course at Baylor College of Medicine

We thank Riccardo Bellazzi, the organizer of the Pavia course, for inviting us. Oh, yeah, the pizza at Rossopommodoro was great as always, though Michela’s pasta al pesto e piselli back at Riccardo’s home was even better.

The Beauty of Random Forest

It is the time of the year when we adore Christmas trees. But these are not the only trees we at the Orange team think about. In fact, through the almost life-long professional deformation of being a data scientist, when I think about trees I often think about classification and regression trees. And they can be beautiful as well. Not only for their elegance in explaining hidden patterns, but aesthetically, when rendered in Orange. And even more beautiful than a single tree is Orange’s rendering of a forest, that is, a random forest.

Related: Pythagorean Trees and Forests

Here are six trees in the random forest constructed on the housing data set:

The random forest for the annealing data set includes a set of smaller trees:

A Christmas-lit random forest inferred from the pen digits data set looks somewhat messy in trying to categorize into ten different classes:

The power of beauty! No wonder random forests are one of the best machine learning tools. Orange renders them according to the idea of Fabian Beck and colleagues who proposed Pythagoras trees for visualizations of hierarchies. The actual implementation for classification and regression trees for Orange was created by Pavlin Policar.
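The forests rendered above are simply collections of individual trees, and in a script each tree can be pulled out and inspected on its own. A minimal sketch with scikit-learn, on the wine data as a hypothetical stand-in for the data sets shown above:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# A small forest; each estimator is an ordinary decision tree
# trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=6, random_state=0).fit(X, y)

# A Pythagoras-tree rendering draws one tree per estimator;
# here we just report each tree's size (number of nodes)
sizes = [est.tree_.node_count for est in forest.estimators_]
print(sizes)
```

The varying node counts hint at why the rendered forests look so diverse: each tree sees a different sample of the data.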