From 15th to 19th July 2019, Orange team will hold an introductory data science course at this year’s Doctoral Summer School, organized by the School of Economics and Business, University of Ljubljana. This is the second year we will be a part of this summer school. Like the previous year, we will cover a wide variety of topics, from exploratory analysis and clustering techniques to predictive modeling and data projections. Applications are open to PHD students, post-docs, academics and professionals by the end of June.
What: Practical Introduction to Machine Learning and Data Analytics.
Two days ago we held another Introduction to Data Mining workshop at our faculty. This time the target audience was a group of public sector professionals and our challenge was finding the right data set to explain key data mining concepts. Iris is fun, but not everyone is a biologist, right? Fortunately, we found this really nice data set with ballot counts from the Slovenian National Assembly (thanks to Parlameter).
The data contains ballot counts, statistics, and description for 84 members of the parliament (MPs). First, we inspected the data in a Data Table. Each MP is described with 14 meta features and has 18 ballot counts recorded.
We have some numerical features, which means we can also inspect the data in Scatter Plot. We will plot MPs’ attendance vs. the number of their initiatives. Quite interesting! There is a big group of MPs who regularly attend the sessions, but rarely propose changes. Could this be the coalition?
The next question that springs to our mind is – can we discover interesting voting patterns from our data? Let us see. We first explored the data in Hierarchical Clustering. Looks like there are some nice clusters in our data! The blue cluster is the coalition, red the SDS party and green the rest (both from the opposition).
But it is hard to inspect so many data instances in a dendrogram. For example, we have no idea how similar are the voting records of Eva Irgl and Alenka Bratušek. Surely, there must be a better way to explore similarities and perhaps verify that voting patterns exist at even a party-level… Let us try MDS. MDS transforms multidimensional data into a 2D projection so that similar data instances lie close to each other.
Ah, this is nice! We even colored data points by the party. MDS beautifully shows the coalition (blue dots) and the opposition (all other colors). Even parties are clustered together. But there are some outliers. Let us inspect Matej Tonin, who is quite far away from his orange group. Seems like he was missing at the last two sessions and did not vote. Hence his voting is treated differently.
It is always great to inspect discovered groups and outliers. This way an expert can interpret the clusters and also explain, what outliers mean. Sometimes it is simply a matter of data (missing values), but sometimes we could find shifting alliances. Perhaps an outlier could be an MP about to switch to another party.
You can have fun with these data, too. Let us know if you discover something interesting!
This was already the second installment of Introduction to Data Mining Course at Baylor College of Medicine in Houston, Texas. Just like the last year, the course was packed. About 50 graduate students, post-docs and a few faculty attended, making the course one of the largest elective PhD courses from over a hundred offered at this prestigious medical school.
The course was designed for students with little or no experience in data science. It consisted of seven two-hour lectures, each followed by a homework assignment. We (Blaz and Janez) lectured on data visualization, classification, regression, clustering, data projection and image analytics. We paid special attention to the problems of overfitting, use of regularization, and proper ways of testing and scoring of modeling methods.
The course was hands-on. The lectures were practical. They typically started with some data set and explained data mining techniques through designing data analysis workflows in Orange. Besides some standard machine learning and bioinformatics data sets, we have also painted the data to explore, say, the benefits of different classification techniques or design data sets where k-means clustering would fail.
This year, the course benefited from several new Orange widgets. The recently published interactive k-means widget was used to explain the inner working of this clustering algorithm, and polynomial classification widget was helpful in discussion of decision boundaries of classification algorithms. Silhouette plot was used to show how to evaluate and explore the results of clustering. And finally, we explained concepts from deep learning using image embedding to show how already trained networks can be used for clustering and classification of images.