Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for the students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was this year one of them.
The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…
Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.
To get the first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction and simple visual exploration on any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with ImageAnalytics add-on and observed whether we can detect wrongly labelled microscopy images with machine learning.
We have covered some basic data visualization, modeling (classification) and model scoring, hierarchical clustering and data projection, and finished with a touch of deep-learning by diving into image analysis by deep learning-based embedding.
It’s not easy to pack so many new things on data analytics within four hours, but working with Orange helps. This was a hands-on workshop. Students brought their own laptops with Orange and several of its add-ons for bioinformatics and image analytics. We also showed how to prepare one’s own data using Google Forms and designed a questionary, augment it in a class, run it with students and then analyze the questionary with Orange.
The hard part of any short course that includes machine learning is how to explain overfitting. The concept is not trivial for data science newcomers, but it is so important it simply cannot be left out. Luckily, Orange has some cool widgets to help us understanding the overfitting. Below is a workflow we have used. We read some data (this time it was a yeast gene expression data set called brown-selected that comes with Orange), “destroyed the data” by randomly permuting the column with class values, trained a classification tree, and observed near perfect results when the model was checked on the training data.
Sure this works, you are probably saying. The models should have been scored on a separate test set! Exactly, and this is what we have done next with Data Sampler, which lead us to cross-validation and Test & Score widget.
This was a great and interesting short course and we were happy to contribute to the success of the student-run MLSC-2016 conference.
Even though the summer is nigh, we are hardly going to catch a summer break this year. Orange team is busy holding workshops around the world to present the latest widgets and data mining tools to the public. Last week we had a very successful tutorial at [BC]2 in Basel, Switzerland, where Marinka and Blaž presented data fusion. A part of the tutorial was a hands-on workshop with Orange’s new add-on for data fusion. Marinka also got an award for the poster, where data fusion was used to hunt for Dictyostelium bacterial-response genes. This week, we are in Pavia, Italy, also for Matrix Computations in Biomedical Informatics Workshop at AIME 2015, a Conference on Artificial Intelligence in Medicine. During the workshop, we are giving an invited talk on learning latent factor models by data fusion and we’ll also show Orange’s data fusion add-on. Thanks to the workshop organizers, Riccardo Bellazzi, Jimeng Sun and Ping Zhang, the workshop program looks great.
Actually, there was a lot of programming, but no Python or alike. The workshop was designed for biomedical students and Baylor’s faculty members. We have presented a visual programming approach for development of data mining workflows for interactive data exploration. A three-hour workshop consisted of 15 data mining lessons on visual data exploration, classification, clustering, network analysis, and gene expression analytics. Each lesson focused on a particular data analysis task that the attendees solved with Orange.