Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was one of them this year.
The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…
Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.
To get a first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction of a workflow and simple visual exploration of any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with the Image Analytics add-on and explored whether we could detect wrongly labelled microscopy images with machine learning.
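The overfitting trap we demonstrated with Classification Tree can also be sketched outside of Orange, in a few lines of scikit-learn. The data below is synthetic (with deliberately noisy labels) and purely illustrative: an unpruned tree memorizes the training set perfectly, yet cross-validation tells a different story.

```python
# Sketch: an unpruned decision tree memorizes noisy training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped to simulate noise.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)

deep = DecisionTreeClassifier(random_state=0)             # no depth limit
shallow = DecisionTreeClassifier(max_depth=2, random_state=0)

deep_train = deep.fit(X, y).score(X, y)                   # perfect fit on training data
cv_deep = cross_val_score(deep, X, y, cv=5).mean()        # honest estimate is lower
cv_shallow = cross_val_score(shallow, X, y, cv=5).mean()

print(deep_train, cv_deep, cv_shallow)
```

Limiting the tree depth (here to 2) is one simple way out of the trap, which is exactly what pruning parameters in the Classification Tree widget do.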
The new Orange release (v. 3.3.9) welcomed a few wonderful additions to its widget family, including the Manifold Learning widget. The widget reduces the dimensionality of high-dimensional data and is thus wonderful in combination with visualization widgets.
The Manifold Learning widget offers five embedding techniques based on the scikit-learn library: t-SNE, MDS, Isomap, Locally Linear Embedding and Spectral Embedding. Each handles the mapping differently and has its own set of parameters.
For example, the popular t-SNE requires little more than a metric (e.g. cosine distance). In the demonstration of this widget, we output 2 components, since they are the easiest to visualize and make sense of.
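Since the widget wraps scikit-learn under the hood, the same embedding can be sketched directly in Python. The data below is random and purely illustrative; only the parameters mirror the widget's settings:

```python
# Sketch: t-SNE embedding into 2 components with a cosine metric,
# as the Manifold Learning widget does via scikit-learn.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # illustrative high-dimensional data

tsne = TSNE(n_components=2, metric="cosine", init="random", random_state=0)
X_2d = tsne.fit_transform(X)     # 100 points, now in 2D
print(X_2d.shape)
```

The resulting two columns are exactly what you would feed into a Scatter Plot.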
First, let’s load the data and open it in Scatter Plot. Not a very informative visualization, right? The dots form an unrecognizable square in 2D.
Let’s use embeddings to make things a bit more informative. This is what the data looks like with a t-SNE embedding. The data is starting to take shape, and the data points colored according to the regression class reveal a beautiful gradient.
Ok, how about MDS? This is beyond our expectations!
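For comparison, here is the MDS counterpart of the sketch above, again on illustrative random data; in the widget you would simply pick MDS from the method list instead:

```python
# Sketch: MDS embedding into 2 components via scikit-learn,
# the same estimator the Manifold Learning widget uses.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))    # illustrative high-dimensional data

mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)      # 60 points, now in 2D
print(X_2d.shape)
```

Unlike t-SNE, MDS tries to preserve pairwise distances globally, which is why the two methods can paint quite different pictures of the same data.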
There’s a plethora of options with embeddings. You can play around with ImageNet embeddings and plot them in 2D or use any of your own high-dimensional data and discover interesting visualizations! Although t-SNE is nowadays probably the most popular dimensionality reduction technique used in combination with scatterplot visualization, do not underestimate the value of other manifold learning techniques. For one, we often find that MDS works fine as well.
Recently we participated in Days of Computer Science, organized by the Museum of Post and Telecommunications and the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. The project brought together pupils and students from around the country and, we hope, showed them what computer science is really about. Most children think programming is just typing lines of code. But it’s more than that: it’s a way of thinking, a way to solve problems creatively and efficiently. Even better, computer science can be used for solving a great variety of problems.
The Orange team prepared a small demo project called Celebrity Lookalike. We found 65 celebrity photos online and loaded them into Orange. We then cropped the photos to faces and converted them to black and white, to avoid bias from background and color. Finally, we inferred embeddings with the ImageNet widget, obtaining 2048 features per image, taken from the penultimate layer of the ImageNet neural network.
Still, we needed a reference photo to find a celebrity lookalike for. Students could take a selfie, from which we similarly extracted a black-and-white face. Its embeddings were computed and sent to the Neighbors widget, which finds the n closest neighbors to the provided reference according to the chosen distance measure. We decided to output the 10 closest neighbors by cosine distance.
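The neighbor search itself boils down to a few lines of NumPy. The function and data below are illustrative sketches, not the Neighbors widget's actual API:

```python
# Sketch: the 10 rows closest to a reference vector by cosine distance.
import numpy as np

def closest_neighbors(embeddings, reference, n=10):
    """Return indices of the n rows of `embeddings` nearest to
    `reference` under cosine distance."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference)
    distances = 1.0 - emb @ ref   # cosine distance = 1 - cosine similarity
    return np.argsort(distances)[:n]

# Hypothetical 2048-dimensional embeddings for 65 celebrity photos.
rng = np.random.default_rng(0)
celebrities = rng.normal(size=(65, 2048))
selfie = rng.normal(size=2048)
print(closest_neighbors(celebrities, selfie))
```

Cosine distance ignores vector magnitude and compares only direction, which tends to work well for neural-network embeddings like these.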
Finally, we used the Lookalike widget to display the result. Students found it hilarious when curly-haired boys were matched with the Queen of England and girls with glasses with Steve Jobs. They actively tried to figure out how the algorithm works by photographing a statue, people with and without glasses or with hats on, or by making funny faces.
Hopefully this inspires a new generation of students to become scientists and researchers, and to actively find solutions to their problems. Coding or not. 🙂
Note: Most widgets we designed for this project (like Face Detector, Webcam Capture, and Lookalike) are available in Orange3-Prototypes and are not actively maintained. They can, however, be used for personal projects and sheer fun. Orange does not own the copyright of the images.