We all know that sometimes many are better than few. That is why we are happy to introduce the Stacking widget, available in the Prototypes add-on for now.
Stacking enables you to combine several trained models into one meta model and use it in Test & Score just like any other model. This comes in handy with complex problems, where a single classifier might fail, but several together can come up with something that works. Let's see an example.
We start with something as complex as this. We used Paint Data to create a data set where classes somewhat overlap. This is, of course, an artificial example, but you can try the same on your own real-life data.
Then we add several kNN models with different parameters, say 5, 10 and 15 neighbors. We connect them to Test & Score and use cross-validation to evaluate their performance. Not bad, but can we do even better?
Let us try stacking. We connect all three classifiers to the Stacking widget and use Logistic Regression as the aggregate, the method that combines the three models into a single meta model. Then we connect the stacked model to Test & Score and see whether our scores improve.
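The same workflow can be sketched outside the GUI with scikit-learn, which the Orange models wrap. This is a minimal illustration, not the widget's code; the synthetic data stands in for the Paint Data example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with some class overlap, as in the Paint Data example.
X, y = make_classification(n_samples=500, n_informative=4, random_state=0)

# Three kNN base learners with 5, 10 and 15 neighbors, as in the workflow.
base = [("knn%d" % k, KNeighborsClassifier(n_neighbors=k)) for k in (5, 10, 15)]

# Logistic Regression aggregates the base models' predictions into a meta model.
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression())

print("stacked AUC: %.3f" % cross_val_score(stack, X, y, scoring="roc_auc", cv=5).mean())
```

Evaluating the individual kNN models with `cross_val_score` the same way lets you check whether the stacked model indeed scores higher.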
And indeed they have. The improvement might not be dramatic, but in real life, say in a medical context, even small improvements count. Now go and try the procedure on your own data. In Orange, this takes only a couple of minutes.
The Orange3 Network add-on contains a convenient Network Explorer widget for network visualization. Orange uses an iterative force-directed method (a variation of the Fruchterman-Reingold algorithm) to lay out the nodes on the 2D plane.
The goal of force-directed methods is to draw connected nodes close to each other as if the edges that connect the nodes were acting as springs. We also don’t want all nodes crowded in a single point, but would rather have them spaced evenly. This is achieved by simulating a repulsive force, which decreases with the distance between nodes.
There are two types of forces acting on each node:
the attractive force towards connected adjacent nodes,
the repulsive force that is directed away from all other nodes.
We could say that such network visualization as a whole is rather repulsive. Take, for example, the lastfm.net network that comes with Orange's Network add-on: it has around 1.000 nodes and 4.000 edges. In every iteration, we have to compute 4.000 attractive forces and 1.000.000 repulsive forces, one for each of the 1.000 × 1.000 pairs of nodes. It takes about 100 iterations to get a decent network layout. That's a lot of repulsion, and you'll have to wait a while before you get the final layout.
Fortunately, we found a simple hack to speed things up. When computing the repulsive force acting on some node, we only consider a 10% sample of other nodes to obtain an estimate. We multiply the result by 10 and hope it’s not off by too much. By choosing a different sample in every iteration we also avoid favoring some set of nodes.
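The sampling trick above can be sketched in a few lines of NumPy. This is a minimal illustration, not Orange's actual layout code; the function and variable names (`repulsion`, `positions`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((1000, 2))  # 1000 nodes placed on the 2D plane

def repulsion(i, pos, sample_frac=0.1):
    """Estimate the total repulsive force on node i from a sample of nodes."""
    n = len(pos)
    k = max(1, int(n * sample_frac))
    idx = rng.choice(n, size=k, replace=False)  # fresh sample every call
    idx = idx[idx != i]                         # skip the node itself
    diff = pos[i] - pos[idx]                    # vectors pointing away from sampled nodes
    dist2 = (diff ** 2).sum(axis=1) + 1e-9      # squared distances; avoid division by zero
    force = (diff / dist2[:, None]).sum(axis=0) # repulsion decreases with distance
    return force * (n / len(idx))               # scale the sample estimate back up

print(repulsion(0, positions))
```

Drawing a different sample on every call is what avoids systematically favoring some set of nodes across iterations.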
The left layout is obtained without sampling, while the right one uses 10% sampling. The results are pretty similar, but the sampling method is 10 times faster!
Now that the computation is fast enough, it is time to also speed up the drawing. But that's a task for 2018.
On Monday we finished the second part of the workshop for the Statistical Office of the Republic of Slovenia. The crowd was tough – these guys knew their numbers and asked many challenging questions. And we loved it!
One thing we discussed was how to properly test your model. Ok, we know never to test on the same data we built the model with, but even training and testing on separate data is sometimes not enough. Say we've tested Naive Bayes, Logistic Regression and Tree. Sure, we can select the one that gives the best performance, but by always picking the best-scoring model we could be (over)fitting the model selection itself.
To account for this, we would normally split the data to 3 parts:
training data for building a model
validation data for testing which parameters and which model to use
test data for estimating the accuracy of the model
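The three-way split above can be done with two calls to scikit-learn's `train_test_split`. A minimal sketch; the breast cancer data here is just a convenient stand-in for any data set, and the proportions are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off the final test set, to be touched only once at the end ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# ... then split the remainder into training and validation data,
# used for choosing the model and its parameters.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```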
Let us try this in Orange. Load the heart-disease.tab data set from Browse documentation data sets in the File widget. We have 303 patients diagnosed with blood vessel narrowing (1) or diagnosed as healthy (0).
Now we will use the Data Sampler widget to split the data into two parts: 85% of the data for training and 15% for testing. We send the first 85% onwards to build a model.
We will use Naive Bayes, Logistic Regression and Tree, but you can try other models, too. This is also the place and time to try different parameters. Now we send the models to Test & Score. We used cross-validation and discovered that Logistic Regression scores the highest AUC. Say this is the model and parameters we want to go with.
Now it is time to bring in our test data (the remaining 15%) for testing. Connect Data Sampler to Test & Score once again and set the connection Remaining Data – Test Data.
Test & Score will warn us we have test data present, but unused. Select Test on test data option and observe the results. These are now the proper scores for our models.
Seems like LogReg still performs well. Such a procedure would normally be useful when testing a lot of models with different parameters (say 100+), which you would not normally do in Orange. But it's good to know how to do the scoring properly. Now we're off to report on the results in Nature… 😉
We’ve been having a blast with recent Orange workshops. While Blaž was getting tanned in India, Anže and I went to the charming Liverpool to hold a session for business school professors on how to teach business with Orange.
Obviously, when we say teach business, we mean how to do data mining for business, say predict churn or employee attrition, segment customers, find which items to recommend in an online store and track brand sentiment with text analysis.
For this purpose, we have made some updates to our Associate add-on and added a new data set to the Data Sets widget, which can be used for customer segmentation and discovering which item groups are frequently bought together. Like this:
We load the Online Retail data set.
Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute distances between items (rows). Alternatively, we could simply ask the Distances widget to compute distances between columns instead of rows. Then we send the transposed data table to Distances and compute the cosine distance between items (cosine distance only tells us which items are purchased together, disregarding the quantities purchased).
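The transpose-then-cosine step looks like this in NumPy and SciPy. A toy sketch with made-up purchase counts, just to show why cosine distance ignores quantities and captures only the co-purchase pattern.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are transactions, columns are items; values are purchase counts (made up).
baskets = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 1],
])

items = baskets.T                               # transpose: rows are now items
dist = squareform(pdist(items, metric="cosine"))  # pairwise item-to-item distances

# Items 0 and 1 appear in the same baskets, so their cosine distance is small
# even though they were bought in different quantities.
print(np.round(dist, 2))
```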
Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration. 🙂
The second workshop was our standard Introduction to Data Mining for the Ministry of Public Affairs.
This group, similar to the one from India, was a pack of curious individuals who asked many interesting questions and were not shy to challenge us. How does a Tree know which attribute to split by? Is Tree better than Naive Bayes? Or is perhaps Logistic Regression better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!
Workshops are always fun, when you have a curious set of individuals who demand answers! 🙂
Indian Statistical Institute lies in the heart of old Kolkata. This peaceful oasis of a picturesque campus, with mango orchards and waterlily lakes, was founded by Prof. Prasanta Chandra Mahalanobis, one of the giants of statistics. Today, the Institute researches statistics and computational approaches to data analysis and runs a grad school, where a rather small number of students are hand-picked from tens of thousands of applicants.
The course was hands-on. The number of participants was limited to forty, a limitation posed by the number of computers in the Institute's largest computer lab. Half of the students came from the Institute's grad school, and the other half from other universities around Kolkata or even other schools around India, including a few participants from another famous institution, the Indian Institutes of Technology. While the lecture included some writing on the whiteboard to explain machine learning, the majority of the course was about exploring example data sets, building workflows for data analysis, and using Orange on practical cases.
The course was not one of the lightest for the lecturer (Blaž Zupan). About five full hours each day for five days in a row, extremely motivated students with questions filling all of the coffee breaks, the need for deeper dive into some of the methods after questions in the classroom, and much need for improvisation to adapt our standard data science course to possibly the brightest pack of data science students we have seen so far. We have covered almost a full spectrum of data science topics: from data visualization to supervised learning (classification and regression, regularization), model exploration and estimation of quality. Plus computation of distances, unsupervised learning, outlier detection, data projection, and methods for parameter estimation. We have applied these to data from health care, business (which proposal on Kickstarter will succeed?), and images. Again, just like in our other data science courses, the use of Orange’s educational widgets, such as Paint Data, Interactive k-Means, and Polynomial Regression helped us in intuitive understanding of the machine learning techniques.
The course was beautifully organized by Prof. Dr. Saurabh Das with the help of Prof. Dr. Shubhra Sankar Ray and we would like to thank them for their devotion and excellent organization skills. And of course, many thanks to participating students: for an educator, it is always a great pleasure to lecture and work with highly motivated and curious colleagues that made our trip to Kolkata fruitful and fun.
We know you’ve missed it. We’ve been getting many requests to bring back the Neural Network widget, but we also had many reservations about it.
Neural networks are powerful and great, but doing them right is not straightforward. And doing them right in the context of a GUI-based visual programming tool like Orange is a twisted double helix of a roller coaster.
Do we make each layer a widget and then stack them? Do we use parallel processing or try to do something server-side? Theano or Keras? Tensorflow perhaps?
We were so determined to do things properly, that after the n-th iteration we still had no clue what to actually do.
Then one day a silly novice programmer (a.k.a. me) had enough and just threw scikit-learn’s Multi-layer Perceptron model into a widget and called it a day. There you go. A Neural Network widget just like it was in Orange2 – a wrapper for a scikit-learn function that works out of the box. Nothing fancy, nothing powerful, but it does its job. It models things and it predicts things.
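What the widget boils down to is a thin wrapper around scikit-learn's `MLPClassifier`. A minimal sketch, not the widget's actual code; the hyperparameters here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# One hidden layer of 100 neurons, scikit-learn's default architecture.
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)

print("CV accuracy: %.2f" % cross_val_score(model, X, y, cv=5).mean())
```

Nothing fancy indeed: it trains, it predicts, and it plugs into the same evaluation machinery as any other scikit-learn model.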
Our streak of workshops continues. This time we taught professionals from public administration how they can leverage data analytics and machine learning to retrieve interesting information from surveys. Thanks to the Ministry of Public Administration, this is only the first in a line of workshops on data science we are preparing for public sector employees.
For this purpose, we have designed the EnKlik Anketa widget, which you can find in the Prototypes add-on. The widget reads data from the Slovenian online survey service OneClick Survey and imports the results directly into Orange.
We have prepared a test survey, which you can import by entering a public link to data into the widget. Here’s the link: https://www.1ka.si/podatki/141025/72F5B3CC/ . Copy it into the Public link URL line in the widget. Once you press Enter, the widget loads the data and displays retrieved features, just like the File widget.
The survey is in Slovenian, but we can use Edit Domain to turn the feature names into their English equivalents.
As always, we can check the data in a Data Table. We have 41 respondents and 7 questions. Each respondent chose a nickname, which makes it easier to browse the data.
Now we can perform familiar clustering to uncover interesting groups in our data. Connect Distances to Edit Domain and Hierarchical Clustering to Distances.
We have two outliers, Pipi and Chad. One is an excessive sportsman (100 h of sport per week) and the other terminally ill (general health -1). Or perhaps they both simply didn’t fill out the survey correctly. If we use the Data Table to filter out Pipi and Chad, we get a fairly good clustering.
We can use Box Plot to observe what makes each cluster special. Connect Box Plot to Hierarchical Clustering (with the two groups selected), select grouping by Cluster and tick Order by relevance.
Seems like our second cluster (C2) is the sporty one. If we are serving in the public administration, perhaps we can design initiatives targeting cluster C1 to do more sports. It is so easy to analyze the data in Orange!
Last week, we presented Orange at the Festival of Open Data, a mini-conference organized by the Slovenian government, dedicated to the promotion of transparent access to government data. In a 10 minute presentation, we showed how Orange can be used to visualize and explore what kinds of vehicles were registered for the first time in Slovenia in 2017.
When exploring the data, the first thing we do is take a look at distributions. If we observe the distribution of new and used cars bought by the gender of the buyer, we can see that men prefer used cars while women more often opt for a new car. Or we can observe the distribution by age to see that older people tend to buy newer cars.
But the true power of Orange can be seen if we visualize the data on a map. In order to do this, we first need to use Geocoding to map municipality names to regions that can be shown on a map, by choosing the column that contains the municipality name (C1.3-Obcina uporabnika) and clicking Apply. Since new municipalities in Slovenia are created all the time, not all of them can be matched. The right part of the widget allows us to map these small municipalities to the nearest region. Or we can just ignore them.
The geocoded data can be displayed with Choropleth. If we select the attribute D.1-Znamka and aggregation by mode, we get a visualization showing the most frequently bought car make for each region. Care to guess which manufacturer corresponds to the pink(-ish) color? It’s Volkswagen, in some regions with Golf and in other regions with Passat. But the visualization gives us just the most frequent value for each municipality. What if we would like to know more? As is the case with all visualizations, you can click on a specific region on the map to select it and get the corresponding data on the output. We can then use Purge Domain to ignore the models that were not sold in the selected region and Box Plot to visualize the distribution by the model or by the manufacturer.
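Aggregation by mode simply means picking the most frequent value per group. A toy sketch in plain Python, with made-up region and make values standing in for the real registration data:

```python
from collections import Counter, defaultdict

# (region, car make) pairs; values are made-up stand-ins for the real data.
registrations = [
    ("Ljubljana", "Volkswagen"), ("Ljubljana", "Volkswagen"),
    ("Ljubljana", "Renault"),
    ("Maribor", "Renault"), ("Maribor", "Renault"), ("Maribor", "Fiat"),
]

# Count makes per region, then keep the most common one (the mode).
by_region = defaultdict(Counter)
for region, make in registrations:
    by_region[region][make] += 1

modes = {region: counts.most_common(1)[0][0]
         for region, counts in by_region.items()}
print(modes)  # {'Ljubljana': 'Volkswagen', 'Maribor': 'Renault'}
```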
In Box Plot, select D.1 Znamka as both the variable and Subgroup and you get an overview of the distribution of cars by manufacturers in the selected region. But that is just the first step. We can also take a look at the distribution of Fiat cars by adding another boxplot. Now you can select the manufacturer and get a detailed distribution of specific car models sold. If you take some care in positioning the windows, you can create an interactive explorer, where you click on regions and instantly see the detailed distributions in the connected boxplots.
Two days ago we held another Introduction to Data Mining workshop at our faculty. This time the target audience was a group of public sector professionals and our challenge was finding the right data set to explain key data mining concepts. Iris is fun, but not everyone is a biologist, right? Fortunately, we found this really nice data set with ballot counts from the Slovenian National Assembly (thanks to Parlameter).
The data contains ballot counts, statistics, and description for 84 members of the parliament (MPs). First, we inspected the data in a Data Table. Each MP is described with 14 meta features and has 18 ballot counts recorded.
We have some numerical features, which means we can also inspect the data in Scatter Plot. We will plot MPs’ attendance vs. the number of their initiatives. Quite interesting! There is a big group of MPs who regularly attend the sessions, but rarely propose changes. Could this be the coalition?
The next question that springs to our mind is – can we discover interesting voting patterns from our data? Let us see. We first explored the data in Hierarchical Clustering. Looks like there are some nice clusters in our data! The blue cluster is the coalition, red the SDS party and green the rest (both from the opposition).
But it is hard to inspect so many data instances in a dendrogram. For example, we have no idea how similar the voting records of Eva Irgl and Alenka Bratušek are. Surely, there must be a better way to explore similarities and perhaps verify that voting patterns exist even at the party level… Let us try MDS. MDS transforms multidimensional data into a 2D projection so that similar data instances lie close to each other.
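For the curious, MDS is also available directly in scikit-learn. A minimal sketch on a made-up vote matrix (rows are MPs, columns are ballots, 1/-1 for/against), standing in for the real data:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Two voting blocs with opposite tendencies: 10 coalition-like MPs,
# 10 opposition-like MPs, 18 ballots each. Values are synthetic.
votes = np.vstack([
    rng.choice([1, -1], size=(10, 18), p=[0.8, 0.2]),
    rng.choice([1, -1], size=(10, 18), p=[0.2, 0.8]),
])

# Embed the 18-dimensional voting records into 2D, preserving
# pairwise distances as well as possible.
xy = MDS(n_components=2, random_state=0).fit_transform(votes)
print(xy.shape)  # (20, 2)
```

Plotting `xy` colored by bloc would show the two groups as separate clouds, just like the coalition and opposition in the post.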
Ah, this is nice! We even colored the data points by party. MDS beautifully shows the coalition (blue dots) and the opposition (all other colors). Even parties are clustered together. But there are some outliers. Let us inspect Matej Tonin, who lies quite far from his orange group. Seems like he missed the last two sessions and did not vote, hence his voting record stands apart.
It is always great to inspect discovered groups and outliers. This way an expert can interpret the clusters and also explain, what outliers mean. Sometimes it is simply a matter of data (missing values), but sometimes we could find shifting alliances. Perhaps an outlier could be an MP about to switch to another party.
You can have fun with these data, too. Let us know if you discover something interesting!
With over 262 member companies, Station Houston is the largest hub for tech startups in Houston.
One of its members is also Genialis, a life science data exploration company that emerged from our lab and is now delivering pipelines and user-friendly apps for analytics in systems biology.
Thanks to the invitation by the director of operations Alex de la Fuente, we gave a seminar on Data Science for Everyone. We spoke about how Orange can support anyone to learn about data science and then use machine learning on their own data.
We pushed on this last point: say you walk in downtown Houston, pick the first three passersby, take them to a workshop and train them in machine learning. To the point where they could walk out of the training and use some machine learning at home. Say, cluster their family photos, or figure out which Kickstarter project features to optimize to get funding.
How long would such a workshop take? Our informed guess: three hours. And of course, we illustrated this point to the seminar attendees by giving a demo of the clustering of images in Orange and showcasing Kickstarter data analysis.