Did you know that the widget for support vector machines (SVM) classifier can output support vectors? And that you can visualise these in any other Orange widget? In the context of all other data sets, this could provide some extra insight into how this popular classification algorithm works and what it actually does.
Ideally, that is, in the case of linear seperability, support vector machines (SVM) find a hyperplane with the largest margin to any data instance. This margin touches a small number of data instances that are called support vectors.
In Orange 3.0 you can set the SVM classification widget to output also the support vectors and visualize them. We used Iris data set in the File widget and classified data instances with SVM classifier. Then we connected both widgets with Scatterplot and selected Support Vectors in the SVM output channel. This allows us to see support vectors in the Scatterplot widget – they are represented by the bold dots in the graph.
Even though the summer is nigh, we are hardly going to catch a summer break this year. Orange team is busy holding workshops around the world to present the latest widgets and data mining tools to the public. Last week we had a very successful tutorial at [BC]2 in Basel, Switzerland, where Marinka and Blaž presented data fusion. A part of the tutorial was a hands-on workshop with Orange’s new add-on for data fusion. Marinka also got an award for the poster, where data fusion was used to hunt for Dictyostelium bacterial-response genes. This week, we are in Pavia, Italy, also for Matrix Computations in Biomedical Informatics Workshop at AIME 2015, a Conference on Artificial Intelligence in Medicine. During the workshop, we are giving an invited talk on learning latent factor models by data fusion and we’ll also show Orange’s data fusion add-on. Thanks to the workshop organizers, Riccardo Bellazzi, Jimeng Sun and Ping Zhang, the workshop program looks great.
We design the tutorial for data mining researchers and molecular biologists with interest in large-scale data integration. In the tutorial we focus on collective latent factor models, a popular class of approaches for data fusion. We demonstrate the effectiveness of these approaches on several hands-on case studies from recommendation systems and molecular biology.
This is a high-risk event. I mean, for us, lecturers. Ok, no bricks will probably fall down. But, in the part of the tutorial, this is the first time we are showing Orange’s data fusion add-on. And not just showing: part of the tutorial is a hands-on session.
We would like to acknowledge Biolab members for pushing the widgets through the development pipeline under extreme time constraints. Special thanks to Anze, Ales, Jernej, Andrej, Marko, Aleksandar and all other members of the lab.
Orange is about to get even more exciting! We have created a prototype add-on for data fusion, which will certainly be of interest to many users. Data fusion brings large heterogeneous data sets together to create sensible clusters of related data instances and provides a platform for predictive modelling and recommendation systems.
This widget set can be used either to recommend you the next movie to watch based on your demographic characteristics, movies you gave high scores to, your preferred genre, etc. or to suggest you a set of genes that might be relevant for a particular biological function or process. We envision the add-on to be useful for predictive modeling dealing with large heterogeneous data compendia, such as life sciences.
The prototype set will be available for download next week, but we are happy to give you a sneak peek below.
Movie Ratings widget is pre-set to offer data on movie ratings by users with 706 users and 855 movies (10% of the data selected as a subset).
We add IMDb Actors to filter the data by matching movie ratings with actors.
Then we add the Fusion Graph widget to fuse the data together. Here we have two object types, i.e. users and movies, and one relation between them, i.e. movie ratings.
In Latent Factors we see latent data representation demonstrated by red squares at the side. Let’s select a latent matrix associated with Users as our input for the Data Table.
In Data Table we see the latent data matrix of Users. The algorithm infers low-dimensional user profiles by collective consideration of entire data collection, i.e. movie ratings and actor information. In our scenario the algorithm has transformed 855 movie titles into 70 movie groupings, i.e. latent components.
Orange 3.0 version comes with an exciting feature that will simplify reading your data. If the old Orange required conversion from Excel into either tab-delimited or comma-separated files, the new version allows you to open plain .xlsx format data sets in the program. Naturally, the .txt and .csv files are still readable in Orange, so feel free to use data sets in any of the above-mentioned formats.
Since Orange 3.0 is still in the development mode, you will find a smaller selection of widgets available at the moment, but give it a go and see how it works for Excel type data and whether the existing widgets are sufficient for your data analysis. Please find the daily build for OSX here.
You might think “casual Fridays” are the best thing since sliced bread. But what if I were to tell you we have “Orange Fridays” at our lab, where lab members focus solely on debugging Orange software and making improvements to existing features. This is because the new developing version of Orange (3.0) still needs certain widgets to be implemented, such as net explorer, radviz, and survey plot.
But there’s more. We are currently hosting an expert on data fusion from the University of Leuven, prof. dr. Yves Moreau, to discuss new venues and niches for the development of Orange. The big debate is how to scale the program to fit large data sets and make it possible to process such sets in a shorter period of time. If you have any ideas and suggestions, please feel free to share them on our community forum.
Orange 3 is slowly, but steadily, gaining support for working with data stored in a SQL database. The main focus is to allow huge data sets that do not fit into RAM to be analyzed and visualized efficiently. Many widgets already recognize the type of input data and perform the necessary computations intelligently. This means that data is not downloaded from the database and analyzed locally, but is retained on the remote server, with the computation tasks translated into SQL queries and offloaded to the database engine. This approach takes advantage of the state-of-the-art optimizations relational databases have for working with data that does not fit into working memory, as well as minimizes the transfer of required information to the client.
We demonstrate how to explore and visualize data stored in a SQL table on a remote server in the following short video. It shows how to connect to the server and load the data with the SqlTable widget, manipulate the data (Select Columns, Select Rows), obtain the summary statistics (Box plot, Distributions), and visualize the data (Heat map, Mosaic Display).
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 318633
These days, we (Blaz Zupan and Marinka Zitnik, with full background support of entire Bioinformatics Lab) are running a three-day course on Data Mining in Python. Riccardo Bellazzi, a professor at University of Pavia, a world-renown researcher in biomedical informatics, and most of all, a great friend, has invited us to run the elective course for Pavia’s grad students. The enrollment was, he says, overwhelming, as with over 50 students this is by far the best attended grad course at Pavia’s faculty of engineering in the past years.
We have opted for the hands-on course and a running it as a workshop. The lectures include a new, development version of Orange 3, and mix it with numpy, scikit-learn, matplotlib, networkx and bunch of other libraries. Course themes are classification, clustering, data projection and network analysis.
We are rushing, full speed ahead, towards Orange 3. A complete revamp of Orange in Python 3 changes its data model to that of numpy, making Orange compatible with an array of Python-based data analytics. We are rewriting all the widgets for visual programming as well. We have two open fronts: the scripting part, and the widget part. So much to do, but it is going well: the closed tasks for widgets are those on the left of Anze (the board full of sticky notes), and those open, in minority, are on Anze’s right. Oh, by the way, it’s Anze who is managing the work and he looks quite happy.
By a popular demand, we have just published a tutorial on how to load the data table into Orange. Besides its own .tab format, Orange can load any tab or comma delimited data set. The details are though in writing header rows that tell Orange about the type and domain of each attribute. The tutorial is a step-by-step description on how to do this and how to transfer the data from popular spreadsheet programs like Excel.