Orange in Pavia, Italy

These days, we (Blaz Zupan and Marinka Zitnik, with full background support of entire Bioinformatics Lab) are running a three-day course on Data Mining in Python. Riccardo Bellazzi, a professor at University of Pavia, a world-renown researcher in biomedical informatics, and most of all, a great friend, has invited us to run the elective course for Pavia’s grad students. The enrollment was, he says, overwhelming, as with over 50 students this is by far the best attended grad course at Pavia’s faculty of engineering in the past years.

We have opted for the hands-on course and a running it as a workshop. The lectures include a new, development version of Orange 3, and mix it with numpy, scikit-learn, matplotlib, networkx and bunch of other libraries. Course themes are classification, clustering, data projection and network analysis.

pavia-group

pavia-rail

pavia-class

Towards Orange 3

We are rushing, full speed ahead, towards Orange 3. A complete revamp of Orange in Python 3 changes its data model to that of numpy, making Orange compatible with an array of Python-based data analytics. We are rewriting all the widgets for visual programming as well. We have two open fronts: the scripting part, and the widget part. So much to do, but it is going well: the closed tasks for widgets are those on the left of Anze (the board full of sticky notes), and those open, in minority, are on Anze’s right. Oh, by the way, it’s Anze who is managing the work and he looks quite happy.

anze-happy

Loading your data

By a popular demand, we have just published a tutorial on how to load the data table into Orange. Besides its own .tab format, Orange can load any tab or comma delimited data set. The details are though in writing header rows that tell Orange about the type and domain of each attribute. The tutorial is a step-by-step description on how to do this and how to transfer the data from popular spreadsheet programs like Excel.

Hands-on Orange at Functional Genomics Workshop

Last week we have co-organized a Functional Genomics Workshop. At University of Ljubljana we have hosted an inspiring pack of scientists from the Donnelly Centre for Cellular and Biomolecular Research from Toronto. Part of the event was a hands-on workshop Data mining without programing, where we have used Orange to analyze data from systems biology. Data included a subset of Charlie Boone’s famous yeast interaction data and data from chemical genomics. For the program, info about the speakers, and panckages and šmorn check out workshop’s newspaper.

It is always a pleasure seeing a packed lecture room with all laptops running Orange. Attendees were assisted by members of the Biolab in Ljubljana. Hands-on program followed a set of short lectures we have crafted for intended audience – biologists. Everything ran smoothly. At the end, we got excited enough to promise a data import wizard for all those that have problems annotating the data with feature type tags. The deadline: two weeks from the end of the workshop.

fg-orange

Orange Canvas applied to x-ray optics

Orange Canvas is being appropriated by guys who would like to use it as graphical environment for simulating x-ray optics.

Manuel Sanchez del Rio, from The European Synchrotron Facility in Grenoble, France, and Luca Rebuffi from Elettra-Sincrotrone, Trieste, Italy, were looking for a tool that would help them integrate the various tools for x-ray optics simulations, like the popular SHADOW and SRW. They discovered that the data workflow paradigm, like the one used in Orange Canvas, fits their needs perfectly. They took Orange, and replaced the existing widgets with new widgets that represent sources of photons (bending magnets, in the case of ESRF), various optical elements, like lenses and mirrors, and detectors. The channels between the widgets no longer pass data tables, like in the standard Orange, but rays of photons. How cool is this?

The result is a system in which the user can arrange the elements in a system that resembles the actual physical system, and then run the simulations using the most powerful tools available in x-ray optics.

The tool prototype has been presented at the SPIE Optics + Photonic 2014 in San Diego, the largest meeting of its kind.

We’re really excited about this novel use of Orange Canvas.

spie.jpg

Orange and SQL

Orange 3.0 will also support working with data stored in a database.

While we have already talked about this some time ago, we here describe some technical details for anybody interested. This is not a thorough tecnical report, its purpose is only to provide an impression about the architecture of the upcoming version of Orange.

So, data tables in Orange 3.0 can refer to data in the working memory or in the database. Any (properly written) code that uses tables should work the same with both storages. When the data is stored in the database, the table is implemented as a “proxy object” with the necessary meta-data for constructing the SQL query to retrieve the data when needed. Operations on the data only modify the meta-data without retrieving any actual data. For instance, construction of a new table with some selected data subset, say all instances that match a certain condition, creates a new proxy with additional conditions for the WHERE clause. Similarly, selecting a subset of features only changes the domain (the list of features), which is later reflected in the columns of the SELECT clause.

Features in this model are no longer described just with their names but also with the part which goes into the query that retrieves or constructs their values. Discretization, for instance, constructs new features which wrap the representation of the continuous features into a CASE statement that assigns a value based on the boundaries of the bins.

Since the goal was to make the code in modules and widgets oblivious to the storage, we also needed separate implementation of the operations that need to be aware of how the data is stored. For instance, the code that computes the average values of attributes needs to be different for the two storages: for the in-memory data we need to use the corresponding numpy functions and for databases the average is computed on the server.

We went through the code of Orange 2.7 and identified the common operations on the data. We found that all data access belongs into the following types:

  1. basic aggregates like mean, variance, median, minimal and maximal value,
  2. distributions of discrete and continuous variables, values at percentiles,
  3. contingency matrices,
  4. covariance matrices,
  5. filtering of rows based on various criteria, including random sampling,
  6. selection of columns,
  7. construction of variables from values of other variables,
  8. matrices of distances (e.g. Euclidean) between all row pairs,
  9. individual data rows.

Points 1 to 4 are typical examples of what cannot be done on client but can be efficiently done in the database. The storage (a class derived from Table) now provides specialized methods for computing aggregates, distributions and contingencies, which use numpy for in-memory data and SQL for the data on the database.

Points 5 to 7 are implemented “lazily”, by modifying the SQL query describing the data as described above.

Point 8 is difficult to implement efficiently in common relational databases and, besides, results in a data matrix that is larger than the actual data. Methods that require such a matrix will need to be reimplemented and be aware of the storage mechanism.

Point 9 requires some caution with regard to how the data is retrieved and what it is used for. Access to individual rows should be used sparingly. Sequential retrieval – especially of all rows – needs to be avoided. For efficiency, most methods that did so in the previous versions of Orange will need to be reimplemented to use aggregate data (possibly as approximations) or to be aware of the data storage and execute some operations directly through SQL.

We have already ported a number of visualizations and other widgets to the new Orange. Here is one nice example: Mosaic needs to discretize the variables and then compute contingency matrices for discrete variables. Within the above scheme, the widget does not care about the storage mechanism, yet its computation is still as efficient as possible.

mosaic.png

The described activities were funded in part by the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

Workshops at Baylor College of Medicine

On May 22nd and May 23rd, we (Blaz Zupan and Janez Demsar, assisted by Marinka Zitnik and Balaji Santhanam) have given two hands-on workshops called Data Mining without Programming at Baylor College of Medicine in Houston, Texas.

Actually, there was a lot of programming, but no Python or alike. The workshop was designed for biomedical students and Baylor’s faculty members. We have presented a visual programming approach for development of data mining workflows for interactive data exploration. A three-hour workshop consisted of 15 data mining lessons on visual data exploration, classification, clustering, network analysis, and gene expression analytics. Each lesson focused on a particular data analysis task that the attendees solved with Orange.

The two workshops were organized by Baylor’s Computational and Integrative Biomedical Research Center. Over two days, the event was attended by a large audience of 120 attendees.

workshop-a.jpg

workshop-b.jpg

Viewing Images

I am lately having fun with Image Viewer. The widget has been recently updated and can display images stored locally or on the internet. But wait, what images? How on earth can Orange now display images if it can handle mere tabular or basket-based data?

Here’s an example. I have considered a subset of animals from the zoo.tab data set (comes with Orange installation), and for demonstration purposes selected only a handful of attributes. I have added a new string attribute (“images”) and declared that this is a meta attribute of the type “image”. The values of this attribute are links to images on the web:

animals-dataset.png

Here is the resulting data set, zoo-with-images.tab. I have used this data set in a schema with hierarchical clustering, where upon selection of the part of the clustering tree I can display the associated images:

animals-schema.png

Typically and just like above, you would use a string meta attribute to store the link to images. Images can be referred to using a HTTP address, or, if stored locally, using a relative path from the data file location to the image files.

Here is another example, where all the images were local and we have associated them with a famous digits data set (digits.zip is a data set in the Orange format with the image files). The task for this data set is to classify handwritten digits based on their bitmap representation. In the schema below we wanted to find out which are the most frequent errors some classification algorithm would make, and how do the images of the misclassified digits look like. Turns out that SVM with RBF kernel most often misclassify the digit 9 and confuses it with a digit 3:

digits-schema.png

Paint Your Data

One of the widgets I very much enjoy when teaching introductory coursein data mining is Paint Data widget. In the data I would paint in thiswidget I would intentionally include some clusters, or intentionallyobscure them. Or draw them in any strange shape. Then I would discusswith students if these clusters are identified by k-means clustering. Or by hierarchical clustering. We would also discussautomatic scoring of the quality of clusters, come up with the idea ofa silhouette (ok, already invented, but helps if you get this idea onyour own as well). And then we would play with various data sets andclustering techniques and their parameters in Orange.

Like in the following workflow where I drew three clusters that wereindeed recognized by k-means clustering. Notice that silhouettescoring correctly identified even the number of clusters was guessedcorrectly. And I also drawn the clustered data in the Scatterplot tocheck if the clusters are indeed where they should be.

PaintData-k-Means-ok.png

Or like in the workflow below where k-means fails miserably (but someother clustering technique would not).

PaintData-k-Means-notok.png

Paint Data can also be used in supervised setting, for classificationtasks. We can set the intended number of classes, and then chose anyof these to paint its data. Below I have used it to create the datasets to check the behavior of several classifiers.

PaintData-Supervised.png

There are tons of other workflows where Paint Data can be useful. Giveit a try!

Brief History of Orange, Praise to Donald Michie

Informatica has recently published our paper on the history of Orange. The paper is a post-publication from a Conference on 100 Years of Alan Turing and 20 Years of Slovene AI Society, where Janez Demšar gave a talk on the topics.

History of Orange goes all the way back to 1997, when late Donald Michie had an idea that machine learning needs an open toolbox for machine learning. To spark the development, we co-organized WebLab97 at beautiful Bled, Slovenia. Workshop’s name reflected Michie’s idea that tool should be a web application where people can submit data mining code, procedures, testing scripts, and data and share them in the joint web workspace.

Donald Michie, a pioneer of Artificial Intelligence, was always ahead of time. (Check out a great talk by Ivan Bratko on their friendship and adventures in chess and machine learning). At WebLab97, Michie was actually very, very ahead of time. But despite the presence of IBM’s Java team that could guide us in developments of the toolbox, the technology was not ripe and initiative of WebLab was gone as the conference ended. But, at least for us, the idea sparked interest of Janez and myself, and development of what is now Orange begun shortly after.

Our paper gives brief account of Orange’s history and its developments since WebLab97. For reasons of brevity it does not mention that prior to Qt we have experimented with other GUI platforms. Prior to Qt, we laid our hopes to Pwm Python megawidgets, a library that helped us to construct the first Orange graphical user interface. The GUI part of Orange was called Orange*First. Its screenshot shows a tab for interactive discretisation, thanks to Noriaki Aoki who then proposed that this kind of visualisation should be useful in medical data analysis:

orange first

PS Somehow, I have lost a latex file with a WebLab97 program. It should be on some backup tape, somewhere. The following scan of the first page (and a weblab97.pdf), left in some PPT presentation, is all that I can retrieve. The program of the second day is missing, with keynotes from Tom Mitchell, and much talk about then already a success story of R.

WebLab97.jpg