Orange Canvas applied to x-ray optics

Orange Canvas is being adopted by researchers who would like to use it as a graphical environment for simulating x-ray optics.

Manuel Sanchez del Rio, from the European Synchrotron Radiation Facility (ESRF) in Grenoble, France, and Luca Rebuffi, from Elettra-Sincrotrone in Trieste, Italy, were looking for a tool that would help them integrate the various tools for x-ray optics simulation, such as the popular SHADOW and SRW. They discovered that the data workflow paradigm used in Orange Canvas fits their needs perfectly. They took Orange and replaced the existing widgets with new ones that represent sources of photons (bending magnets, in the case of ESRF), various optical elements, such as lenses and mirrors, and detectors. The channels between the widgets no longer pass data tables, as in standard Orange, but rays of photons. How cool is this?

The result is a system in which the user can arrange the elements in a layout that resembles the actual physical setup, and then run the simulations using the most powerful tools available in x-ray optics.

The tool prototype was presented at SPIE Optics + Photonics 2014 in San Diego, the largest meeting of its kind.

We’re really excited about this novel use of Orange Canvas.

spie.jpg

Orange and SQL

Orange 3.0 will also support working with data stored in a database.

While we already talked about this some time ago, here we describe some technical details for anybody interested. This is not a thorough technical report; its purpose is only to give an impression of the architecture of the upcoming version of Orange.

So, data tables in Orange 3.0 can refer to data in the working memory or in the database. Any (properly written) code that uses tables should work the same with both storages. When the data is stored in the database, the table is implemented as a “proxy object” with the necessary meta-data for constructing the SQL query to retrieve the data when needed. Operations on the data only modify the meta-data without retrieving any actual data. For instance, construction of a new table with some selected data subset, say all instances that match a certain condition, creates a new proxy with additional conditions for the WHERE clause. Similarly, selecting a subset of features only changes the domain (the list of features), which is later reflected in the columns of the SELECT clause.
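To make the proxy idea more tangible, here is a minimal sketch using Python's built-in sqlite3 module. The class LazySqlTable, its methods and the iris table are hypothetical names invented for this illustration; this is not the actual Orange 3.0 code. Filtering and column selection only produce new proxies with changed meta-data, and the database is queried only when rows are requested.

```python
import sqlite3


class LazySqlTable:
    """Hypothetical proxy: keeps query meta-data, fetches rows only on demand."""

    def __init__(self, conn, table, columns, conditions=()):
        self.conn = conn
        self.table = table
        self.columns = tuple(columns)        # plays the role of the domain
        self.conditions = tuple(conditions)  # fragments for the WHERE clause

    def filter(self, condition):
        # New proxy with one more WHERE condition; no data is read.
        return LazySqlTable(self.conn, self.table, self.columns,
                            self.conditions + (condition,))

    def select(self, columns):
        # New proxy with a narrower domain; no data is read.
        return LazySqlTable(self.conn, self.table, columns, self.conditions)

    def query(self):
        sql = "SELECT {} FROM {}".format(", ".join(self.columns), self.table)
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql

    def rows(self):
        # Only here does the query reach the database.
        return self.conn.execute(self.query()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, petal_length REAL, species TEXT)")
conn.executemany("INSERT INTO iris VALUES (?, ?, ?)",
                 [(5.1, 1.4, "setosa"), (7.0, 4.7, "versicolor"), (6.3, 6.0, "virginica")])

table = LazySqlTable(conn, "iris", ["sepal_length", "petal_length", "species"])
subset = table.filter("petal_length > 4.0").select(["sepal_length", "species"])
print(subset.query())
print(subset.rows())
```

The sketch pastes conditions into the query as plain strings, which is fine for an illustration but would need proper escaping or parameter binding in real code.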

Features in this model are no longer described just with their names but also with the part which goes into the query that retrieves or constructs their values. Discretization, for instance, constructs new features which wrap the representation of the continuous features into a CASE statement that assigns a value based on the boundaries of the bins.
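As a rough illustration of such a wrapped feature, the sketch below builds a CASE expression from a list of bin boundaries and evaluates it with sqlite3. The function discretized_expression, the people table and the cut points are assumptions made up for this example, and the expressions Orange generates may look different. The cut points are assumed to be sorted in ascending order, and missing values are not handled.

```python
import sqlite3


def discretized_expression(column, cut_points):
    """Build a CASE expression that maps a numeric column to a bin index."""
    branches = ["WHEN {} < {} THEN {}".format(column, cut, index)
                for index, cut in enumerate(cut_points)]
    return "CASE {} ELSE {} END".format(" ".join(branches), len(cut_points))


expr = discretized_expression("age", [18, 65])
print(expr)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age REAL)")
conn.executemany("INSERT INTO people VALUES (?)", [(5,), (30,), (70,)])

# The discretized feature is computed by the database, not by the client.
bins = [row[0] for row in
        conn.execute("SELECT {} FROM people ORDER BY age".format(expr))]
print(bins)
```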

Since the goal was to make the code in modules and widgets oblivious to the storage, we also needed separate implementations of the operations that need to be aware of how the data is stored. For instance, the code that computes the average values of attributes needs to be different for the two storages: for the in-memory data we need to use the corresponding numpy functions, and for databases the average is computed on the server.

We went through the code of Orange 2.7 and identified the common operations on the data. We found that all data access falls into the following types:

  1. basic aggregates like mean, variance, median, minimal and maximal value,
  2. distributions of discrete and continuous variables, values at percentiles,
  3. contingency matrices,
  4. covariance matrices,
  5. filtering of rows based on various criteria, including random sampling,
  6. selection of columns,
  7. construction of variables from values of other variables,
  8. matrices of distances (e.g. Euclidean) between all row pairs,
  9. individual data rows.

Points 1 to 4 are typical examples of what cannot be done on the client but can be efficiently done in the database. The storage (a class derived from Table) now provides specialized methods for computing aggregates, distributions and contingencies, which use numpy for in-memory data and SQL for the data in the database.
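The following sketch shows what two storages with the same interface might look like. The classes MemoryStorage and SqlStorage and their methods are hypothetical and only mimic the idea; they are not the Orange classes. To keep the example free of dependencies, the in-memory branch uses the Python standard library where Orange would use numpy.

```python
import sqlite3
from collections import Counter
from statistics import fmean


class MemoryStorage:
    """Data held in working memory; aggregates are computed locally."""

    def __init__(self, rows, columns):
        self.rows, self.columns = rows, columns

    def mean(self, column):
        i = self.columns.index(column)
        return fmean(row[i] for row in self.rows)

    def contingency(self, col_a, col_b):
        a, b = self.columns.index(col_a), self.columns.index(col_b)
        return dict(Counter((row[a], row[b]) for row in self.rows))


class SqlStorage:
    """Data held in a database; aggregates are computed on the server."""

    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def mean(self, column):
        sql = "SELECT AVG({}) FROM {}".format(column, self.table)
        return self.conn.execute(sql).fetchone()[0]

    def contingency(self, col_a, col_b):
        sql = "SELECT {0}, {1}, COUNT(*) FROM {2} GROUP BY {0}, {1}".format(
            col_a, col_b, self.table)
        return {(a, b): n for a, b, n in self.conn.execute(sql)}


rows = [("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0), ("a", "x", 6.0)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (f TEXT, g TEXT, value REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

memory = MemoryStorage(rows, ["f", "g", "value"])
sql = SqlStorage(conn, "data")

# Calling code is the same for both storages.
for storage in (memory, sql):
    print(storage.mean("value"), storage.contingency("f", "g"))
```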

Points 5 to 7 are implemented “lazily”, by modifying the SQL query describing the data as described above.

Point 8 is difficult to implement efficiently in common relational databases and, besides, results in a data matrix that is larger than the actual data. Methods that require such a matrix will need to be reimplemented and be aware of the storage mechanism.
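A quick back-of-the-envelope calculation shows how fast the distance matrix outgrows the data. The table size used here (one million rows, 20 columns) is just an assumed example.

```python
def data_values(n_rows, n_columns):
    """Number of values stored in the data table."""
    return n_rows * n_columns


def distance_pairs(n_rows):
    """Number of distinct row pairs (the matrix is symmetric, diagonal is zero)."""
    return n_rows * (n_rows - 1) // 2


n_rows, n_columns = 1_000_000, 20
print(data_values(n_rows, n_columns), "values in the table")
print(distance_pairs(n_rows), "pairwise distances")
```

The table size grows linearly with the number of rows, while the number of distances grows quadratically.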

Point 9 requires some caution with regard to how the data is retrieved and what it is used for. Access to individual rows should be used sparingly. Sequential retrieval – especially of all rows – needs to be avoided. For efficiency, most methods that did so in the previous versions of Orange will need to be reimplemented to use aggregate data (possibly as approximations) or to be aware of the data storage and execute some operations directly through SQL.

We have already ported a number of visualizations and other widgets to the new Orange. Here is one nice example: Mosaic needs to discretize the variables and then compute contingency matrices for discrete variables. Within the above scheme, the widget does not care about the storage mechanism, yet its computation is still as efficient as possible.

mosaic.png

The described activities were funded in part by the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

Workshops at Baylor College of Medicine

On May 22nd and May 23rd, we (Blaz Zupan and Janez Demsar, assisted by Marinka Zitnik and Balaji Santhanam) gave two hands-on workshops called Data Mining without Programming at Baylor College of Medicine in Houston, Texas.

Actually, there was a lot of programming, but no Python or the like. The workshop was designed for biomedical students and Baylor’s faculty members. We presented a visual programming approach to developing data mining workflows for interactive data exploration. Each three-hour workshop consisted of 15 data mining lessons on visual data exploration, classification, clustering, network analysis, and gene expression analytics. Each lesson focused on a particular data analysis task that the attendees solved with Orange.

The two workshops were organized by Baylor’s Computational and Integrative Biomedical Research Center. Over two days, the event was attended by a large audience of 120 attendees.

workshop-a.jpg

workshop-b.jpg

Viewing Images

Lately, I have been having fun with Image Viewer. The widget has recently been updated and can display images stored locally or on the internet. But wait, what images? How on earth can Orange display images if it can handle only tabular or basket-based data?

Here’s an example. I have considered a subset of animals from the zoo.csv data set (it comes with the Orange installation), and for demonstration purposes selected only a handful of attributes. I have added a new string attribute (“images”) and declared that this is a meta attribute of the type “image”. The values of this attribute are links to images on the web:

animals-dataset.png

Here is the resulting data set, zoo-with-images.csv. I have used this data set in a schema with hierarchical clustering, where upon selection of the part of the clustering tree I can display the associated images:

animals-schema.png

Typically, and just like above, you would use a string meta attribute to store the links to images. Images can be referred to using an HTTP address or, if stored locally, using a relative path from the data file location to the image files.
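The rule for locating an image can be sketched in a few lines of Python. The function resolve_image and the example file names are hypothetical; the Image Viewer widget may resolve paths differently.

```python
import os
from urllib.parse import urlparse


def resolve_image(data_file, value):
    """Return a web address unchanged, or a local path resolved against the data file."""
    if urlparse(value).scheme in ("http", "https"):
        return value
    return os.path.normpath(os.path.join(os.path.dirname(data_file), value))


data_file = os.path.join("data", "zoo-with-images.csv")
print(resolve_image(data_file, "http://example.org/lion.jpg"))
print(resolve_image(data_file, os.path.join("images", "lion.jpg")))
```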

Here is another example, where all the images were local and we have associated them with a famous digits data set (digits.zip is a data set in the Orange format with the image files). The task for this data set is to classify handwritten digits based on their bitmap representation. In the schema below we wanted to find out which errors some classification algorithm makes most frequently, and what the images of the misclassified digits look like. It turns out that SVM with an RBF kernel most often misclassifies the digit 9, confusing it with the digit 3:

digits-schema.png

Paint Your Data

One of the widgets I enjoy very much when teaching an introductory course in data mining is the Paint Data widget. When painting in this widget I would intentionally include some clusters, or intentionally obscure them. Or draw them in any strange shape. Then I would discuss with students whether these clusters are identified by k-means clustering or by hierarchical clustering. We would also discuss automatic scoring of the quality of clusters and come up with the idea of a silhouette (ok, already invented, but it helps if you get this idea on your own as well). And then we would play with various data sets and clustering techniques and their parameters in Orange.

Like in the following workflow where I drew three clusters which were indeed recognized by k-means clustering. Notice that silhouette scoring correctly identified even the number of clusters. And I also drew the clustered data in the Scatterplot to check if the clusters are indeed where they should be.
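For those who would like to see the silhouette idea in code, here is a small stand-alone sketch. It is not the implementation used in Orange, and the points and labelings are made up to resemble three painted blobs. For each point, a is the mean distance to the other points of its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b).

```python
from math import dist
from statistics import fmean


def silhouette(points, labels):
    """Mean silhouette score over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(fmean(dist(point, q) for q in other)
                for other_label, other in clusters.items()
                if other_label != label)
        scores.append((b - a) / max(a, b))
    return fmean(scores)


points = [(0, 0), (0, 1), (1, 0),      # first blob
          (10, 0), (10, 1), (11, 0),   # second blob
          (5, 9), (5, 10), (6, 9)]     # third blob
three_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
two_clusters = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # second and third blob merged

score_three = silhouette(points, three_clusters)
score_two = silhouette(points, two_clusters)
print(score_three, score_two)
```

The labeling that matches the three painted blobs gets the higher score, which is how silhouette scoring can suggest the number of clusters.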

PaintData-k-Means-ok.png

Or like in the workflow below where k-means fails miserably (but some other clustering technique would not).

PaintData-k-Means-notok.png

Paint Data can also be used in a supervised setting, for classification tasks. We can set the intended number of classes, and then choose any of these to paint the data. Below I have used it to create the data sets to check the behavior of several classifiers.

PaintData-Supervised.png

There are tons of other workflows where Paint Data can be useful. Give it a try!

Orange and AXLE project

Our group at the University of Ljubljana is a partner in the EU FP7 project Advanced Analytics for Extremely Large European Databases (AXLE). The project is particularly interesting because of the diverse partners that cover the entire vertical, from studying hardware architectures that would better support extremely large databases (University of Manchester, Barcelona Supercomputing Center) to making the necessary adjustments related to speed and security of databases (2ndQuadrant) to data analytics (our group) to handling and analyzing real data and decision making (Portavita).

As a result of the project, Orange will be better connected with databases. Currently, all data is stored in working memory, while the forthcoming Orange 3.0 will be able to handle data that is stored in a database. We are working on a parallel computation architecture. Visualization of large data also presents a big challenge: we cannot transfer large amounts of data from the database to the desktop, and on the other hand it is difficult to provide a rich interactive experience if visualizations are created on the server side. Also, most visualizations are intrinsically unsuitable for large data sets. For instance, the scatter plot represents each data instance with a symbol. Even when each datum is represented with a single pixel, only a few million data points fit on the computer screen. So in the context of big data, we will have to replace scatter plots with heat maps.
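The heat map idea can be sketched with sqlite3: the database assigns the points to grid cells and returns only the counts per cell, so the amount of transferred data depends on the grid resolution and not on the number of rows. The table, the column names and the assumption that both coordinates lie in [0, 1) are made up for this example; this is not the query Orange uses.

```python
import random
import sqlite3

N_POINTS = 10_000
N_BINS = 10  # grid resolution; x and y are assumed to lie in [0, 1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
rng = random.Random(0)
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(rng.random(), rng.random()) for _ in range(N_POINTS)])

# The database groups the points into grid cells and returns only the counts.
sql = """
    SELECT CAST(x * :n AS INTEGER) AS x_bin,
           CAST(y * :n AS INTEGER) AS y_bin,
           COUNT(*)
    FROM points
    GROUP BY x_bin, y_bin
"""
cells = conn.execute(sql, {"n": N_BINS}).fetchall()

# The client only assembles the small grid for drawing.
heatmap = [[0] * N_BINS for _ in range(N_BINS)]
for x_bin, y_bin, count in cells:
    heatmap[y_bin][x_bin] = count

print(len(cells), "cells transferred instead of", N_POINTS, "rows")
```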

What have we got so far? Orange 3, which is in an early stage of development, features a new architecture, which allows the data to be stored either in memory or in a database. In the latter case, selecting a subset of features or filtering the data does not copy the data but only modifies the queries that are used to access the data when needed. Computation of, for instance, distributions or contingency matrices is performed on the server, so only the minimal amount of data is transferred to the client.

We also already have a small suite of widgets that work with this new architecture. Just to whet your appetite, here is the new box plot widget.

BoxPlot-Orange30.png

Network Add-on Published in JSS

NetExplorer, a widget for network exploration, has been part of Orange for over five years. Several network analysis widgets have been added to Orange since, and we decided to move the entire network functionality to an Orange Network add-on.

We recently published a paper, Interactive Network Exploration with Orange, in the Journal of Statistical Software. We invite you to read the tutorial on network exploration. It is aimed at beginners in this topic and includes detailed explanations with images.

NetAddon.png

NetExplorer

Problems With Orange Website

Our servers crashed on Friday, March 1st due to technical problems. The Orange website was offline for several hours and the Mac bundle was inaccessible until today.

We are still checking whether our other services work. If you notice any problems, please ping us.

Stay tuned and fruitful downloading!

New canvas

Orange Canvas, a visual programming environment for Orange, has been around for a while. Integrating more and more features degraded the quality of the code to a point where further development proved to be a daunting task. With an ever-increasing number of widgets, the existing widget toolbar is becoming harder and harder to use, but improving it is really hard. For that reason, we decided Orange needs a new Canvas, a rewrite that would keep all of the features of the existing one but introduce the needed structure and modularity to the source code.

The project started about a year ago, and more than 20 thousand lines of code later, we have something to show you. As of yesterday, the new canvas is merged into the main Orange repository, where it lives alongside the old one. At the moment, it still lacks a lot of testing and some features are not completely implemented, but the main functionality, i.e. visual programming with widgets and links, should work.

New canvas

If you are feeling adventurous, you can try it out yourself. Download the latest version from our website and run:

Windows:

C:\Python27\python.exe -m Orange.OrangeCanvas.main

Mac OS X bundle:

/Applications/Orange.app/Contents/MacOS/python -m Orange.OrangeCanvas.main

or, regardless of your operating system,

python -m Orange.OrangeCanvas.main

with the python that has Orange installed.

What to expect?

Nothing will explode, but short of that, anything might happen. If you stumble upon issues or have helpful suggestions, please post them on our issue tracker. There are some known problems we are aware of; you do not need to report those :).

Orange NMF add-on

Nimfa, a Python library for non-negative matrix factorization (NMF) that was part of the Orange GSoC program back in 2011, now has its own add-on.

Nimfa provides a plethora of initialization and factorization algorithms and quality measures, along with examples on real-world and synthetic data sets. However, until now the analysis was possible only through Python scripting. A recent increase of interest in NMF techniques motivated Fajwel Fogel (a PhD student from INRIA, Paris, SIERRA team) to design and implement several widgets that deal with missing data in target matrices, their normalizations, and viewing and assessing the quality of matrix factors returned by different matrix factorization algorithms. He also provided an implementation of robust singular value decomposition (rSVD). All NMF methods call the Nimfa library.

Target, basis and coefficient matrices.

Above is shown a simple scenario in Orange that applies the LSNMF algorithm from Nimfa to decompose a non-negative target matrix and visualizes its basis matrix (W) and coefficient matrix (H) as heat maps. NMF finds a parts-based representation of the data because only additive, not subtractive, combinations are allowed, which improves the interpretability of the matrix factors. This follows from the non-negativity constraints imposed in the NMF model, in contrast to SVD, PCA and ICA, which provide only holistic representations. The effect can be easily seen if we investigate the heat maps produced by the scenario above. Below are shown the target, basis and coefficient matrices (from left to right, top down), respectively.
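To give a feeling for how the non-negativity constraint works, here is a minimal numpy sketch of NMF. Note that it uses the classic multiplicative update rule for the Frobenius norm, not the LSNMF algorithm mentioned above, and it does not use the Nimfa API. The function nmf, its parameters and the small target matrix are made up for this illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (m x n) into W (m x rank) and H (rank x n)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio,
        # so the entries of W and H can never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# A small target matrix built from two non-overlapping "parts".
V = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 6.0]])

W, H = nmf(V, rank=2)
print(np.round(W, 2))
print(np.round(H, 2))
print(np.round(W @ H, 2))
```

Since the factors are only rescaled and never subtracted from, the target can only be built by adding parts together, which is what makes the basis and coefficient matrices easier to interpret.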

NMF Add-on view in Orange