Unfreezing Orange

Have you ever tried Orange with data big enough that some widgets ran for more than a second? Then you have seen it: Orange froze. While the widget was processing, the interface would not respond to any inputs, and there was no way to stop that widget.

Not all widgets freeze, though! Some widgets, like Test & Score, k-Means, or Image Embedding, do not block. While they are working, we are free to build other parts of the workflow, and these widgets also show their progress. Some, like Image Embedding, which works with lots of images, even allow interruptions.

Why does Orange freeze? Most widgets process users’ actions directly: after an event (a click, a pressed key, new input data), some code starts running, and until it finishes, the interface cannot respond to any new events. This is a reasonable approach for short tasks, such as making a selection in a Scatter Plot. But with longer tasks, such as building a support vector machine on big data, Orange becomes unresponsive.

To keep Orange responsive while it is processing, we need to start the task in a new thread. As programmers, we have to consider the following:
1. Starting the task. We have to make sure that other (older) tasks are not running.
2. Showing results when the task has finished.
3. Periodic communication between the task and the interface for status reports (progress bars) and task stopping.

Starting the task and showing the results are straightforward and well documented in a tutorial for writing widgets. Periodic communication with stopping is harder: it is completely task-dependent and can be trivial, hard, or even impossible. Periodic communication is, in principle, not essential for responsiveness, but if we do not implement it, we cannot stop the running task and the progress bars do not work either.
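
The three points above can be sketched with nothing but the Python standard library. This is a minimal illustration, not Orange's actual implementation: TaskRunner, Cancelled, and slow_sum are hypothetical names, and in a Qt application such as Orange the progress callback would additionally have to be forwarded to the GUI thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Cancelled(Exception):
    """Raised inside the worker when the interface asked it to stop."""


class TaskRunner:
    """Hypothetical helper that runs at most one background task at a time."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._stop = None
        self._future = None

    def start(self, func, on_progress):
        # 1. Starting: ask any older task to stop before submitting a new one.
        self.cancel()
        stop = self._stop = threading.Event()

        def report(percent):
            # 3. Periodic communication: the task calls this regularly. It
            # updates the progress bar and gives the task a chance to stop.
            if stop.is_set():
                raise Cancelled()
            on_progress(percent)

        self._future = self._executor.submit(func, report)
        return self._future

    def cancel(self):
        if self._stop is not None:
            self._stop.set()
        if self._future is not None:
            self._future.cancel()  # only has an effect if not yet running


def slow_sum(numbers, report):
    """Toy task that reports its progress after every item."""
    total = 0
    for i, x in enumerate(numbers, start=1):
        total += x
        report(100 * i // len(numbers))
    return total


runner = TaskRunner()
progress = []
future = runner.start(lambda report: slow_sum([1, 2, 3, 4], report),
                      progress.append)
# 2. Showing results: a widget would use a completion callback instead of
# blocking on result(), which is done here only to keep the demo short.
print(future.result(), progress)  # 10 [25, 50, 75, 100]
```

Note that the task can only be stopped at the points where it calls report. A task that never calls it stays responsive but cannot be interrupted, which is exactly why this part is task-dependent.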

Taking care of periodic communication was the hardest part of making the Neural Network widget responsive. It would have been easy had we implemented neural networks ourselves. But we use the scikit-learn implementation, which does not expose an option to make additional function calls while fitting the network (we need to run code that communicates with the interface). We had to resort to a trick: we modified fitting so that every change to an attribute called n_iter_ calls a function (see pull request). It is not the cleanest solution, but it seems to work.
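
The idea behind the trick can be illustrated with a property. The sketch below is not the code from the pull request: FakeEstimator stands in for a third-party model whose fitting loop we cannot edit, and IterationCallbackMixin, ObservableEstimator, and the callback attribute are hypothetical names. The counter is called n_iter_ here, following scikit-learn's naming for its iteration counter.

```python
class FakeEstimator:
    """Stand-in for a third-party model whose fit() loop we cannot edit."""

    def fit(self, n_epochs):
        self.n_iter_ = 0
        for _ in range(n_epochs):
            # ... one epoch of training would happen here ...
            self.n_iter_ += 1
        return self


class IterationCallbackMixin:
    """Turns n_iter_ into a property so every assignment calls a function."""

    callback = None  # hypothetical hook, set by the widget before fitting

    @property
    def n_iter_(self):
        return self._n_iter

    @n_iter_.setter
    def n_iter_(self, value):
        self._n_iter = value
        if self.callback is not None:
            self.callback(value)


class ObservableEstimator(IterationCallbackMixin, FakeEstimator):
    """The mixin comes first so its property intercepts the assignments."""


seen = []
model = ObservableEstimator()
model.callback = seen.append
model.fit(3)
print(seen)  # [0, 1, 2, 3]
```

Because the callback runs inside the fitting loop, it can report progress, and it can also raise an exception to abort the fit when the user asks to stop.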

For now, only a few widgets keep the interface responsive while they work. We are still searching for the best way to make existing widgets behave nicely, but responsiveness is now one of our priorities.

Parallel Orange?

We attended a NIPS 2011 workshop on processing and learning from large-scale data. Various presenters showed different tools and frameworks that can be used when developing algorithms for large-scale data, but none of them were written in Python, so they were not useful for Orange. We have been looking for a framework that would help us run code in parallel for some time, but so far with no luck.

We would like a framework that is easy to use, can be used from C as well as from Python, and supports multi-level map-reduce (cross-validation can be viewed as one map-reduce, and the random forest being tested inside it as another). The prototypes we have created so far solve this problem by inspecting the learners used in cross-validation and creating all the “subtasks” at the same time. That results in really ugly code that we don’t want to commit ;). If you know a framework that would suit our needs, want to implement support for parallel computation yourself (we will apply to GSoC), or have an idea how to solve this problem, feel free to contact us ;).
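
To make the two levels concrete, here is a toy sketch of the flattening approach, written for illustration only and not taken from our prototypes. The function names and the scoring are made up, and a thread pool stands in for whatever workers a real framework would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_test_tree(fold, tree):
    """Hypothetical leaf task: pretend to fit one tree on one fold."""
    return 10 * fold + tree  # made-up score, deterministic for the demo


def cross_validate(n_folds, n_trees, executor):
    # Flatten both levels into one list of subtasks for a single pool.
    keys = list(product(range(n_folds), range(n_trees)))
    scores = list(executor.map(lambda key: train_and_test_tree(*key), keys))

    # Inner reduce: combine the trees of each fold into one forest score.
    per_fold = [0] * n_folds
    for (fold, _tree), score in zip(keys, scores):
        per_fold[fold] += score

    # Outer reduce: average the forest scores over the folds.
    return sum(per_fold) / n_folds


with ThreadPoolExecutor(max_workers=4) as pool:
    print(cross_validate(n_folds=3, n_trees=4, executor=pool))  # 46.0
```

The sketch also shows where the ugliness comes from: cross_validate has to know that the learner is a forest made of trees, so the two levels can no longer be written independently of each other.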