Holiday season is upon us and even the Orange team is in a festive mood. This is why we made a Color widget!
This fascinating artsy widget will allow you to play with your data set in a new and exciting way. No more dull visualizations and default color schemes! Set your own colors the way YOU want them! Care for some magical cyan-to-magenta? Or do you prefer a more festive red-to-green? How about several shades of gray? Color widget is your go-to stop for all things color (did you notice it’s our only widget with a colorful icon?). 🙂
Coloring works with most visualization widgets, such as scatter plot, distributions, box plot, mosaic display and linear projection. Set the colors for discrete values and gradients for continuous values in this widget, and the same palettes will be used in all downstream widgets. As a bonus, the Color widget also allows you to edit the names of variables and values.
Feature scoring and ranking can help in understanding the data in supervised settings. Orange includes a number of standard feature scoring procedures one can access in the Rank widget. Moreover, a number of modeling techniques, like linear or logistic regression, can rank features explicitly through the assignment of weights. Trained models like random forests have their own methods for feature scoring. Models inferred by these modeling techniques depend on their parameters, like the type and level of regularization for logistic regression. The same holds for feature weights: any change in the parameters of the modeling technique changes the resulting feature scores.
It would thus be great if we could observe these changes and compare feature ranking provided by various machine learning methods. For this purpose, the Rank widget recently got a new input channel called scorer. We can attach any learner that can provide feature scores to the input of Rank, and then observe the ranking in the Rank table.
Say, for the famous voting data set (File widget, Browse documentation data sets), the last two feature score columns were obtained by random forest and logistic regression with L1 regularization (C=0.1). Try changing the regularization parameter and type to see changes in feature scores.
Feature weights for logistic and linear regression correspond to the absolute value of the coefficients of their linear models. To observe their untransformed values in the table, these widgets now also output a data table with feature weights. (At the time of writing, this feature has been implemented for linear regression; other classifiers and regressors that can estimate feature weights will be updated soon.)
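The idea of turning model coefficients into feature scores can be sketched in plain Python. The snippet below fits ordinary least squares on two features (the tiny data set is made up for illustration, generated as y = 3·x1 + 0.5·x2) and ranks the features by the absolute value of the fitted coefficients:

```python
def fit_linear_2f(X, y):
    # Ordinary least squares for two features (no intercept),
    # solved via the normal equations with an explicit 2x2 inverse.
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    t1 = sum(x[0] * yi for x, yi in zip(X, y))
    t2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    w1 = (s22 * t1 - s12 * t2) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2

X = [(1, 0), (0, 1), (1, 1), (2, 1)]
y = [3.0, 0.5, 3.5, 6.5]              # generated as y = 3*x1 + 0.5*x2
w = fit_linear_2f(X, y)               # recovers (3.0, 0.5)
scores = [abs(c) for c in w]          # feature scores = |coefficients|
ranking = sorted(range(2), key=lambda i: -scores[i])  # x1 ranks first
```

Changing the data or adding regularization would change the coefficients, and with them the ranking, which is exactly the effect the Rank widget lets you observe interactively.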
I’m sure you’d agree that reporting your findings when analyzing the data is crucial. Say you have a couple of interesting predictions that you’ve tested with several methods many times and you’d like to share that with the world. Here’s how.
Save Graph just got company – a Report button! Report works in most widgets, apart from the very obvious ones that simply transmit or display the data (Python Scripting, Edit Domain, Image Viewer, Predictions…).
Why is Report so great?
1. Display data and graphs used in your workflow. Whatever you do with your data will be put in the report at the click of a button.
2. Write comments below each section in your workflow. Put down whatever matters for your research – pitfalls and advantages of a model, why this methodology works, amazing discoveries, etc.
3. Access your workflows. Every step of the analysis recorded in the Report is saved as a workflow and can be accessed by clicking on the Orange icon. Have you spent hours analyzing your data only to find out you made a wrong turn somewhere along the way? No problem. Perhaps you would like to go back and start again from Box Plot? Click on the Orange icon next to Box Plot and you will be taken to the workflow you had when you placed that widget in the report. Completely stress-free!
4. Save your reports. The amazing new report that you just made can be saved as an .html, .pdf or .report file. HTML and PDF are pretty standard, but the report format is probably the best thing since sliced bread. Why? Not only does it save your report for later use, you can also send it to your colleagues and they will be able to access both your report and the workflows used in the analysis.
5. Open report. To open a saved report file, go to File → Open Report. To view the report you’re working on, go to Options → Show report view or press Shift+R.
In one of the previous blog posts we mentioned that installing the optional dependency psycopg2 allows Orange to connect to PostgreSQL databases and work directly on the data stored there.
It is also possible to transfer a whole table to the client machine, keep it in the local memory, and continue working with it as with any other Orange data set loaded from a file. But the true power of this feature lies in the ability of Orange to leave the bulk of the data on the server, delegate some of the computations to the database, and transfer only the needed results. This helps especially when the connection is too slow to transfer all the data and when the data is too big to fit in the memory of the local machine, since SQL databases are much better equipped to work with large quantities of data residing on the disk.
If you want to test this feature, it is now even easier to do so! A third-party distribution called 2UDA provides a single installer for all major OS platforms that combines Orange and a PostgreSQL 9.5 server along with LibreOffice (optional) and installs all the needed dependencies. The database even comes with some sample data sets that can be used to start testing and using Orange out of the box. 2UDA is also a great way to get the very latest version of PostgreSQL, which is important for Orange as it relies heavily on its new TABLESAMPLE clause. It enables time-based sampling of tables, which is used in Orange to get approximate results quickly and allow responsive and interactive work with big data.
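As a rough sketch of what delegating work to the server looks like, the query below computes a group average over a sample of a table with TABLESAMPLE (PostgreSQL 9.5+), so only the small aggregated result crosses the wire. The table and column names are hypothetical, and actually running the function requires a live PostgreSQL server plus the optional psycopg2 dependency:

```python
# Hypothetical table/column names, for illustration only.
QUERY = """
    SELECT species, AVG(petal_length)
    FROM iris TABLESAMPLE SYSTEM (10)  -- sample ~10% of pages on the server
    GROUP BY species
"""

def sampled_averages(dsn):
    # Requires a running PostgreSQL 9.5+ server and the optional
    # psycopg2 dependency (imported lazily so the sketch loads without it).
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        # Only the aggregated result is transferred to the client.
        return cur.fetchall()
```

Both the sampling and the aggregation happen on the server, which is what keeps interactive work responsive even when the table itself is far too big to transfer.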
We hope this will help us reach an even wider audience and introduce Orange to a whole new group of people managing and storing their data in SQL databases. We believe that having lots of data is a great starting point, but the benefits truly kick in with the ability to easily extract useful information from it.
One of the key techniques of exploratory data mining is clustering – separating instances into distinct groups based on some measure of similarity. We can estimate the similarity between two data instances through Euclidean (Pythagorean) distance, Manhattan distance (the sum of absolute differences between coordinates), Mahalanobis distance (distance from the mean in units of standard deviation, taking correlations between variables into account), or, say, through Pearson or Spearman correlation.
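The first two of these measures are simple enough to sketch in a few lines of Python:

```python
import math

def euclidean(a, b):
    # Straight-line (Pythagorean) distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences between coordinates.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```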
Our main goal when clustering data is to get groups of data instances where:
each group (Ci) is a subset of the training data (U): Ci ⊂ U
the intersection of any two distinct groups is empty: Ci ∩ Cj = ∅ for i ≠ j
the union of all groups equals the training data: C1 ∪ C2 ∪ … ∪ Ck = U
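These three conditions say that the clusters form a partition of the data, and they can be checked mechanically. A small Python sketch, using a toy set of instance indices, might look like:

```python
def is_partition(clusters, universe):
    # Each cluster must be a subset of the data.
    if not all(c <= universe for c in clusters):
        return False
    covered = set()
    for c in clusters:
        # Pairwise intersections must be empty.
        if covered & c:
            return False
        covered |= c
    # The union of all clusters must equal the data.
    return covered == universe

U = {0, 1, 2, 3, 4}
print(is_partition([{0, 1}, {2}, {3, 4}], U))   # True
print(is_partition([{0, 1}, {1, 2, 3, 4}], U))  # False: clusters overlap
```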
This would be ideal. But we rarely get data where the separation is so clear. One of the easiest techniques for clustering data is hierarchical clustering. First, we take an instance from, say, a 2D plot. Now we want to find its nearest neighbor. The nearest neighbor of course depends on the measure of distance we choose, but let’s go with Euclidean for now as it is the easiest to visualize.
Euclidean distance between instances a = (a1, …, an) and b = (b1, …, bn) is calculated as:

d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (an − bn)²)
Naturally, the shorter the distance, the more similar the two instances are. In the beginning, every instance is a cluster of its own. We then find the closest pair of instances in the plot and merge them into a cluster. Now we repeat the process: find the two closest clusters (a single instance counts as a cluster of one), merge them, and look for the next closest pair. We repeat this procedure until all the instances are grouped into one single cluster.
We can also write this down in the form of pseudocode:
place every instance in its own cluster
repeat until all instances are in one cluster:
find the two closest clusters
merge them into a single cluster
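The procedure above can be sketched in pure Python. This version uses single linkage (the minimum pairwise distance) to measure how far apart two clusters are, and the points are made up for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points):
    # Single linkage: distance between the closest members of two clusters.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def hierarchical_clustering(points):
    # Every instance starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Repeat until all instances are in one cluster.
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], points),
        )
        merged = clusters[a] + clusters[b]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
merges = hierarchical_clustering(points)
print(merges)  # [[0, 1], [2, 3], [2, 3, 0, 1]] up to ordering within a merge
```

The sequence of merges, together with the distances at which they happen, is exactly the information a dendrogram displays.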
The visualization of this procedure is called a dendrogram, which is what the Hierarchical Clustering widget displays in Orange.
Another thing to consider is how to measure the distance to a cluster that already contains two or more instances. Do we take the closest instance in the cluster or the furthest one?
Picture A shows the distances to the closest instance – single linkage.
Picture B shows the distance to the furthest instance – complete linkage.
Picture C shows the average of all distances in a cluster to the instance – average linkage.
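The three linkage criteria differ only in how they aggregate the pairwise distances between clusters, which makes them easy to sketch side by side. The 1-D distance and the clusters below are made up to keep the arithmetic readable:

```python
def single_linkage(c1, c2, dist):
    # A: distance between the closest pair of instances.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    # B: distance between the furthest pair of instances.
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    # C: mean of all pairwise distances.
    d = [dist(a, b) for a in c1 for b in c2]
    return sum(d) / len(d)

dist = lambda a, b: abs(a - b)  # 1-D distance, for simplicity
c1, c2 = [0, 1], [3, 5]
print(single_linkage(c1, c2, dist))    # 2
print(complete_linkage(c1, c2, dist))  # 5
print(average_linkage(c1, c2, dist))   # 3.5
```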
The downside of single linkage, as intuition suggests, is that it creates elongated, stretched clusters. Instances at the top part of the red C are in fact quite different from those at the lower part of the red C. Complete linkage does much better here, as it keeps the clusters nicely compact. However, the downside of complete linkage is its sensitivity to outliers. Naturally, each approach has its own pros and cons, and it is good to know how they work in order to use them correctly. One extra hint: single linkage works great for image recognition, precisely because it can follow a curve.
There’s a lot more we could say about hierarchical clustering, but to sum it up, let’s state pros and cons of this method:
pros: summarizes the data well, good for small data sets
cons: computationally demanding, does not scale to larger data sets