Scatter Plots: the Tour

Scatter plots are surely one of the best-loved visualizations in Orange. Very often, when we teach, people go back to scatter plots over and over again to see their data. We took people’s love for scatter plots to heart and redesigned them a bit to make them even friendlier.

Our favorite still remains the Informative Projections button. This button helps you find interesting visualizations from all the combinations of your data variables. But what does interesting mean? Well, let us look at an example. Which of the two visualizations tells you more about the data?

We’d say it is the right one. Why? Because now we know that the combination of petal length and petal width nicely separates the classes!

Of course, the Informative Projections button will only work when you have set a class (target) variable.

In Scatter Plot, you can also set the color of the data points (class is selected by default), their size and their shape. This means you can add three new layers of information to your plot, but we warn you not to overuse them. A plot that uses all three at once packs in a lot of information, but it is usually very hard to read.

You might notice that in the current version of Orange, you can no longer select discrete attributes in Scatter Plot. This is entirely intentional. Scatter plots are best at showing the relationship between two numeric variables, such as in the two examples above. Categorical variables are much better represented with Box Plot, histograms (in Distributions) or Mosaic Display.

   

Above, we have presented the same information from the titanic data set in different visualizations that are particularly suitable for categorical variables.

Scatter Plot also enables some cool tricks. Just like in most visualizations in Orange, I can select a part of the data and observe the subset downstream. It also works the other way around: if I have a particular subset I wish to observe, I can pass it to the Scatter Plot widget, which will highlight the selected data instances.

This is also true for all other point-based visualizations in Orange, namely t-SNE, MDS, Radviz, Freeviz, and Linear Projection.

You can see there are many great things you can do with Scatter Plot. Finally, we have added a nice touch to the visualization.

Yes, changing the point size is now animated! 🙂

Happy holidays, everyone!

Scatter Plot Projection Rank

One of the nicest and surely most useful visualization widgets in Orange is Scatter Plot. The widget displays a 2-D plot in which the x- and y-axes are two attributes from the data.

2-dimensional scatter plot visualization

Orange 2.7 has a wonderful functionality called VizRank, which is now also implemented in Orange 3. The Rank Projections functionality helps you find interesting attribute pairs by scoring each pair by its average classification accuracy. Click ‘Start Evaluation’ to begin ranking.
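
To make the idea of scoring attribute pairs more concrete, here is a minimal sketch in plain Python. It assumes a k-nearest-neighbours classifier evaluated with leave-one-out testing; the widget's actual scoring procedure may differ. The function names, the column names and the tiny data set are all made up for illustration, and the noise column is deliberately on a larger scale so that it hurts any pair it appears in.

```python
from collections import Counter
from itertools import combinations


def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-NN classifier on 2-D points."""
    correct = 0
    for i, (p, label) in enumerate(zip(points, labels)):
        # Squared Euclidean distance from p to every other point.
        dists = sorted(
            ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2, labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        # Majority vote among the k nearest neighbours.
        votes = Counter(lab for _, lab in dists[:k])
        if votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(points)


def rank_pairs(rows, labels, names, k=3):
    """Score every attribute pair and return them best first."""
    scored = []
    for a, b in combinations(range(len(names)), 2):
        points = [(row[a], row[b]) for row in rows]
        scored.append((knn_loo_accuracy(points, labels, k), names[a], names[b]))
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical toy data: columns "a" and "b" follow the class, "noise" does not.
names = ["a", "b", "noise"]
rows = [
    (1.0, 0.1, 5), (1.2, 0.2, 1), (1.1, 0.1, 5), (1.3, 0.2, 1),
    (3.0, 1.0, 1), (3.2, 1.1, 5), (3.1, 1.0, 1), (3.3, 1.1, 5),
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

ranking = rank_pairs(rows, labels, names)
for score, first, second in ranking:
    print(f"{first} vs {second}: {score:.2f}")
```

On this toy data the pair of "a" and "b" comes out on top, because those two columns together keep the classes apart, which is the same kind of result the widget reports for a real data set.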

Rank Projections before ranking is performed.

Rank Projections will also instantly switch the visualization to the best-scoring pair. Select other pairs from the list to compare visualizations.

Rank Projections once the attribute pairs are scored.

Rank Projections suggested petal length and petal width as the best pair, and indeed, the visualization below is much clearer: the classes are better separated.

Scatter Plot once the visualization is optimized.

Have fun trying out this and other visualization widgets!

Visualizing Misclassifications

In data mining, classification is one of the key methods for making predictions and gaining important information from our data. We would, for example, use classification to predict which patients are likely to have a disease based on a given set of symptoms.

In Orange, an easy way to classify your data is to select several classification widgets (e.g. Naive Bayes, Classification Tree and Logistic Regression), compare the prediction quality of each learner with Test Learners and Confusion Matrix, and then use the best-performing classifier on a new data set. Below we use the Iris data set for simplicity, but the same procedure works just as well on all kinds of data sets.

Here we have three confusion matrices for Naive Bayes (top), Classification Tree (middle) and Logistic Regression (bottom).

 

Three misclassification matrices (Naive Bayes, Classification Tree and Logistic Regression)

We see that Classification Tree did best, with only 9 misclassified instances. To see which instances were assigned a wrong class, we select the ‘Misclassified’ option in the Confusion Matrix widget, which highlights the misclassifications and feeds them to the Scatter Plot widget. In the graph we thus see the entire data set presented with empty dots and the selected misclassifications with full dots.
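
For readers who like to see the logic spelled out, the following is a rough sketch in plain Python of what happens conceptually: count the actual and predicted label pairs, pick out the instances where the two disagree, and mark those as the highlighted subset. This is not how the widgets are implemented; the function names are hypothetical and the labels are made up for illustration, not taken from the Iris results above.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def misclassified_indices(actual, predicted):
    """Return the positions where the prediction differs from the true label."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]


# Made-up labels for illustration only.
actual = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
predicted = ["setosa", "virginica", "virginica", "versicolor", "versicolor"]

matrix = confusion_matrix(actual, predicted)
wrong = misclassified_indices(actual, predicted)

# Full dots for the selected (misclassified) subset, empty dots for the rest.
markers = ["full" if i in wrong else "empty" for i in range(len(actual))]

for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<11} predicted={p:<11} count={count}")
print("misclassified instances:", wrong)
```

The list of markers plays the role of the subset signal sent to Scatter Plot: every instance is still drawn, and only the selected ones are filled in.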

Visualization of misclassified instances in scatter plot.

Feel free to switch between learners in Confusion Matrix to see how the visualization changes for each of them.