Tips and tricks for data preparation

Probably the most crucial step in data analysis is cleaning and preparing your data. Here are a few tricks that will make your data preparation a bit easier.

 

1. Use a smart text editor. We can recommend Sublime Text, as it is an extremely versatile editor that supports a broad variety of programming languages and markups, but there are other great tools out there as well. One feature you’ll keep coming back to in your editor is the ‘Replace’ function, which allows you to replace specified values with different ones. You can also use regular expressions (regex) to easily find and replace parts of text.

editing data with Sublime
We can replace all instances of ‘male’ with ‘man’ in one click.
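The same kind of bulk replacement can also be scripted; here is a minimal sketch with Python’s built-in re module, on a made-up CSV fragment:

```python
import re

# A toy CSV fragment with a 'gender' column to normalize (made-up data).
raw = "id,gender\n1,male\n2,female\n3,male"

# \bmale\b matches 'male' only as a whole word, so 'female' is untouched.
cleaned = re.sub(r"\bmale\b", "man", raw)
print(cleaned)
```

The word boundaries (\b) are what keep ‘female’ from being mangled – the same pattern works in Sublime’s regex-enabled Replace.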

 

2. Apply simple hacks. Sometimes when converting files to different formats, data can get some background information appended that you cannot properly see. A quick and dirty trick is to manually select the cells and rows and copy-paste them into a new sheet. This gives you a clean slate, and your data will be read properly.

 

3. Check your settings. When reading .csv files in Excel, you might see all your data squished into one column and literally separated with commas. This is easily solved via Data –> From Text (Get External Data), which opens the Text Import Wizard. There you can set whether your data is delimited (in our case it is), how it is delimited (comma, tab, etc.), whether it has a header row, what qualifies as text (” is a recommended text qualifier), what encoding it uses and so on.
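The same delimiter, quoting and header choices come up when loading a CSV programmatically; a small sketch with Python’s standard csv module (the data is made up):

```python
import csv
import io

# Simulate a comma-delimited file with quoted text fields (made-up data).
data = io.StringIO('name,height\n"Iris setosa",5.1\n"Iris virginica",6.3')

# delimiter and quotechar mirror the choices in the Text Import Wizard.
reader = csv.reader(data, delimiter=",", quotechar='"')
rows = list(reader)
print(rows)
```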

 

4. Manually annotate the data. Orange loves headers, and the easiest way to ensure your data gets read properly is to set the header yourself. Add two extra rows under your feature names: in the first, set each variable’s type, and in the second, its role. Here’s how to do it properly.
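For illustration, a tab-separated file with such a header might look like this (the feature names and values are made up; the second row gives each variable’s type and the third its role, such as class or meta):

```
sepal length	species	note
continuous	discrete	string
	class	meta
5.1	Iris-setosa	first specimen
```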

 

5. Exploit the widgets in Orange. Select Columns is your go-to widget for organizing what gets read as a meta attribute, what your class variable is, and which features you want to use in your analysis. Another great widget is Edit Domain, where you can set how values are displayed in the analysis (say your data contains “grey”, but you want it to say “gray”). Moreover, you can use the Concatenate and Merge widgets to put your data together.

setting the domain
Set domain with Edit domain widget.

 

What’s your preferred trick?

Making Predictions

One of the cool things about being a data scientist is being able to predict. That is, predict before we know the actual outcome. I am not talking about verifying your favorite classification algorithm here, and I am not talking about cross-validation or classification accuracies or AUC or anything like that. I am talking about the good old prediction. This is where our very own Predictions widget comes to help.

predictive analytics
Predictions workflow.

 

We will be exploring the Iris data set again, but with a little twist. Since we’ve worked with it so much already, I’m sure you know all about this data. But now we’ve got three new flowers in the office and, of course, there’s no label attached to tell us what species of Iris these flowers are. [sigh….] Obviously, we will be measuring petals and sepals and comparing the results with our data.

predictive analytics
Our new data on three flowers. We entered the data in Google Sheets, then copied the shareable link and pasted it into the File widget.

 

But surely you don’t want to go through all 150 flowers to properly match the three new Irises? So instead, let’s first train a model on the existing data set. We connect the File widget to the chosen classifier (we went with Classification Tree this time) and feed the results into Predictions. Now we write down the measurements for our new flowers into Google Sheets (just like above), load it into Orange with a new File widget and input the fresh data into Predictions. We can observe the predicted class directly in the widget itself.

predictive analytics
Predictions made by classification tree.

 

In the left part of the visualization we have the input data set (our measurements) and in the right part the predictions made by the classification tree. By default you see probabilities for all three class values and the predicted class. You can of course use other classifiers as well – it would probably make sense to first evaluate the classifiers on the existing data set, find the one that works best for your data, and then use it on the new data.
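The workflow above can be sketched in code as well. This uses scikit-learn rather than Orange’s own widgets, and the three flower measurements are invented for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train a classification tree on the 150 labelled flowers.
iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Three new, unlabelled flowers (sepal length/width, petal length/width).
new_flowers = [[5.1, 3.5, 1.4, 0.2],
               [6.0, 2.7, 4.2, 1.3],
               [6.9, 3.1, 5.9, 2.3]]

# Predicted class and class probabilities, as shown in the Predictions widget.
for label, probs in zip(tree.predict(new_flowers),
                        tree.predict_proba(new_flowers)):
    print(iris.target_names[label], probs)
```

The predicted class and the per-class probabilities correspond to the two parts of the Predictions widget’s display.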

 

Orange YouTube Tutorials

It’s been a long time coming, but we’ve finally created our first set of YouTube tutorials. In the series ‘Getting Started with Orange’ we walk through our software step by step. You will learn how to create a workflow, load your data in different formats, and visualize and explore the data. These tutorials are meant for complete beginners in both Orange and data mining, and come with some handy tricks that will make using Orange very easy. Below are the first three videos from this series; more are coming in the following weeks.

 

 

We are also preparing a series called ‘Data Science with Orange’, which will take you on a journey through the world of data mining and machine learning by explaining predictive modeling, classification, regression, model evaluation and much more.

Feel free to let us know what tutorials you’d like to see and we’ll do our best to include them in one of the two series. :)

Color it!

Holiday season is upon us and even the Orange team is in a festive mood. This is why we made a Color widget!

color1

This fascinating artsy widget will allow you to play with your data set in a new and exciting way. No more dull visualizations and default color schemes! Set your own colors the way YOU want them! Care for some magical cyan-to-magenta? Or do you prefer a more festive red-to-green? How about several shades of gray? The Color widget is your go-to stop for all things color (did you notice it’s our only widget with a colorful icon?). :)

Coloring works with most visualization widgets, such as Scatter Plot, Distributions, Box Plot, Mosaic Display and Linear Projection. Set the colors for discrete values and the gradients for continuous values in this widget, and the same palettes will be used in all downstream widgets. As a bonus, the Color widget also allows you to edit the names of variables and values.

color6

Remember – the (blue) sky is the limit.

Model-Based Feature Scoring

Feature scoring and ranking can help in understanding the data in supervised settings. Orange includes a number of standard feature scoring procedures, accessible through the Rank widget. Moreover, a number of modeling techniques, like linear or logistic regression, rank features explicitly through the assignment of weights, and trained models like random forests have their own methods for feature scoring. Models inferred by these techniques depend on their parameters, like the type and level of regularization for logistic regression. The same holds for the feature weights: any change to the parameters of the modeling technique changes the resulting feature scores.

It would thus be great if we could observe these changes and compare feature ranking provided by various machine learning methods. For this purpose, the Rank widget recently got a new input channel called scorer. We can attach any learner that can provide feature scores to the input of Rank, and then observe the ranking in the Rank table.

model-scoring-lr

Say, for the famous voting data set (File widget, Browse documentation data sets), the last two feature score columns were obtained by random forest and logistic regression with L1 regularization (C=0.1). Try changing the regularization parameter and type to see changes in feature scores.

rank-voting-lr

Feature weights for logistic and linear regression correspond to the absolute values of the coefficients of their linear models. To observe the untransformed values in a table, these widgets now also output a data table with feature weights. (At the time of writing, this feature has been implemented for linear regression; other classifiers and regressors that can estimate feature weights will be updated soon.)
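As a sketch of this idea in code (using scikit-learn instead of Orange, on a synthetic data set, so all names and numbers here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
# Only the first two features actually influence the class.
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)

# L1-regularized logistic regression, as in the example above (C=0.1).
model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# Feature score = absolute value of the model's coefficients.
scores = np.abs(model.coef_).ravel()
ranking = np.argsort(scores)[::-1]
print(scores, ranking)
```

The L1 penalty drives the coefficients of the irrelevant features toward zero, which is exactly what shows up in the Rank table.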

lr-coefficients

Report is back! (and better than ever)

 

I’m sure you’d agree that reporting your findings when analyzing the data is crucial. Say you have a couple of interesting predictions that you’ve tested with several methods many times and you’d like to share that with the world. Here’s how.

Save Graph just got company – a Report button! Report works in most widgets, apart from the very obvious ones that simply transmit or display the data (Python Scripting, Edit Domain, Image Viewer, Predictions…).

 

Why is Report so great?

 

1. Display data and graphs used in your workflow. Whatever you do with your data will be put in the report at the click of a button.

report1

 

2. Write comments below each section in your workflow. Put down whatever matters for your research – pitfalls and advantages of a model, why this methodology works, amazing discoveries, etc.

report2

 

3. Access your workflows. Every step of the analysis recorded in the Report is saved as a workflow and can be accessed by clicking on the Orange icon. Have you spent hours analyzing your data only to find out you made a wrong turn somewhere along the way? No problem. Report saves workflows for each step of the analysis. Perhaps you would like to go back and start again from Box Plot? Click on the Orange icon next to Box Plot and you will be taken to the workflow you had when you placed that widget in the report. Completely stress-free!

report5

 

4. Save your reports. The amazing new report that you just made can be saved as an .html, .pdf or .report file. HTML and PDF are pretty standard, but the report format is probably the best thing since sliced bread. Why? Not only does it save your report file for later use, you can also send it to your colleagues, and they will be able to access both your report and the workflows used in the analysis.

5. Open report. To open a saved report file, go to File → Open Report. To view the report you’re working on, go to Options → Show report view or press Shift+R.

2UDA

In one of the previous blog posts we mentioned that installing the optional dependency psycopg2 allows Orange to connect to PostgreSQL databases and work directly on the data stored there.
It is also possible to transfer a whole table to the client machine, keep it in the local memory, and continue working with it as with any other Orange data set loaded from a file. But the true power of this feature lies in the ability of Orange to leave the bulk of the data on the server, delegate some of the computations to the database, and transfer only the needed results. This helps especially when the connection is too slow to transfer all the data and when the data is too big to fit in the memory of the local machine, since SQL databases are much better equipped to work with large quantities of data residing on the disk.

If you want to test this feature it is now even easier to do so! A third party distribution called 2UDA provides a single installer for all major OS platforms that combines Orange and a PostgreSQL 9.5 server along with LibreOffice (optional) and installs all the needed dependencies. The database even comes with some sample data sets that can be used to start testing and using Orange out of the box. 2UDA is also a great way to get the very latest version of PostgreSQL, which is important for Orange as it relies heavily on its new TABLESAMPLE clause. It enables time-based sampling of tables, which is used in Orange to get approximate results quickly and allow responsive and interactive work with big data.
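For illustration only (the table and column names are invented), a TABLESAMPLE query that touches roughly one percent of a table looks like this:

```sql
-- SYSTEM sampling reads about 1% of the table's pages at random,
-- so the query touches only a fraction of the data on disk.
SELECT sepal_length, species
FROM iris
TABLESAMPLE SYSTEM (1);
```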

We hope this will help us reach an even wider audience and introduce Orange to a whole new group of people managing and storing their data in SQL databases. We believe that having lots of data is a great starting point, but the benefits truly kick in with the ability to easily extract useful information from it.

2UDA

Hierarchical Clustering: A Simple Explanation

One of the key techniques of exploratory data mining is clustering – separating instances into distinct groups based on some measure of similarity. We can estimate the similarity between two data instances through Euclidean (Pythagorean), Manhattan (sum of absolute differences between coordinates) or Mahalanobis distance (distance from the mean, scaled by the standard deviation), or, say, through Pearson or Spearman correlation.

Our main goal when clustering data is to get groups of data instances where:

  • each group (Ci) is a subset of the data (U): Ci ⊂ U
  • the intersection of any two groups is empty: Ci ∩ Cj = ∅ for i ≠ j
  • the union of all groups equals the entire data set: C1 ∪ C2 ∪ … ∪ Ck = U

This would be ideal, but we rarely get data where the separation is so clear. One of the easiest techniques for clustering the data is hierarchical clustering. First, we take an instance from, say, a 2D plot. Now we want to find its nearest neighbor. The nearest neighbor of course depends on the measure of distance we choose, but let’s go with Euclidean for now, as it is the easiest to visualize.

hier-clust-blog-compare1
First steps of hierarchical clustering.

 

Euclidean distance between instances a and b is calculated as: d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (an − bn)²)

Naturally, the shorter the distance, the more similar the two instances are. In the beginning, every instance is its own cluster. Then we find the two closest instances and merge them into a cluster. Now we repeat the process: which instance (or cluster) is closest to our new cluster? Add it to the cluster, then find the next closest one. We repeat this procedure until all the instances are grouped in one single cluster.

We can write this down also in a form of a pseudocode:

every instance starts in its own cluster

repeat until all instances are in one cluster:

    find the two closest clusters (minimum distance between them)

    merge them into a single cluster

hier-clust-blog6

 

A visualization of this procedure is called a dendrogram, which is what the Hierarchical Clustering widget displays in Orange.

Single, complete and average linkage.

 

Another thing to consider is how to measure the distance between clusters once a cluster contains two or more instances. Do we measure to the closest instance in the cluster or to the furthest one?

  • Picture A shows the distance to the closest instance – single linkage.
  • Picture B shows the distance to the furthest instance – complete linkage.
  • Picture C shows the average of all distances to the instances in a cluster – average linkage.
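The whole procedure, including the choice of linkage, is also available in SciPy; a minimal sketch on made-up 2D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two visually obvious groups of 2D points (made-up data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# 'single', 'complete' or 'average' selects the linkage discussed above;
# Euclidean distance is the default metric.
Z = linkage(points, method="single")

# Cut the dendrogram into two clusters; the first three points share one
# label and the last three another.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```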

 

single-vs-complete
Single vs complete linkage.

 

The downside of single linkage is, even intuitively, that it creates elongated, stretched clusters. Instances at the top part of the red C are in fact quite different from those at the lower part. Complete linkage does much better here, as it keeps the clusters nicely compact. However, the downside of complete linkage is that it gives outliers too much weight. Naturally, each approach has its own pros and cons, and it’s good to know how they work in order to use them correctly. One extra hint: single linkage works great for image recognition, exactly because it can follow the curve.

There’s a lot more we could say about hierarchical clustering, but to sum it up, let’s state pros and cons of this method:

  • pros: summarizes the data, good for small data sets
  • cons: computationally demanding, fails on larger data sets

Mining our own data

Recently we made a short survey that, upon Orange download, asked people how they found out about Orange, what their data mining level is and where they work. The main purpose of this was to get a better insight into our user base and to figure out the profile of people interested in trying Orange.

Here we have some preliminary results that we’ve managed to gather in the past three weeks or so. Obviously we will use Orange to help us make sense of the data.

 

We’ve downloaded our data from Typeform and appended some background information such as OS and browser. Let’s see what we’ve got in the Data Table widget.

blog-results7

 

Ok, this is our entire data table, which includes both the people who completed the survey and those who didn’t. First, let’s organize the data properly. We’ll do this with the Select Columns widget.

blog-results

 

We removed all the meta attributes, as they are not very relevant for our analysis. Next, we moved the ‘completed’ attribute to the target variable slot, thus making it our class variable.

blog-results2

 

Now we would like to see some basic distributions from our data.

blog-results3

 

Interesting. Most of our users are working on Windows, a few on Mac and very few on Linux.

Let’s investigate further. Now we want to know more about the people who actually completed the survey. Let’s use Select Columns again, this time removing os_type, os_name, agent_name and completed from our data and keeping just the answers. We made “Where do you work?” our class variable, but we could have used any of the three. Another trick is to set it directly in the Distributions widget under ‘Group by’.
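Outside Orange, the same kind of grouped counts can be sketched with pandas (the column names and answers below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the survey answers (made-up values).
df = pd.DataFrame({
    "where_found": ["web", "college", "web", "college", "web"],
    "work": ["private", "student", "academia", "student", "private"],
})

# Counts of one answer grouped by the other, like Distributions' 'Group by'.
table = pd.crosstab(df["where_found"], df["work"])
print(table)
```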

blog-results4

 

Ok, let’s again use Distributions – this is such a simple way to get a good sense of your data.

blog-results5

 

Obviously, out of those who found out about Orange in college, most are students – but what’s interesting is just how many there are. We can also see that of those who found us on the web, most come from the private sector, followed by academia and research. Good. How about the other question?

blog-results6

 

Again, results are not particularly shocking, but it’s great to confirm your hypothesis with real data. Out of beginner level data miners, most are students, while most intermediate users come from the industry.

A quick look at the Mosaic Display will give us a good overview:

blog-results8

 

Yup, this sums it up quite nicely. We have lots of beginner-level users and not many expert ones (height of the boxes). Also, most people found out about Orange on the web or in college (width of the boxes). The thin line on the left of each box shows the a priori distribution, making it easier to compare the expected and actual numbers of instances. For example, there should be at least some students who found out about Orange at a conference. But there aren’t – the contrast between how much red there should be in the box (the line on the left) and how much there actually is (the box itself) is quite telling. We could even select all the beginner-level users who found out about Orange in college and inspect that subset further, but let that be enough for now.

Our final workflow:

 

blog-results12

 

Obviously, this is a very simple analysis. But even such simple tasks are never boring with good visualization tools such as Distributions and Mosaic Display. You could also use Venn Diagram to find common features of selected subsets or perhaps Sieve Diagram for probabilities.

 

We are very happy to get these data and we would like to thank everyone who completed the survey. If you wish to help us further, please fill out a longer survey that won’t actually take you more than 3 minutes of your time (we timed it!).

 

Happy Friday everyone!

Ghostbusters

Ok, we’ve just recently stumbled across an interesting article on how to deal with non-normal (not Gaussian-distributed) data.
paranormal1

We have an absolutely paranormal data set of 20 persons with weight, height, paleness, vengefulness, habitation and age attributes (download).

paranormal2

Let’s check the distributions in the Distributions widget.

paranormal3

Our first attribute is “Weight” and we see a little hump on the left. Otherwise the data would be normally distributed. Ok, so perhaps we have a few children in the data set. Let’s check the age distribution.
paranormal4

Whoa, what? Why is the hump now on the right? These distributions look scary. We seem to have a few reaaaaally old people here. What is going on? Perhaps we can figure this out with MDS, a widget that projects the data into two dimensions so that the distances between points correspond to the differences between data instances.
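The same kind of projection can be sketched with scikit-learn’s MDS; the data below is synthetic, standing in for the paranormal data set:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
# 17 "normal" people and 3 very different outliers, in four dimensions.
normals = rng.normal(loc=0.0, scale=1.0, size=(17, 4))
ghosts = rng.normal(loc=8.0, scale=1.0, size=(3, 4))
X = np.vstack([normals, ghosts])

# Project to 2D so that pairwise distances are preserved as well as possible;
# the three outliers end up far from the main cloud.
coords = MDS(n_components=2, random_state=0).fit_transform(X)
print(coords.shape)
```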

paranormal5

Aha! Now we see that three instances are quite different from all others. Select them and send them to the Data Table for final inspection.

paranormal6

Busted! We have found three ghosts hiding in our data. They are extremely light (the sheet they are wearing must weigh around 2 kg), quite vengeful and old.

Now, joking aside, what would this mean for non-normally distributed data in general? One possibility is that your data set is too small. Here we have only 20 instances, so 3 outlying ghosts have a great impact on the distribution. It is difficult to hide 3 ghosts among 17 normal persons.

Secondly, why can’t we use the Outliers widget to hunt for those ghosts? Again, our data set is too small. With just 20 instances, the estimation variance is so large that it can easily cover a few ghosts under its sheet. We don’t have enough “normal” data to define what is normal, and thus to detect the paranormal.

Haven’t we just written two exactly opposite things? Perhaps.

Happy Halloween everybody! :)

spooky-orange