Tips and Tricks for Data Preparation

Probably the most crucial step in your data analysis is purging and cleaning your data. Here are a couple of cool tricks that will make your data preparation a bit easier.

 

1. Use a smart text editor. We can recommend Sublime Text as it an extremely versatile editor that supports a broad variety of programming languages and markups, but there are other great tools out there as well. One of the best things you’ll keep coming back to in your editor is ‘Replace’ function that allows you to replace specified values with different ones. You can also use regex to easily find and replace parts of text.

editing data with Sublime
We can replace all instances of ‘male’ with ‘man’ in one click.

 

2. Apply simple hacks. Sometimes when converting files to different formats data can get some background information appended that you cannot properly see. A cheap and dirty trick is to manually select the cells and rows and copy-paste them to a new sheet. This will start with a clean slate and you data will be read properly.

 

3. Check your settings. When reading .csv files in Excel, you might see all your data squished in one column and literally separated with commas. This can be easily solved with Data –> From Text (Get external data) and a new window will appear. In a Text Import Wizard you can set whether your data is delimited or not (in our case it is), how it is delimited (comma, tab, etc.), whether you have a header or not, what qualifies as text (” is a recommended option), what is your encoding and so on.

 

4. Manually annotate the data. Orange loves headers and the easiest way to assure your data gets read properly is to set the header yourself. Add two extra rows under your feature names. In the first row, set your variable type and in the second one, your kind. Here’s how to do it properly.

 

5. Exploit the widgets in Orange. Select Columns is your go-to widget for organizing what gets read as a meta attribute, what is your class variable and which features you want to use in your analysis. Another great widget is Edit domain, where you can set the way the values are displayed in the analysis (say you have “grey” in your data, but you want it to say “gray”). Moreover, you can use Concatenate and Merge widgets to put your data together.

setting the domain
Set domain with Edit domain widget.

 

What’s your preferred trick?

Making Predictions

One of the cool things about being a data scientist is being able to predict. That is, predict before we know the actual outcome. I am not talking about verifying your favorite classification algorithm here, and I am not talking about cross-validation or classification accuracies or AUC or anything like that. I am talking about the good old prediction. This is where our very own Predictions widget comes to help.

predictive analytics
Predictions workflow.

 

We will be exploring the Iris data set again, but we’re going to add a little twist to it. Since we’ve worked so much with it already, I’m sure you know all about this data. But now we got three new flowers in the office and of course there’s no label attached to tell us what species of Iris these flowers are. [sigh….] Obviously, we will be measuring petals and sepals and contrasting the results with our data.

predictive analytics
Our new data on three flowers. We have used Google Sheets to enter the data and the copied the sharable link and pasted the link to the File widget.

 

But surely you don’t want to go through all 150 flowers to properly match the three new Irises? So instead, let’s first train a model on the existing data set. We connect the File widget to the chosen classifier (we went with Classification Tree this time) and feed the results into Predictions. Now we write down the measurements for our new flowers into Google Sheets (just like above), load it into Orange with a new File widget and input the fresh data into Predictions. We can observe the predicted class directly in the widget itself.

predictive analytics
Predictions made by classification tree.

 

In the left part of the visualization we have the input data set (our measurements) and in the right part the predictions made with classification tree. By default you see probabilities for all three class values and the predicted class. You can of course use other classifiers as well – it would probably make sense to first evaluate classifiers on the existing data set, find the best one for your and then use it on the new data.

 

Orange YouTube Tutorials

It’s been a long time coming, but finally we’ve created out our first set of YouTube tutorials. In a series ‘Getting Started with Orange’ we will walk through our software step-by-step. You will learn how to create a workflow, load your data in different formats, visualize and explore the data. These tutorials are meant for complete beginners in both Orange and data mining and come with some handy tricks that will make using Orange very easy. Below are the first three videos from this series, more are coming in the following weeks.

 

 

We are also preparing a series called ‘Data Science with Orange’, which will take you on a journey through the world of data mining and machine learning by explaining predictive modeling, classification, regression, model evaluation and much more.

Feel free to let us know what tutorials you’d like to see and we’ll do our best to include it in one of the two series. 🙂