Data Preparation for Machine Learning

We’ve said it numerous times and we’re going to say it again: data preparation is crucial for any data analysis. If your data is messy, there’s no way you can make sense of it, let alone a computer. Computers are great at handling large, even enormous, data sets, at speedy computing and at recognizing patterns. But they fail miserably if you give them the wrong input. Also, some classification methods work better with binary values and others with continuous ones, so it is important to know how to treat your data properly.

Orange is well equipped for such tasks.

 

Widget no. 1: Preprocess

Preprocess is there to handle a big share of your preprocessing tasks.

 

Original data.

 

  • It can normalize numerical data variables. Say we have a fictional data set of the people employed at your company. We want to know which employees are more likely to go on holiday, based on their yearly income, years employed at your company and total years of experience in the industry. If you plot this in a heat map, you will see a bold yellow line at ‘yearly income’. This happens because yearly income has much higher values than years of experience or years employed. You naturally don’t want the wage to outweigh the rest of the feature set, so normalization is the way to go. Normalization transforms your values into relative terms, that is (depending on the type of normalization), onto a scale from 0 to 1. Now Heat Map neatly shows that people who’ve been employed longer and have a higher wage go on holiday more often. (Yes, this is a totally fictional data set, but you see the point.)

heatmap1
  no normalization

heatmap2
   normalized data

 

  • It can impute missing values. Average or most-frequent missing value imputation might seem overly simple, but it actually works most of the time. Also, all the learners that require imputation do it implicitly, so the user doesn’t have to configure yet another widget for that.
  • If you want to compare your results against a randomly mixed data set, select ‘Randomize’. And if you want to select relevant features, this is the widget for that, too.
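To make the first two options more concrete, here is a minimal sketch of what normalization and mean imputation do under the hood, written in plain Python with invented column values (this is illustrative, not Orange’s internal code):

```python
def min_max(column):
    """Rescale numeric values onto the 0-1 range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def impute_mean(column):
    """Replace missing values (None) with the column mean."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

# Fictional columns from the holiday example above.
yearly_income = [42000, 55000, 61000, 48000]
years_employed = [2, None, 10, 6]

print(min_max(yearly_income))    # income now lives on the same 0-1 scale
print(impute_mean(years_employed))
```

After min-max normalization, ‘yearly income’ no longer dwarfs the other features, which is exactly why the heat map stops showing that bold yellow line.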

Preprocessing needs to be used with caution and understanding of your data to avoid losing important information or, worse, overfitting the model. A good example is a case of paramedics, who usually don’t record pulse if it is normal. Missing values here thus cannot be imputed by an average value or random number, but as a distinct value (normal pulse). Domain knowledge is always crucial for data preparation.

 

Widget no. 2: Discretize

For certain tasks you might want to resort to binning, which is what Discretize does. It distributes your continuous values into a selected number of bins, making the variable effectively discrete. You can either discretize all your data variables at once with a selected discretization method, or pick a particular method for each attribute. The cool thing is that the transformation is displayed right in the widget, so you instantly know what you’re getting in the end. A good example of discretization is a data set of your customers with their age recorded. It would make little sense to segment customers by each particular age, so binning them into 4 age groups (young, young-adult, middle-aged, senior) is a great solution. Some visualizations also require feature transformation – Sieve Diagram is currently one such widget. Mosaic Display, however, has the transformation already implemented internally.

 

discretize1
original data

 

discretize2
Discretized data with ‘years employed’ split into lower than 8 and higher than or equal to 8 (same for ‘yearly income’ and ‘experience in the industry’).
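Equal-width binning, one of the discretization methods on offer (Orange also supports equal-frequency and entropy-based discretization), can be sketched in a few lines of plain Python. The ages and bin labels below follow the customer example and are invented for illustration:

```python
def equal_width_bins(values, n_bins, labels):
    """Assign each value to one of n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max into the last bin
        binned.append(labels[i])
    return binned

ages = [19, 24, 35, 47, 52, 68, 71, 80]
labels = ["young", "young-adult", "middle-aged", "senior"]
print(equal_width_bins(ages, 4, labels))
```

The widget does the interval bookkeeping for you and shows the resulting cut points directly in its interface.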

 

Widget no. 3: Continuize

This widget essentially creates new attributes out of your discrete ones. If you have, for example, an attribute with people’s eye color, where values can be blue, brown or green, you would probably want three separate attributes ‘blue’, ‘green’ and ‘brown’, each set to 0 or 1 depending on whether a person has that eye color. Some learners perform much better if the data is transformed in this way. You can also treat one value as the baseline and record only deviations from it (‘target or first value as base’), or use the most common value as the baseline (‘most frequent value as base’). The Continuize widget offers you a lot of room to play. The best approach is to take a small data set with discrete values, connect it to Continuize and then further to a Data Table, and change the parameters – this way you can observe the transformations in real time. It is also useful for projecting discrete data points in Linear Projection.

 

continuize1
Original data.

 

continuize2
Continuized data with two new columns – attribute ‘position’ was replaced by attributes ‘position=office worker’ and ‘position=technical staff’ (same for ‘gender’).
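The eye-color transformation described above is commonly called one-hot encoding: one 0/1 indicator column per discrete value. A bare-bones sketch in plain Python (illustrative only, not Orange’s API):

```python
def one_hot(column, values):
    """Turn one discrete column into len(values) binary columns."""
    return [{v: int(v == item) for v in values} for item in column]

eye_color = ["blue", "brown", "green", "brown"]
print(one_hot(eye_color, ["blue", "brown", "green"]))
```

Each row now carries three numbers instead of one label, which is the form many learners (and Linear Projection) prefer.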

 

Widget no. 4: Purge Domain

Get a broom and sort your data! That’s what Purge Domain does. If an attribute has the same (constant) value across all data instances, it will remove it. If you have unused (empty) attributes in your data, it will remove them too. Effectively, you get a clean and compact data set in the end.

purge1
Original data.

 

purge2
Empty columns and columns with the same (constant) value were removed.
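The column clean-up can be sketched like this in plain Python – drop any column whose values are all missing or all identical (column names and data are invented; this mirrors the idea, not Orange’s implementation):

```python
def purge_columns(table):
    """table: dict mapping column name -> list of values; drop useless columns."""
    kept = {}
    for name, col in table.items():
        non_missing = [v for v in col if v is not None]
        if not non_missing:                 # empty column: nothing recorded
            continue
        if len(set(non_missing)) == 1:      # constant column: carries no information
            continue
        kept[name] = col
    return kept

data = {
    "gender": ["F", "M", "F"],
    "country": ["SI", "SI", "SI"],   # constant -> removed
    "notes": [None, None, None],     # empty -> removed
}
print(list(purge_columns(data)))     # only the informative columns stay
```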

 

Of course, don’t forget to include all these procedures into your report with the ‘Report’ button! 🙂

Tips and Tricks for Data Preparation

Probably the most crucial step in your data analysis is purging and cleaning your data. Here are a couple of cool tricks that will make your data preparation a bit easier.

 

1. Use a smart text editor. We can recommend Sublime Text, as it is an extremely versatile editor that supports a broad variety of programming languages and markups, but there are other great tools out there as well. One of the features you’ll keep coming back to in your editor is the ‘Replace’ function, which allows you to replace specified values with different ones. You can also use regex to easily find and replace parts of text.

editing data with Sublime
We can replace all instances of ‘male’ with ‘man’ in one click.
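The same find-and-replace works from Python’s standard library too. Note one regex subtlety with exactly this example: without word boundaries (`\b`), replacing ‘male’ would also mangle ‘female’ into ‘feman’:

```python
import re

text = "male,female,male,female"
# \b word boundaries make sure only the whole word "male" matches.
print(re.sub(r"\bmale\b", "man", text))  # female is left untouched
```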

 

2. Apply simple hacks. Sometimes when converting files to different formats, the data can pick up some background information that you cannot properly see. A quick and dirty trick is to manually select the cells and rows and copy-paste them into a new sheet. This starts you off with a clean slate, and your data will be read properly.

 

3. Check your settings. When reading .csv files in Excel, you might see all your data squished into one column, literally separated with commas. This is easily solved via Data –> From Text (Get External Data), which opens the Text Import Wizard. There you can set whether your data is delimited (in our case it is), how it is delimited (comma, tab, etc.), whether you have a header, what qualifies as text (the double quote " is the recommended option), what your encoding is, and so on.
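If you prefer to sanity-check a delimited file in code rather than in Excel, Python’s standard csv module takes the same settings – delimiter and quote character – as explicit parameters. The sample data below is invented:

```python
import csv
import io

# A tiny comma-delimited file with a quoted header field.
raw = 'name,"yearly income"\nAnn,42000\nBob,55000\n'
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
print(rows)  # the quoted field is read as one column, not split on its comma
```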

 

4. Manually annotate the data. Orange loves headers, and the easiest way to ensure your data gets read properly is to set the header yourself. Add two extra rows under your feature names: in the first row, set each variable’s type, and in the second, its kind (e.g. class or meta). Here’s how to do it properly.
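A small tab-delimited example of such a three-row header (the column names are assumed from the fictional holiday data set earlier in this post): the first row names the attributes, the second gives each variable’s type, and the third marks its kind – here ‘position’ is flagged as the class variable, while the feature columns leave the kind row empty.

```
years employed	yearly income	position
continuous	continuous	discrete
		class
```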

 

5. Exploit the widgets in Orange. Select Columns is your go-to widget for organizing what gets read as a meta attribute, what your class variable is, and which features you want to use in your analysis. Another great widget is Edit Domain, where you can set how the values are displayed in the analysis (say you have “grey” in your data, but you want it to say “gray”). Moreover, you can use the Concatenate and Merge widgets to put your data together.

setting the domain
Set domain with Edit domain widget.

 

What’s your preferred trick?

Creating a new data table in Orange through Python

IMPORT DATA

 

One of the first tasks in Orange data analysis is of course loading your data. If you are using Orange through Python, this is as easy as riding a bike:

import Orange
data = Orange.data.Table("iris")
print(data)

This will return a neat data table of the famous Iris data set in the console.

 

CREATE YOUR OWN DATA TABLE

 

What if you want to create your own data table from scratch? Even this is surprisingly simple. First, import the Orange data library.

from Orange.data import *

 

Set all the attributes you wish to see in your data table. For discrete attributes call DiscreteVariable and set the name and the possible values, while for a continuous variable call ContinuousVariable and set only the attribute name.

color = DiscreteVariable("color", values=["orange", "green", "yellow"])
calories = ContinuousVariable("calories")
fiber = ContinuousVariable("fiber")
fruit = DiscreteVariable("fruit", values=["orange", "apple", "peach"])

 

Then set the domain for your data table. See how we set class variable with class_vars?

domain = Domain([color, calories, fiber], class_vars=fruit)

 

Time to input your data!

data = Table(domain, [
    ["green", 4, 1.2, "apple"],
    ["orange", 5, 1.1, "orange"],
    ["yellow", 4, 1.0, "peach"]])

 

And now print what you have created!

print(data)

 

One final step:

data.save("fruit.tab")

 

Your data is safely stored on your computer (in the current working directory)! Good job!

Loading your data

By popular demand, we have just published a tutorial on how to load a data table into Orange. Besides its own .tab format, Orange can load any tab- or comma-delimited data set. The tricky part, though, is writing the header rows that tell Orange about the type and domain of each attribute. The tutorial is a step-by-step description of how to do this and how to transfer data from popular spreadsheet programs like Excel.

Paint Your Data

One of the widgets I enjoy very much when teaching an introductory course in data mining is the Paint Data widget. When painting in this widget, I would intentionally include some clusters, or intentionally obscure them, or draw them in some strange shape. Then I would discuss with students whether these clusters are identified by k-means clustering or by hierarchical clustering. We would also discuss automatic scoring of cluster quality and come up with the idea of a silhouette (ok, already invented, but it helps if you arrive at this idea on your own as well). And then we would play with various data sets, clustering techniques and their parameters in Orange.

Like in the following workflow, where I drew three clusters which were indeed recognized by k-means clustering. Notice that silhouette scoring even identified the correct number of clusters. I also plotted the clustered data in a Scatter Plot to check that the clusters are indeed where they should be.

PaintData-k-Means-ok.png

Or like in the workflow below, where k-means fails miserably (but some other clustering technique would not).

PaintData-k-Means-notok.png

Paint Data can also be used in a supervised setting, for classification tasks. We can set the intended number of classes, and then choose any of them to paint the data. Below I have used it to create data sets to check the behavior of several classifiers.

PaintData-Supervised.png

There are tons of other workflows where Paint Data can be useful. Give it a try!