Data Mining and Machine Learning for Economists

Last week Blaž, Marko and I held a week long introductory Data Mining and Machine Learning course at the Ljubljana Doctoral Summer School 2018. We got a room full of dedicated students and we embarked on a journey through standard and advanced machine learning techniques, all presented of course in Orange. We have covered a wide array of topics, from different clustering techniques (hierarchical clustering, k-means) to predictive models (logistic regression, naive Bayes, decision trees, random forests), regression and regularization, projections, text mining and image analytics.

Related: Data Mining for Business and Public Administration

Definitely the biggest crowd-pleaser was the Geo add-on in combination with the HDI data set. First, we got the HDI data from Datasets. A quick glimpse into a data table to check the output. We have information on some key performance indicators gathered by the United Nations for 188 countries. Now we would like to know which countries are similar based on the reported indicators. We will use Distances with Euclidean distance and use Ward linkage in Hierarchical Clustering.

 

In Datasets widget we have selected the HDI data set.

 

The HDI data set contains information on 188 countries, which are described with 66 features. The data set can be used for regression, but we will perform clustering to discover countries, similar by the proposed parameters.

 

We got our results in a dendrogram. Interestingly, the United States seems similar to Cuba. Let us select this cluster and inspect what the most significant feature for this cluster. We will use the Data output of Hierarchical Clustering which append a column indicating whether the data instances was selected or not. Then we will use Box Plot, group by Selected and check Order by relevance. It seems like these countries have the longest life expectancy at age 59. Go ahead and inspect other clusters by yourself!

Select an interesting cluster in Hierarchical Clustering.

 

And inspect the results in a box plot. Seems like the selected cluster stands out from the other countries by high life expectancy.

 

Of course, when we are talking about countries one naturally wants to see them on a map! That is easy. We will use the Geo add-on. First, we need to convert all the country names to geographical coordinates. We will do this with Geocoding, where we will encode column Country to latitude and longitude. Remember to use the same output as before, that is Data to Data.

Use Encode to convert a column with region identifiers (in our case Country) to latitude/longitude pairs.

 

Now, let us display these countries on a map with Choropleth widget. Beautiful. It is so easy to explore country data, when you see it on a map. You can try coloring also by HDI or any other feature.

Choropleth shows us which countries were in the selected cluster (red). We used Selected as attribute and colored by Mode.

 

The final workflow:

We always try to keep our workshops fresh and interesting and visualizations are the best way to achieve this. Till the next workshop!

 

 

 

 

 

 

Diving Into Car Registration Data

Last week, we presented Orange at the Festival of Open Data, a mini-conference organized by the Slovenian government, dedicated to the promotion of transparent access to government data. In a 10 minute presentation, we showed how Orange can be used to visualize and explore what kinds of vehicles were registered for the first time in Slovenia in 2017.

The original dataset is available at the OPSI portal and it consists of 73 files, one for each month since January 2012. For the presentation, we focused on the 2017 data. If you want to follow along, you can download the merged dataset (first 9 months of 2017 as a single file). The workflow I used to prepare the data is also available.

When exploring the data, the first thing we do is take a look at distributions. If we observe the distribution of new and used cars bought by the gender of the buyer, we can see that men prefer used cars while women more often opt for a new car. Or we can observe the distribution by age to see that older people tend to buy newer cars.

But the true power of Orange can be seen if we visualize the data on a map. In order to do this, we need to first use Geocoding to map municipality names to regions which can be shown on a map by choosing the column that contains municipality name (C1.3-Obcina uporabnika) and clicking apply. Since municipalities in Slovenia are created all the time, not all of them can be matched. The right part of the widget allows us to map these small municipalities to the nearest region. Or we can just ignore them.

The geocoded data can be displayed with Choropleth. If we select attribute D.1-Znamka and aggregation by mode, we get a visualization showing the most frequently bought mode for each region. Care to guess which manufacturer corresponds to the pink(-ish) color? It’s Volkswagen, in some regions with Golf and in other regions with Passat. But the visualization gives us just the most frequent value for each municipality. What if we would like to know more? As is the case with all visualizations you can click on a specific region on a map to select it and get the corresponding data on the output. We can then use Purge Domain to ignore the models that were not sold in the selected region and Box Plot to visualize the distribution by the model or by the manufacturer.

In Box Plot, select D.1 Znamka as both the variable and Subgroup and you get an overview of the distribution of cars by manufacturers in the selected region. But that is just the first step. We can also take a look at the distribution of Fiat cars by adding another boxplot. Now you can select the manufacturer and get a detailed distribution of specific car models sold. If you take some care in positioning the windows, you can create an interactive explorer, where you click on regions and instantly see the detailed distributions in the connected boxplots.

The final workflow should look like this:

 

Can We Download Orange Faster?

One day Blaž and Janez came to us and started complaining how slow Orange download is in the US. Since they hold a large course at Baylor College of Medicine every year, this causes some frustration.

Related: Introduction to Data Mining Course in Houston

But we have the data and we’ve promptly tried to confirm their complaints by analyzing them… well, in Orange!

First, let us observe the data. We have 4887 recorded download sessions with one meta feature reporting on the country of the download and four features with time, size, speed in bytes and speed in gigabytes of the download.

Data of Orange download statistics. We get reports on the country of download, the size and the time of the download. We have constructed speed and size in gigabytes ourselves with simple formulae.

 

Now let us check the validity of Blaž’s and Janez’s complaint. We will use orange3-geo add-on for plotting geolocated data. For any geoplotting, we need coordinates – latitude and longitude. To retrieve them automatically, we will use Geocoding widget.

We instruct Geocoding to retrieve coordinates from our Country feature. Identifier type tells the widget in what format the region name appears.

 

We told the widget to use the ISO-compliant country code from Country attribute and encode it into coordinates. If we check the new data in a Data Table, we see our data is enhanced with new features.

Enhanced data table. Besides latitude and longitude, Geocoding can also append country-level data (economy, continent, region…).

 

Now that we have coordinates, we can plot these data regionally – in Choropleth widget! This widget plots data on three levels – country, state/region and county/municipality. Levels correspond to the administrative division of each country.

Choropleth widget offers 3 aggregation levels. We chose country (e.g. administrative level 0), but with a more detailed data one could also plot by state/county/municipality. Administrative levels are different for each country (e.g. Bundesländer for Germany, states for the US, provinces for Canada…).

 

In the plot above, we have simply displayed the amount of people (Count) that downloaded Orange in the past couple of months. Seems like we indeed have most users in the US, so it might make sense to solve installation issues for this region first.

Now let us check the speed of the download – it is really so slow in the US? If we take the mean, we can see that Slovenia is far ahead of the rest as far as download speed is concerned. No wonder – we are downloading via the local network. Scandinavia, Central Europe and a part of the Balkans seem to do quite ok as well.

Aggregation by mean.

 

But mean sometimes doesn’t show the right picture – it is sensitive to outliers, which would be the case of Slovenia here. Let us try median instead. Looks like 50% of American download at speed lower than 1.5MB/s. Quite average, but it could be better.

Aggregation by median.

 

And the longest time someone was prepared to wait for the download? Over 3 hours. Kudos, mate! We appreciate it! 🙌

This simple workflow is all it took to do our analysis.

 

So how is your download speed for Orange compared to other things you are downloading? Better, worse? We’re keen to hear it! 👂