Diving Into Car Registration Data

Last week, we presented Orange at the Festival of Open Data, a mini-conference organized by the Slovenian government, dedicated to the promotion of transparent access to government data. In a 10-minute presentation, we showed how Orange can be used to visualize and explore what kinds of vehicles were registered for the first time in Slovenia in 2017.

The original dataset is available at the OPSI portal and it consists of 73 files, one for each month since January 2012. For the presentation, we focused on the 2017 data. If you want to follow along, you can download the merged dataset (first 9 months of 2017 as a single file). The workflow I used to prepare the data is also available.

When exploring the data, the first thing we do is take a look at distributions. If we observe the distribution of new and used cars bought by the gender of the buyer, we can see that men prefer used cars while women more often opt for a new car. Or we can observe the distribution by age to see that older people tend to buy newer cars.

But the true power of Orange can be seen if we visualize the data on a map. To do this, we first use Geocoding to map municipality names to regions that can be shown on a map: we choose the column that contains the municipality name (C1.3-Obcina uporabnika) and click Apply. Since new municipalities are created in Slovenia all the time, not all of them can be matched. The right part of the widget allows us to map these small municipalities to the nearest region. Or we can just ignore them.

The geocoded data can be displayed with Choropleth. If we select attribute D.1-Znamka and aggregation by mode, we get a visualization showing the most frequently bought model for each region. Care to guess which manufacturer corresponds to the pink(-ish) color? It’s Volkswagen, in some regions with Golf and in other regions with Passat. But the visualization gives us just the most frequent value for each municipality. What if we would like to know more? As is the case with all visualizations, you can click on a specific region on the map to select it and get the corresponding data on the output. We can then use Purge Domain to ignore the models that were not sold in the selected region and Box Plot to visualize the distribution by the model or by the manufacturer.
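Aggregation by mode simply means picking the most frequent value within each region. Here is a minimal sketch in plain Python, assuming a hypothetical handful of (municipality, model) records rather than the real OPSI data; the function name `mode_by_region` is made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: (municipality, car model). Not the real registration data.
records = [
    ("Ljubljana", "Golf"),
    ("Ljubljana", "Golf"),
    ("Ljubljana", "Clio"),
    ("Maribor", "Passat"),
    ("Maribor", "Passat"),
    ("Maribor", "Golf"),
]


def mode_by_region(rows):
    """Return the most frequent value for each region."""
    counts = defaultdict(Counter)
    for region, value in rows:
        counts[region][value] += 1
    # most_common(1) yields a list holding the single (value, count) pair
    return {region: c.most_common(1)[0][0] for region, c in counts.items()}


most_frequent = mode_by_region(records)
print(most_frequent)
```

Like the map, this keeps only one value per region, which is why selecting a region and inspecting the full distribution is the natural next step.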

In Box Plot, select D.1-Znamka as both the variable and Subgroup and you get an overview of the distribution of cars by manufacturer in the selected region. But that is just the first step. We can also take a look at the distribution of Fiat cars by adding another Box Plot. Now you can select the manufacturer and get a detailed distribution of specific car models sold. If you take some care in positioning the windows, you can create an interactive explorer, where you click on regions and instantly see the detailed distributions in the connected box plots.

The final workflow should look like this:

 

Understanding Voting Patterns at AKOS Workshop

Two days ago we held another Introduction to Data Mining workshop at our faculty. This time the target audience was a group of public sector professionals and our challenge was finding the right data set to explain key data mining concepts. Iris is fun, but not everyone is a biologist, right? Fortunately, we found this really nice data set with ballot counts from the Slovenian National Assembly (thanks to Parlameter).

Related: Intro to Data Mining for Life Scientists

Workshop for the Agency for Communication Networks and Services (AKOS).

 

The data contains ballot counts, statistics, and description for 84 members of the parliament (MPs). First, we inspected the data in a Data Table. Each MP is described with 14 meta features and has 18 ballot counts recorded.

Our data has 84 instances, 18 features (ballot counts) and 14 meta features (MP description).

 

We have some numerical features, which means we can also inspect the data in Scatter Plot. We will plot MPs’ attendance vs. the number of their initiatives. Quite interesting! There is a big group of MPs who regularly attend the sessions, but rarely propose changes. Could this be the coalition?

Scatter plot of MPs’ session attendance (in percentage) and the number of initiatives. Already an interesting pattern emerges.

 

The next question that springs to our mind is – can we discover interesting voting patterns from our data? Let us see. We first explored the data in Hierarchical Clustering. Looks like there are some nice clusters in our data! The blue cluster is the coalition, red the SDS party and green the rest (both from the opposition).

Related: Hierarchical Clustering: A Simple Explanation

Hierarchical Clustering visualizes a hierarchy of clusters. But it is hard to observe similarity of pairs of data instances. How similar are Luka Mesec and Branko Grims? It is hard to tell…

 

But it is hard to inspect so many data instances in a dendrogram. For example, we have no idea how similar the voting records of Eva Irgl and Alenka Bratušek are. Surely, there must be a better way to explore similarities and perhaps verify that voting patterns exist even at the party level… Let us try MDS. MDS transforms multidimensional data into a 2D projection so that similar data instances lie close to each other.

MDS can plot multidimensional data in 2D so that similar data points lie close to each other. But sometimes this optimization is hard. This is why we have grey lines connecting the dots – the dots connected are similar at the selected cut-off level (Show similar pairs slider).
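MDS starts from pairwise distances between data instances. As a rough sketch of that input, assuming hypothetical MPs and a made-up encoding of votes (1 for, -1 against, 0 absent), the distances could be computed as below. Euclidean distance is just one possible choice here, not necessarily the one used in the workshop.

```python
from itertools import combinations
from math import sqrt

# Hypothetical voting records: 1 = for, -1 = against, 0 = absent.
votes = {
    "MP A": [1, 1, -1, 1],
    "MP B": [1, 1, -1, -1],
    "MP C": [-1, -1, 1, 0],
}


def euclidean(u, v):
    """Euclidean distance between two equally long vote vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance for every unordered pair of MPs.
distances = {
    (p, q): euclidean(votes[p], votes[q])
    for p, q in combinations(votes, 2)
}

# Print the pairs from most similar to least similar.
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
```

Note that an absence encoded as 0 shifts an MP away from colleagues who voted, which is one way an MP who missed sessions can end up as an outlier.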

 

Ah, this is nice! We even colored data points by the party. MDS beautifully shows the coalition (blue dots) and the opposition (all other colors). Even parties are clustered together. But there are some outliers. Let us inspect Matej Tonin, who is quite far away from his orange group. Seems like he was missing at the last two sessions and did not vote. Hence his voting is treated differently.

Data Table is a handy tool for instant data inspection. It is always great to check what is on the output of each widget.

 

It is always great to inspect discovered groups and outliers. This way an expert can interpret the clusters and also explain what the outliers mean. Sometimes it is simply a matter of data (missing values), but sometimes we could find shifting alliances. Perhaps an outlier could be an MP about to switch to another party.

The final workflow.

 

You can have fun with these data, too. Let us know if you discover something interesting!

 

Orange at Station Houston

With over 262 member companies, Station Houston is the largest hub for tech startups in Houston.

One of its members is also Genialis, a life science data exploration company that emerged from our lab and is now delivering pipelines and user-friendly apps for analytics in systems biology.

Thanks to the invitation by the director of operations Alex de la Fuente, we gave a seminar on Data Science for Everyone. We spoke about how Orange can support anyone to learn about data science and then use machine learning on their own data.

We pushed on this last point: say you walk in downtown Houston, pick the first three passersby, take them to the workshop and train them in machine learning, to the point where they could walk out of the training and use some machine learning at home. Say, cluster their family photos, or figure out what Kickstarter project features to optimize to get the funding.

How long would such a workshop take? Our informed guess: three hours. And of course, we illustrated this point to seminar attendees by giving a demo of the clustering of images in Orange and showcasing Kickstarter data analysis.

Related: Image Analytics: Clustering

Seminars at Station Houston need to finish with homework. So we delivered some. Here it is:

  1. Open your browser.
  2. Find some images of your interest (mountains, cities, cars, fish, dogs, faces, whatever).
  3. Place images in a folder (Mac: just drag the thumbnails, Win: right click and Save Image).
  4. Download & install Orange. From Orange, install Image Analytics add-on (Options, Add-Ons).
  5. Use Orange to cluster images. Does clustering make sense?

Data science and startups aside: there are some beautiful views from Station Houston. From the kitchen, there is a straight sight to Houston’s medical center looming about 4 miles away.

And on the other side, there is a great view of the downtown.

Can We Download Orange Faster?

One day Blaž and Janez came to us and started complaining about how slow the Orange download is in the US. Since they hold a large course at Baylor College of Medicine every year, this causes some frustration.

Related: Introduction to Data Mining Course in Houston

But we have the data and we’ve promptly tried to confirm their complaints by analyzing them… well, in Orange!

First, let us observe the data. We have 4887 recorded download sessions with one meta feature reporting on the country of the download and four features with time, size, speed in bytes and speed in gigabytes of the download.

Data of Orange download statistics. We get reports on the country of download, the size and the time of the download. We have constructed speed and size in gigabytes ourselves with simple formulae.

 

Now let us check the validity of Blaž’s and Janez’s complaint. We will use the orange3-geo add-on for plotting geolocated data. For any geoplotting, we need coordinates – latitude and longitude. To retrieve them automatically, we will use the Geocoding widget.

We instruct Geocoding to retrieve coordinates from our Country feature. Identifier type tells the widget in what format the region name appears.

 

We told the widget to use the ISO-compliant country code from Country attribute and encode it into coordinates. If we check the new data in a Data Table, we see our data is enhanced with new features.

Enhanced data table. Besides latitude and longitude, Geocoding can also append country-level data (economy, continent, region…).

 

Now that we have coordinates, we can plot these data regionally – in the Choropleth widget! This widget plots data on three levels – country, state/region and county/municipality. Levels correspond to the administrative division of each country.

The Choropleth widget offers 3 aggregation levels. We chose country (i.e. administrative level 0), but with more detailed data one could also plot by state/county/municipality. Administrative levels are different for each country (e.g. Bundesländer for Germany, states for the US, provinces for Canada…).

 

In the plot above, we have simply displayed the number of people (Count) that downloaded Orange in the past couple of months. Seems like we indeed have most users in the US, so it might make sense to solve installation issues for this region first.

Now let us check the speed of the download – is it really so slow in the US? If we take the mean, we can see that Slovenia is far ahead of the rest as far as download speed is concerned. No wonder – we are downloading via the local network. Scandinavia, Central Europe and a part of the Balkans seem to do quite ok as well.

Aggregation by mean.

 

But the mean sometimes doesn’t show the right picture – it is sensitive to outliers, which would be the case of Slovenia here. Let us try the median instead. Looks like 50% of American downloads run at speeds lower than 1.5 MB/s. Quite average, but it could be better.
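To see why the median is the safer aggregate here, consider a small sketch with hypothetical download speeds (not our actual statistics), where one download ran over a fast local network:

```python
from statistics import mean, median

# Hypothetical download speeds in MB/s; the last value is a local-network outlier.
speeds = [0.8, 1.1, 1.3, 1.5, 1.7, 40.0]

mean_speed = mean(speeds)
median_speed = median(speeds)

# A single outlier drags the mean far above the typical value,
# while the median stays in the middle of the bulk of the data.
print("mean:  ", round(mean_speed, 2))
print("median:", round(median_speed, 2))
```

Half of the values lie below the median by definition, which is exactly the reading we used for the American downloads above.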

Aggregation by median.

 

And the longest time someone was prepared to wait for the download? Over 3 hours. Kudos, mate! We appreciate it! 🙌

This simple workflow is all it took to do our analysis.

 

So how is your download speed for Orange compared to other things you are downloading? Better, worse? We’re keen to hear it! 👂

It’s Sailing Time (Again)

Every fall I teach a course on Introduction to Data Mining. And while the course is really on statistical learning and its applications, I also venture into classification trees. For several reasons. First, I can introduce information gain and with it feature scoring and ranking. Second, classification trees are one of the first machine learning approaches co-invented by engineers (Ross Quinlan) and statisticians (Leo Breiman, Jerome Friedman, Charles J. Stone, Richard A. Olshen). And finally, because they make the base of random forests, one of the most accurate machine learning models for smaller and mid-size data sets.

Related: Introduction to Data Mining Course in Houston

A lecture on classification trees has to start with the data. Years back I crafted a data set on sailing. Every data set has to have a story. Here is one:

Sara likes weekend sailing. Though, not under any condition. Past
twenty Wednesdays I have asked her if she will have any company, what
kind of boat she can rent, and I have checked the weather
forecast. Then, on Saturday, I wrote down if she actually went to the Sea.

Data on Sara’s sailing contains three attributes (Outlook, Company, Sailboat) and a class (Sail).

The data comes with Orange and you can get it from the Data Sets widget (currently in the Prototypes add-on, but soon to be moved to core Orange). It takes time, usually two lecture hours, to go through probabilities, entropy and information gain, but at the end, the data analysis workflow we develop with students looks something like this:

And here is the classification tree:

Turns out that Sara is a social person. When the company is big, she goes sailing no matter what. When the company is smaller, she would not go sailing if the weather is bad. But when it is sunny, sailing is fun, even when being alone.
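The splits in such a tree are chosen by information gain, the concept mentioned above. Below is a minimal sketch of entropy and information gain in plain Python. The eight rows are hypothetical, written in the spirit of Sara's story; they are not the actual sailing data set that ships with Orange.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Class entropy minus the weighted class entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical rows: (Company, Outlook, Sail).
raw = [
    ("big", "sunny", "yes"), ("big", "rainy", "yes"), ("big", "rainy", "yes"),
    ("small", "sunny", "yes"), ("small", "rainy", "no"),
    ("none", "sunny", "yes"), ("none", "rainy", "no"), ("none", "rainy", "no"),
]
rows = [dict(zip(("Company", "Outlook", "Sail"), r)) for r in raw]

# Score and rank the attributes by information gain.
gains = {a: information_gain(rows, a, "Sail") for a in ("Company", "Outlook")}
for attribute, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{attribute}: {gain:.3f}")
```

The attribute with the highest gain would become the root of the tree, and the same scoring is then repeated within each branch.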

Related: Pythagorean Trees and Forests

Classification trees are not very stable classifiers. Even with small changes in the data, the trees can change substantially. This is an important concept that leads to the use of ensembles like random forests. It is also here, during my lecture, that I need to demonstrate this instability. I use Data Sampler and show a classification tree under the current sampling. Each time I press the Sample Data button, the tree changes. The workflow I use is below, but if you really want to see this in action, well, try it in Orange.

Text Analysis Workshop at Digital Humanities 2017

How do you explain text mining in 3 hours? Is it even possible? Can someone be ready to build predictive models and perform clustering in a single afternoon?

It seems so, especially when Orange is involved.

Yesterday, on August 7, we held a 3-hour workshop on text mining and text analysis for a large crowd of esteemed researchers at Digital Humanities 2017 in Montreal, Canada. Surely, after 3 hours everyone was exhausted, both the audience and the lecturers. But at the same time, everyone was also excited. The audience about the possibilities Orange offers for their future projects and the lecturers about the fantastic participants who even during the workshop were already experimenting with their own data.

The biggest challenge was presenting the inner workings of algorithms to a predominantly non-computer science crowd. Luckily, we had Tree Viewer and Nomogram to help us explain Classification Tree and Logistic Regression! Everything is much easier with visualizations.

 

Classification Tree splits first by the word ‘came’, since it results in the purest split. Next it splits by ‘strange’. Since we still don’t have pure nodes, it continues to ‘bench’, which gives a satisfying result. Trees are easy to explain, but can quickly overfit the data.

 

Logistic Regression transforms word counts to points. The sum of points directly corresponds to class probability. Here, if you see 29 foxes in a text, you get a high probability of Animal Tales. If you don’t see any, then you get a high probability of the opposite class.
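The idea of summing points and turning the sum into a probability can be sketched as follows. The coefficients and the intercept are hypothetical, chosen only for illustration; they do not come from the model trained at the workshop, and a nomogram additionally rescales the sum onto its own point scale.

```python
from math import exp


def probability(word_counts, weights, intercept):
    """Logistic model: sum the weighted counts and squash the sum into (0, 1)."""
    points = intercept + sum(
        weights.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + exp(-points))


# Hypothetical coefficients for the class "Animal Tales".
weights = {"fox": 0.3, "king": -0.4}
intercept = -1.0

p_many_foxes = probability({"fox": 29}, weights, intercept)
p_no_foxes = probability({"fox": 0}, weights, intercept)
print(round(p_many_foxes, 3), round(p_no_foxes, 3))
```

Each word contributes independently to the sum, which is what makes the model easy to read off a nomogram.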

 

At the end, we were experimenting with explorative data analysis, where we had Hierarchical Clustering, Corpus Viewer, Image Viewer and Geo Map opened at the same time. This is how a researcher can interactively explore the dendrogram, read the documents from selected clusters, observe the corresponding images and locate them on a map.

Hierarchical Clustering, Image Viewer, Geo Map and Corpus Viewer opened at the same time create an interactive data browser.

 

The workshop was a nice kick-off to an exciting week full of interesting lectures and presentations at the Digital Humanities 2017 conference. So much to learn and see!

 

 

Text Analysis: New Features

As always, we’ve been working hard to bring you new functionalities and improvements. Recently, we’ve released Orange version 3.4.5 and Orange3-Text version 0.2.5. We focused on the Text add-on since we are lately holding a lot of text mining workshops. The next one will be at Digital Humanities 2017 in Montreal, QC, Canada in a couple of days and we simply could not resist introducing some sexy new features.

Related: Text Preprocessing

Related: Rehaul of Text Mining Add-On

First, Orange 3.4.5 offers better support for the Text add-on. What do we mean by this? Now, every core Orange widget works with Text smoothly so you can mix-and-match the widgets as you like. Before, one could not pass the output of Select Columns (data table) to Preprocess Text (corpus), but now this is no longer a problem.

Of course, one still needs to keep in mind that Corpus is a sparse data format, which does not work with some widgets by design. For example, Manifold Learning supports only t-SNE projection.

 

Second, we’ve introduced two new widgets, which have been long overdue. One is Sentiment Analysis, which enables basic sentiment analysis of corpora. So far it works for English and uses two nltk-supported techniques – Liu Hu and Vader. Both techniques are lexicon-based. Liu Hu computes a single normalized score of sentiment in the text (negative score for negative sentiment, positive for positive, 0 is neutral), while Vader outputs scores for each category (positive, negative, neutral) and appends a total sentiment score called a compound.
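The idea behind a lexicon-based score can be sketched in a few lines. The tiny word lists and the normalization by text length below are assumptions made for illustration; they are not the actual Liu Hu lexicon, nor necessarily the exact formula used by the widget.

```python
import re

# Tiny hypothetical lexicons; real ones contain thousands of words.
POSITIVE = {"good", "happy", "beautiful", "kind"}
NEGATIVE = {"bad", "wicked", "cruel", "sad"}


def lexicon_score(text):
    """Return (positive minus negative word count) divided by the number of words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


print(lexicon_score("The kind princess was happy"))        # positive text
print(lexicon_score("The wicked wolf was cruel and sad"))  # negative text
print(lexicon_score("The fox crossed the river"))          # neutral text
```

A score like this yields a single number per document, whereas a Vader-style output would report several scores side by side.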

Liu Hu score.
Vader scores.

 

Try it with Heat Map to visualize the scores.

Yellow represents a high, positive score, while blue represents a low, negative score. Seems like Animal Tales are generally much more negative than Tales of Magic.

 

The second widget we’ve introduced is Import Documents. This widget enables you to import your own documents into Orange and outputs a corpus on which you can perform the analysis. The widget supports .txt, .docx, .odt, .pdf and .xml files and loads an entire folder. If the folder contains subfolders, they will be considered as class values. Here’s an example.

This is the structure of my Kennedy folder. I will load the folder with Import Documents. Observe how Orange creates a class variable category with post-1962 and pre-1962 as class values.

Subfolders are considered as class in the category column.
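The folder-to-class convention is easy to mimic outside Orange. The sketch below is not the widget's implementation: `load_folder` is a hypothetical helper that reads only .txt files, and the file names and contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def load_folder(root):
    """Read every .txt file under root; the first-level subfolder becomes the category."""
    root = Path(root)
    documents = []
    for path in sorted(root.rglob("*.txt")):
        relative = path.relative_to(root)
        # Files directly in root have no subfolder, hence no category.
        category = relative.parts[0] if len(relative.parts) > 1 else None
        documents.append({
            "name": path.stem,
            "category": category,
            "text": path.read_text(encoding="utf-8"),
        })
    return documents


# Build a small throwaway folder structure to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("pre-1962", "speech-a"), ("post-1962", "speech-b")]:
        target = Path(tmp) / folder
        target.mkdir()
        (target / f"{name}.txt").write_text("Some text.", encoding="utf-8")
    corpus = load_folder(tmp)

for doc in corpus:
    print(doc["category"], doc["name"])
```

Keeping documents of one class in one subfolder means no separate label file is needed.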

 

Now you can perform your analysis as usual.

 

Finally, some widgets have cool new updates. Topic Modelling, for example, colors words by their weights – positive weights are colored green and negative red. Coloring only works with LSI, since it’s the only method that outputs both positive and negative weights.

If there are many kings in the text and no birds, then the text belongs to Topic 2. If there are many children and no foxes, then it belongs to Topic 3.

 

Take some time, explore these improvements and let us know if you are happy with the changes! You can also submit new feature requests to our issue tracker.

 

Thank you for working with Orange! 🍊

Support Orange Developers

Do you love Orange? Do you think it is the best thing since sliced bread? Want to thank all the developers for their hard work?

Nothing says thank you like a fresh supply of ice cream and now you can help us stock our fridge with your generous donations. 🍦🍦🍦



Support open source software and the team behind Orange. We promise to squander all your contributions purely on ice cream. Can’t have a development sprint without proper refreshments! 😉

Thank you in advance for all the contributions, encouragement and support! It wouldn’t be worth it without you.

🍊Orange team🍊

Miniconda Installer

Orange has a new friend! It’s Miniconda, Anaconda’s little sister.

 

For a long time, the idea was to utilize the friendly nature of Miniconda to install Orange dependencies, which often misbehaved on some platforms. Miniconda provides Orange with Python 3.6 and conda installer, which is then used to handle everything Orange needs for proper functioning. So sssssss-mooth!

Miniconda Installer

Please know that our Miniconda installer is in a beta state, but we are inviting adventurous testers to try it and report any bugs they find to our issue tracker [there won’t be any of course! 😉 ].

 

Happy testing! 🐍|🍊

 

 

Text Preprocessing

In data mining, preprocessing is key. And in text mining, it is the key and the door. In other words, it’s the most vital step in the analysis.

Related: Text Mining add-on

So what does preprocessing do? Let’s have a look at an example. Place the Corpus widget from the Text add-on on the canvas. Open it and load Grimm-tales-selected. As always, first have a quick glance at the data in Corpus Viewer. This data set contains 44 selected Grimms’ tales.

Now, let us see the most frequent words of this corpus in a Word Cloud.

Ugh, what a mess! The most frequent words in these texts are conjunctions (‘and’, ‘or’) and prepositions (‘in’, ‘of’), but so are they in almost every English text in the world. We need to remove these frequent and uninteresting words to get to the interesting part. We remove the punctuation by defining our tokens. Regexp \w+ will keep full words and omit everything else. Next, we filter out the uninteresting words with a list of stopwords. The list is pre-set by the nltk package and contains frequently occurring conjunctions, prepositions, pronouns, adverbs and so on.

Ok, we did some essential preprocessing. Now let us observe the results.

This does look much better than before! Still, we could be a bit more precise. How about removing the words could, would, should and perhaps even said, since it doesn’t say much about the content of the tale? A custom list of stopwords would come in handy!

Open a plain text editor, such as Notepad++ or Sublime, and place each word you wish to filter on a separate line.

Save the file and load it next to the pre-set stopword list.
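The preprocessing steps described above can be sketched in plain Python. The short stopword set stands in for nltk's much longer English list, the lowercasing step is an assumption, and the sample sentence is made up; only the \w+ regexp comes directly from the setup above.

```python
import re
from collections import Counter

# A short hypothetical stopword list; nltk's English list is much longer.
STOPWORDS = {"and", "or", "in", "of", "the", "a", "to", "was"}


def preprocess(text, extra_stopwords=()):
    """Lowercase the text, keep only word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    blocked = STOPWORDS | set(extra_stopwords)
    return [t for t in tokens if t not in blocked]


text = "The king said to the fox: 'Go, and bring the little bird!'"
tokens = preprocess(text)
print(Counter(tokens).most_common(3))

# A custom list removes further uninformative words, e.g. "said".
custom = {"said", "could", "would", "should"}
print(preprocess(text, extra_stopwords=custom))
```

In practice the custom words would be read from the saved text file, one word per line, instead of being typed into the script.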

One final check in the Word Cloud should reveal we did a nice job preparing our data. We can now see the tales talk about kings, mothers, fathers, foxes and something that is little. Much more informative!

Related: Workshop: Text Analysis for Social Scientists