Univariate GSoC Success

The Google Summer of Code application period has come to an end. We received 34 applications, some of which were of truly high quality. Now it’s up to us to select the top candidates, but before that we wanted to get an overview of the candidate pool. We gathered the data from our Google Form and took a quick look at it in Orange.

First, we needed to preprocess the data a bit, since it came in a messy form of free-text strings. Feature Constructor to the rescue! We wanted to extract OS usage across applicants, so we made three new variables named ‘uses linux’, ‘uses windows’ and ‘uses osx’ to represent our three new columns. For each column we took the ‘OS_of_choice_and_why’ answer, converted its value to a lowercase string, and searched it for mentions of ‘linux’, ‘windows’ or ‘osx’. Voilà: if a mention occurred in the string, we marked the column with 1, else with 0.

The expression is just a logical statement in Python; its boolean result is stored as 0 (False) or 1 (True):

'linux' in str(OS_of_choice_and_why_.value).lower() or 'ubuntu' in str(OS_of_choice_and_why_.value).lower()
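
The same flags can also be computed outside the GUI. Here is a minimal plain-Python sketch of that logic; the sample answers and the uses() helper are ours, for illustration:

answers = [
    "I use Ubuntu because it is free",
    "Windows 7, it came with my laptop",
    "OSX, because of the hardware",
]

def uses(keywords, text):
    # 1 if any keyword is mentioned in the (lowercased) answer, else 0
    text = str(text).lower()
    return int(any(kw in text for kw in keywords))

for answer in answers:
    print(uses(('linux', 'ubuntu'), answer),  # uses linux
          uses(('windows',), answer),         # uses windows
          uses(('osx',), answer))             # uses osx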

Another thing we might want to do is create three discrete values for the ‘Dogs or cats’ question. We want Orange to display ‘dogs’ for someone who replied ‘dogs’, ‘cats’ for someone who replied ‘cats’, and ‘?’ if the answer was blank or very creative (we had people who wanted to be elephants and butterflies 🙂).

To create three discrete values you would write:

0 if 'dogs' in str(Dogs_or_cats_.value).lower() else 1 if 'cats' in str(Dogs_or_cats_.value).lower() else 2

Since we have three values, we need to assign them the corresponding indexes: if there is ‘dogs’ in the reply, we get 0 (which we mapped to ‘dogs’ in the Feature Constructor’s ‘Values’ box), 1 if there’s ‘cats’ in the reply, and 2 if none of the above apply.
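
Sketched as a plain-Python function (ours, for illustration), with the index-to-value pairing playing the role of the ‘Values’ box:

values = ('dogs', 'cats', '?')

def dogs_or_cats(reply):
    reply = str(reply).lower()
    # the same logical statement as above, used as an index into values
    return values[0 if 'dogs' in reply else 1 if 'cats' in reply else 2]

print(dogs_or_cats('Dogs, definitely'))   # dogs
print(dogs_or_cats('I am a cats person')) # cats
print(dogs_or_cats('an elephant'))        # ?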


Ok, the next step was to sift through a big pile of attributes. We removed personal information out of privacy concerns and selected the attributes we cared about most: for example programming skills, years of experience, contributions to OSS, and of course whether someone is a dog or a cat person. 🙂 Select Columns sorts this out. Here you can download a mock-up workflow (same as above, but without the sensitive data).

Now for some lovely charts. Enjoy!

[Figure: Python is our lingua franca, experts wanted!]

[Figure: 20 years of programming experience? Hello outlier!]

[Figure: OSS all the way!]

[Figure: Some people love dogs and some love cats. Others prefer elephants and butterflies.]

Mining our own data

Recently we made a short survey that, upon Orange download, asked people how they found out about Orange, what their data mining level is, and where they work. The main purpose was to get better insight into our user base and to figure out the profile of people interested in trying Orange.

Here we have some preliminary results that we’ve managed to gather in the past three weeks or so. Obviously we will use Orange to help us make sense of the data.

We’ve downloaded our data from Typeform and appended some background information such as OS and browser. Let’s see what we’ve got in the Data Table widget.

Ok, this is our entire data table, including both the people who completed the survey and those who didn’t. First, let’s organize the data properly. We’ll do this with the Select Columns widget.

We removed all the meta attributes, as they are not very relevant for our analysis. Next we moved the ‘completed’ attribute to the target variable slot, making it our class variable.
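
For scripting fans, the same rearrangement is a small domain transformation in Orange. A minimal sketch, assuming Orange 3’s scripting API and a hypothetical survey.tab file with a ‘completed’ column:

from Orange.data import Domain, Table

data = Table('survey.tab')  # hypothetical file with the survey answers
completed = data.domain['completed']

# Keep the other attributes, drop the metas, make 'completed' the class
attributes = [var for var in data.domain.attributes if var is not completed]
data = data.transform(Domain(attributes, class_vars=[completed]))
print(data.domain.class_var)  # completed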

Now we would like to see some basic distributions from our data.

Interesting. Most of our users are working on Windows, a few on Mac and very few on Linux.

Let’s investigate further. Now we want to know more about the people who actually completed the survey. Let’s use Select Columns again, this time removing os_type, os_name, agent_name and completed from our data and keeping just the answers. We made “Where do you work?” our class variable, but we could have used any of the three. Another trick is to set it directly in the Distributions widget under ‘Group by’.

Ok, let’s again use Distributions – this is such a simple way to get a good sense of your data.

Obviously, out of those who found out about Orange in college, most are students; what’s interesting is just how many of them there are. We can also see that out of those who found us on the web, most come from the private sector, followed by academia and researchers. Good. How about the other question?

Again, the results are not particularly shocking, but it’s great to confirm your hypothesis with real data. Among beginner-level data miners most are students, while most intermediate users come from the industry.

A quick look at the Mosaic Display will give us a good overview:

Yup, this sums it up quite nicely. We have lots of beginner-level users and not many expert ones (height of the boxes). Also, most people found out about Orange on the web or in college (width of the boxes). A thin line on the left of each box shows the a priori distribution, making it easier to compare the expected and actual number of instances. For example, there should be at least some people who are students and found out about Orange at a conference. But there aren’t; the contrast between how much red there should be in the box (the line on the left) and how much there actually is (the bigger part of the box) is quite telling. We could even select all the beginner-level users who found out about Orange in college and inspect the data further, but let that be enough for now.

Our final workflow:

[Figure: the final workflow]

Obviously, this is a very simple analysis. But even such simple tasks are never boring with good visualization tools such as Distributions and Mosaic Display. You could also use Venn Diagram to find common features of selected subsets or perhaps Sieve Diagram for probabilities.

We are very happy to have these data and would like to thank everyone who completed the survey. If you wish to help us further, please fill out a longer survey that won’t take more than 3 minutes of your time (we timed it!).

Happy Friday everyone!

Ghostbusters

Ok, we recently stumbled across an interesting article on how to deal with non-normal (non-Gaussian) data.

We have an absolutely paranormal data set of 20 persons with weight, height, paleness, vengefulness, habitation and age attributes (download).


Let’s check the distributions in the Distributions widget.


Our first attribute is “Weight” and we see a little hump on the left. Otherwise the data would be normally distributed. Ok, so perhaps we have a few children in the data set. Let’s check the age distribution.

Whoa, what? Why is the hump now on the right? These distributions look scary. We seem to have a few reaaaaally old people here. What is going on? Perhaps we can figure this out with MDS. This widget projects the data into two dimensions so that the distances between the points correspond to differences between the data instances.
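
Outside the widget, roughly the same projection can be sketched with scikit-learn’s MDS (our substitution, not the widget’s internals); the synthetic numbers below merely stand in for the actual paranormal set:

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# 17 ordinary people (weight, height, age) and 3 very light, very old ghosts
people = rng.normal([75, 175, 40], [10, 8, 12], size=(17, 3))
ghosts = rng.normal([2, 170, 300], [0.5, 8, 50], size=(3, 3))
X = np.vstack([people, ghosts])

# Project to 2D so that point distances mirror distances between instances
coords = MDS(n_components=2, random_state=0).fit_transform(X)
print(coords[-3:])  # the three ghosts land far from the main cluster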


Aha! Now we see that three instances are quite different from all others. Select them and send them to the Data Table for final inspection.


Busted! We have found three ghosts hiding in our data. They are extremely light (the sheets they are wearing must weigh around 2 kg), quite vengeful, and old.

Now, jokes aside, what would this mean for general non-normally distributed data? For one thing, your data set might be too small. Here we only have 20 instances, so 3 outlying ghosts have a great impact on the distribution. It is difficult to hide 3 ghosts among 17 normal persons.

Secondly, why can’t we use the Outliers widget to hunt for those ghosts? Again, our data set is too small. With just 20 instances, the estimation variance is so large that it can easily cover a few ghosts under its sheet. We don’t have enough “normal” data to define what is normal and thus detect the paranormal.
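
To see why, here is a quick back-of-the-envelope check (ours, not the Outliers widget’s actual method): three 2 kg ghosts among 17 ordinary weights drag the mean down and inflate the standard deviation enough that a classic |z| > 3 rule flags nothing.

import numpy as np

rng = np.random.default_rng(42)
weights = np.concatenate([rng.normal(75, 10, 17),  # 17 ordinary people
                          np.full(3, 2.0)])        # 3 ghosts under 2 kg sheets

z = (weights - weights.mean()) / weights.std()
print(round(weights.mean(), 1), round(weights.std(), 1))  # spread inflated by the ghosts
print(round(np.abs(z).max(), 2))  # roughly 2.4: no |z| > 3, the ghosts stay hidden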

Haven’t we just written two exactly opposite things? Perhaps.

Happy Halloween everybody! 🙂


Fink packages now also 64-bit

Fink packages (which we use for system-wide Orange installations on Mac OS X) have been updated to 64-bit. So if you are using a 64-bit Fink installation, you can now also use Orange (and our binary Fink repository of already-compiled packages). Just use this installation script to configure your local Fink installation to use our binary Fink repository and to add information about the Orange packages (they are not available among the official Fink packages).

Debian repository lives!

We have made a still-experimental-but-probably-working Debian repository with daily builds of Orange packages, currently without add-ons.

To get access to these packages, just add these two lines to your /etc/apt/sources.list (this file contains the list of package repositories):

deb http://orange.biolab.si/debian lenny main
deb-src http://orange.biolab.si/debian lenny main

And then you can install Orange with these commands:

aptitude update
aptitude install orange-svn

The packages are not signed, since they are built automatically, so you will probably be warned about this.

These packages will probably also work on other Debian-based Linux distributions like Ubuntu, but they have not yet been tested there. Please test them and report how it goes.

You can also get the source of these packages with this command:

apt-get source python-orange-svn

And then build the package yourself in the extracted source directory with:

dpkg-buildpackage

For example, this will be useful on the amd64 platform, for which we do not yet provide binary packages, but we will once we see how well this system works. (Edit: we now provide binary packages for the amd64 platform as well.)