This weekend we were in Lisbon, Portugal, at the Why the World Needs Anthropologists conference, an event that focuses on applied anthropology, design, and how soft skills can greatly benefit the industry. I was there to hold a workshop on Data Ethnography, an approach that tries to combine methods from data science and anthropology into a fruitful interdisciplinary mix!
Data ethnography is a novel methodological approach that tries to view social phenomena from two different points of view – qualitative and quantitative. The quantitative approach is using data mining and machine learning methods on anthropological data (say from sensors, wearables, social media, online fora, field notes and so on) trying to find interesting patterns and novel information. The qualitative approach uses ethnography to substantiate the analytical findings with context, motivations, values, and other external data to provide a complete account of the studied phenomenon.
At the workshop, I presented a couple of approaches I use in my own research, namely text mining, clustering, visualization of patterns, image analytics, and predictive modeling. Data ethnography can be used, not only in its native field of computational anthropology, but also in museology, digital anthropology, medical anthropology, and folkloristics (the list is probably not exhaustive). There are so many options just waiting for the researchers to dig in!
However, having data- and tech-savvy anthropologists does not only benefit the research, but opens a platform for discussing the ethics of data science, human relationships with technology, and overcoming model bias. Hopefully, the workshop inspired some of the participants to join me on a journey through the amazing expanses of data science.
In the past couple of weeks we have been working hard on introducing a better language support for the Text add-on. Until recently, Orange supported only a limited number of languages, mostly English and some bigger languages, such as Spanish, German, Arabic, Russian… Language support was most evident in the list of stopwords, normalization and POS tagging.
Stopwords come from NLTK library, so we can only offer whatever is available there. However, TF-IDF already implicitly considers stopwords, so the functionality is already implemented. For POS tagging, we would rely on Stanford POS tagger, that already has pre-trained models available.
The main issue was with normalization. While English can do without lemmatization and stemming for simple tasks, morphologically rich languages, such as Slovenian, perform poorly on un-normalized tokens. Cases and declensions present a problem for natural language processing, so we wanted to provide a tool for normalization in many different languages. Luckily, we found UDPipe, a Czech initiative that offers trained lemmatization models for 50 languages. UDPipe is actually a preprocessing pipeline and we are already thinking about how to bring all of its functionality to Orange, but let us talk a bit about the recent improvements for normalization.
Let us load a simple corpus from Corpus widget, say grimm-tales-selected.tab that contain 44 tales from the Grimm Brothers. Now, pass them through Preprocess Text and keep just the defaults, namely lowercase transformation, tokenization by words, and removal of stopwords. Here we see that we have came as quite a frequent word and come as a bit less frequent. But semantically, they are the same word from the verb to come. Shouldn’t we consider them as one word?
We can. This is what normalization does – it transforms all words into their lemmas or basic grammatical form. Came and come will become come, sons and son will become son, pretty and prettier will become pretty. This will result in less tokens that capture the text better, semantically speaking.
We can see that came became come with 435 counts. Went became go. Said became say. And so on. As we said, this doesn’t work only on verbs, but on all word forms.
One thing to note here. UDPipe has an internal tokenizer, that works with sentences instead of tokens. You can enable it by selecting UDPipe tokenizer option. What is the difference? A quicker version would be to tokenize all the words and just look up their lemma. But sometimes this can be wrong. Consider the sentence:
I am wearing a tie to work.
Now the word tie is obviously a piece of clothing, which is indicated by the word wearing before it. But tie alone can also be the verb to tie. So the UDPipe tokenizer will consider the entire sentence and correctly lemmatize this word, while lemmatization on regular tokens might not. While UDPipe works better, it is also slower, so you might want to work with regular tokenization to speed up the analysis.
Finally, UDPipe does not remove punctuation, so you might end up with words like rose. and away., with the full stop at the end. This you can fix with using regular tokenization and also by select the Regex option in Filtering, which will remove pure punctuation.
This is it. UDPipe contains lemmatization models for 50 languages and only when you click on a particular language in the Language option, will the resource be loaded, so your computer won’t be flooded with models for languages you won’t ever use. The installation of UDPipe could also be a little tricky, but after some initial obstacles, we have managed to prepare packages for both pip (OSX and Linux) and conda (Windows).
We hope you enjoy the new possibilities of a freshly multilingual Orange!
Did you know that Orange has already been to space? Rosario Brunetto (IAS-Orsay, France) has been working on the analysis of infrared images of asteroid Ryugu as a member of the JAXA Hayabusa2 team. The Hayabusa2 asteroid sample-return mission aims to retrieve data and samples from the near-Earth Ryugu asteroid and analyze its composition. Hayabusa2 arrived at Ryugu on June 27 and while the spacecraft will return to Earth with a sample only in late 2020, the mission already started collecting and sending back the data. And of course, a part of the analysis of Hayabusa’s space data has been done in Orange!
Dr. Brunetto is currently working with the first part of the data, namely the macro hyperspectral images of the asteroid. Several tens of thousands of spectra over 70 spectral channels have already been acquired. The main goal of this initial exploration was to constrain the surface composition.
Once the data was preprocessed and cleaned in Python, separate surface regions were extracted in Orange with k-Means and PCA and plotted with the HyperSpectra widget, which comes as a part of the Spectroscopy package. So why was Orange chosen over other tools? Dr. Brunetto says Orange is an easy and friendly tool for complicated things, such as exploring the compositional diversity of the asteroid at the different scales. There are many clustering techniques he can use in Orange and he likes how he can interactively change the number of clusters and the changes immediately show in the plot. This enables the researchers to determine the level of granularity of the analysis, while they can also immediately inspect how each cluster looks like in a hyperspectra plot.
Moreover, one can quickly test methods and visualize the effects and at the same time have a good overview of the workflow. Workflows can also be reused once the new data comes in or, if the pipeline is standard, used on a completely different data set!
We would of course love to show you the results of the asteroid analysis, but as the project is still ongoing, the data is not yet available to the public. Instead, we asked Zélia Dionnet, dr. Brunetto’s PhD student, to share the results of her work on the organic and mineralogic heterogeneity of the Paris meteorite, which were already published.
She analyzed the composition of the Paris meteorite, which was discovered in 2008 in a statue. The story of how the meteorite was found is quite interesting in itself, but we wanted to know more on how the sample was analyzed in Orange. Dionnet had a slightly larger data set, with 16,000 spectra and 1600 wavenumbers. Just like dr. Brunetto, she used k-Means to discover interesting regions in the sample and Hyperspectra widget to plot the results.
At the top, you can see a 2D map of the meteorite sample showing the distribution of the clusters that were identified with k-Means. At the bottom, you see cluster averages for the spectra. The green region is the most interesting one and it shows crystalline minerals, which formed billions of years ago as the hydrothermal processes in the asteroid parent body of the meteorite turned amorphous silicates into phyllosilicates. The purple, on the contrary, shows different micro-sized minerals.
This is how to easily identify the compositional structure of samples with just a couple of widgets. Orange seems to love going to space and can’t wait to get its hands dirty with more astro-data!
In the past month, we had two workshops that focused on text mining. The first one, Faksi v praksi, was organized by the University of Ljubljana Career Centers, where high school students learned about what we do at the Faculty of Computer and Information Science. We taught them what text mining is and how to group a collection of documents in Orange. The second one took on a more serious note, as the public sector employees joined us for the third set of workshops from the Ministry of Public Affairs. This time, we did not only cluster documents, but also built predictive models, explored predictions in nomogram, plotted documents on a map and discovered how to find the emotion in a tweet.
These workshops gave us a lot of incentive to improve the Text add-on. We really wanted to support more languages and add extra functionalities to widgets. In the upcoming week, we will release the 0.5.0 version, which introduces support for Slovenian in Sentiment Analysis widget, adds concordance output option to Concordances and, most importantly, implements UDPipe lemmatization, which means Orange will now support about 50 languages! Well, at least for normalization. 😇
Today, we will briefly introduce sentiment analysis for Slovenian. We have added the KKS 1.001 opinion corpus of Slovene web commentaries, which is a part of the CLARIN infrastructure. You can access it in the Corpus widget. Go to Browse documentation corpora and look for slo-opinion-corpus.tab. Let’s have a quick view in a Corpus Viewer.
The data comes from comment sections of Slovenian online media and contains a fairly expressive language. Let us observe, whether a post is negative or positive. We will use Sentiment Analysis widget and select the Liu Hu method for Slovenian. This is a dictionary based method, where the algorithm sums the positive words and subtracts the sum of negative words. This gives a final score of the post.
We will have to adjust the attributes for a nicer view in a Select Columns widget. Remove all attributes other than sentiment.
Finally, we can observe the results in a Heat Map. The blue lines are the negative posts, while the yellow ones are positive. Let us select the most positive tweets and see, what they are about.
Looks like Slovenians are happy, when petrol gets cheaper and sports(wo)men are winning. We can relate.
Of course, there are some drawbacks of lexicon-based methods. Namely, they don’t work well with phrases, they often don’t consider modern language (see ‘Jupiiiiiii’ or ‘Hooooooraaaaay!’, where the more the letters, the more expressive the word is) and they fail with sarcasm. Nevertheless, even such crude methods give us a nice glimpse into the corpus and enable us to extract interesting documents.
Stay tuned for the information on the release date and the upcoming post on UDPipe infrastructure!
On Kickstarter most app ideas don’t get funded. But why is that? When we are looking for possible explanations, it is easy to ascribe the failure to the type of the idea.
But what about those rare cases, where an app idea gets funded? Can we figure out why a particular idea succeeded? Our new widget Explain Predictions can do just that – explain why they will succeed. Or at least, explain why the classifier thinks they will.
First, let us load the Kickstarter data from the Datasets widget and inspect it in a Data Table.
Now, let’s see why the app Create Games & Apps Without Any Coding got funded.
Explain Predictions needs 3 inputs. Our data set, a classifier and a data sample that we wish to inspect. Connect File widget with Explain Predictions. Then add the classifier, say, Logistic Regression. Finally, select Create Games & Apps Without Any Coding in the Data Table and connect it to the widget.
The highest ranking attributes are those that contributed the most (high Score value). The fact that there were 11 pledge levels, 13 images, many connections to other projects and the length of the project description – all of these attributes add something positive to the funding. On the other side, we see how the duration of the project, description length, maximal pledge tiers and the type of the idea work against the decision to fund the project. Lastly, not having a Facebook page or a video amounts to almost nothing in the making of the final prediction.
When explaining the decision of the classifier, we look at the values of the attributes for our sample and how they interact. We do that by approximating Shapely value, since calculating it exactly would sometimes take more then a lifetime. That means customized explanations for every individual case, while treating classifier like a black box. You could do the same for any model the Orange offers, including Neural Networks!
And there you have it, an easy way to know what makes your Kickstarter campaign succeed, cell be classified as healthy, or a bank loan approved.
Last week Blaž, Marko and I held a week long introductory Data Mining and Machine Learning course at the Ljubljana Doctoral Summer School 2018. We got a room full of dedicated students and we embarked on a journey through standard and advanced machine learning techniques, all presented of course in Orange. We have covered a wide array of topics, from different clustering techniques (hierarchical clustering, k-means) to predictive models (logistic regression, naive Bayes, decision trees, random forests), regression and regularization, projections, text mining and image analytics.
Definitely the biggest crowd-pleaser was the Geo add-on in combination with the HDI data set. First, we got the HDI data from Datasets. A quick glimpse into a data table to check the output. We have information on some key performance indicators gathered by the United Nations for 188 countries. Now we would like to know which countries are similar based on the reported indicators. We will use Distances with Euclidean distance and use Ward linkage in Hierarchical Clustering.
We got our results in a dendrogram. Interestingly, the United States seems similar to Cuba. Let us select this cluster and inspect what the most significant feature for this cluster. We will use the Data output of Hierarchical Clustering which append a column indicating whether the data instances was selected or not. Then we will use Box Plot, group by Selected and check Order by relevance. It seems like these countries have the longest life expectancy at age 59. Go ahead and inspect other clusters by yourself!
Of course, when we are talking about countries one naturally wants to see them on a map! That is easy. We will use the Geo add-on. First, we need to convert all the country names to geographical coordinates. We will do this with Geocoding, where we will encode column Country to latitude and longitude. Remember to use the same output as before, that is Data to Data.
Now, let us display these countries on a map with Choropleth widget. Beautiful. It is so easy to explore country data, when you see it on a map. You can try coloring also by HDI or any other feature.
The final workflow:
We always try to keep our workshops fresh and interesting and visualizations are the best way to achieve this. Till the next workshop!
This week we held our first Girls Go Data Mining workshop. The workshop brought together curious women and intuitively introduced them to essential data mining and machine learning concepts. Of course, we used Orange to explore visualizations, build predictive models, perform clustering and dive into text analysis. The workshop was supported by NumFocus through their small development grant initiative and we hope to repeat it next year with even more ladies attending!
In two days, we covered many topics. On day one, we got to know Orange and the concept of visual programming, where the user construct analytical workflow by stacking visual components. Then we got to know several useful visualizations, such as box plot, scatter plot, distributions, and mosaic display, which give us an initial overview of the data and the potentially interesting patterns. Finally, we got our hands dirty with predictive modeling. We learnt about decision trees, logistic regression, and naive Bayes classifiers, and observed the models in tree viewer and nomogram. It is great having interpretable models and we had great fun exploring what is in the model!
On the second day, we tried to uncover groups in our data with clustering. First, we tried hierarchical clustering and explored the discovered clusters with box plot. Then we also tried k-means and learnt, why this method is better than hierarchical clustering. In the final part, we talked about the methods for text mining, how to do preprocessing, construct a bag of words and perform the machine learning on corpora. We used both clustering and classification and tried to find interesting information about Grimm tales.
One thing that always comes up as really useful in our workshops is Orange’s ability to output different types of data. For example, in Hierarchical Clustering, we can select the similarity cutoff at the top and output clusters. Our data table will have an additional column Cluster, with cluster labels for each data instance.
We can explore clusters by connecting a Box Plot to Hierarchical Clustering, selecting Cluster in Subgroups and using Order by relevance option. This sorts the variables in Box Plot by how well they separate between clusters or, in other words, what is typical of each cluster.
We used zoo.tab and made the cutoff at three clusters. It looks like the first cluster gives milk. Could these be a cluster of mammals?
Indeed it is!
Another option is to select a specific cluster in the dendrogram. Then, we have to rewire the connection between Hierarchical Clustering and Box Plot by setting it to Data. Data option will output the entire data set, with an extra column showing whether the data instance was selected or not. In our case, there would be a Yes if the instance is in the selected cluster and No if it is not.
Then we can use Box Plot to observe what is particular for our selected cluster.
It looks like animals from our selected cluster have feathers. Probably, this is a cluster of birds. We can check this with the same procedure as above.
In summary, most Orange visualizations have two outputs – Selected Data and Data. Selected Data will output a subset of data instances selected in the visualization (or selected clusters in the case of hierarchical clustering), while Data will output the entire data table with a column defining whether a data instance was selected or not. This is very useful if we want to inspect what is typical of an interesting group in our data, inspect clusters or even manually define groups.
Overall, this was another interesting workshop and we hope to continue our fruitful partnership with NumFocus and keep offering free educational events for beginners and experts alike!
Today we have finished a series of workshops for the Ministry of Public Affairs. This was a year-long cooperation and we had many students asking many different questions. There was however one that we talked about a lot. If I have a survey, how do I get it into Orange?
We are using EnKlik Anketa service, which is a great Slovenian product offering a wide array of options for the creation of surveys. We have created one such simple survey to use as a test. I am now inside EnKlik Anketa online service and I can see my survey has been successfully filled out.
Now I have to create a public link to my survey in order to access the data in Orange. I have to click on an icon in the top right part and select ‘Public link’.
A new window opens, where I select ‘Add new public link’. This will generate a public connection to my survey results. But be careful, the type of the connection needs to be Data, not Analysis! Orange can’t read already analyzed data, it needs raw data from Data pane.
Now, all I have to do is open Orange, place EnKlik Anketa widget from the Prototypes add-on onto the canvas, enter the public link into the ‘Public link URL’ fields and press Enter. If your data has loaded successfully, the widget will display available variables and information in the Info pane.
From here on you can continue your analysis just like you would with any other data source!
Last week Marko and I visited the land of the midnight sun – Norway! We held a two-day workshop on spectroscopy data analysis in Orange at the Norwegian University of Life Sciences. The students from BioSpec lab were yet again incredible and we really dug deep into Orange.
One thing we did was see how to join data from two different sources. It would often happen that you have measurements in one file and the labels in the other. Or in our case, we wanted to add images to our zoo.tab data. First, find the zoo.tab in the File widget under Browse documentation datasets. Observe the data in the Data Table.
This data contains 101 animal described with 16 different features (hair, aquatic, eggs, etc.), a name and a type. Now we will manually create the second table in Excel. The first column will contain the names of the animals as they appear in the original file. The second column will contain links to images of animals. Open your favorite browser and find a couple of images corresponding to selected animals. Then add links to images below the image column. Just like that:
Remember, you need a three-row header to define the column that contains images. Under the image column add string in the second and type=image in the third row. This will tell Orange where to look for images. Now, we can check our animals in Image Viewer.
Finally, it is time to bring in the images to the existing zoo data set. Connect the original File to Merge Data. Then add the second file with animal images to Merge Data. The default merging method will take the first data input as original data and the second data as extra data. The column to match by is defined in the widget. In our case, it is the name column. This means Orange will look at the first name column and find matching instances in the second name column.
A quick look at the merged data shows us an additional image column that we appended to the original file.
This is the final workflow. Merge Data now contains a single data table on the output and you can continue your analysis from there.
Find out more about spectroscopy for Orange on our YouTube channel or contribute to the project on Github.
Python Script is this mysterious widget most people don’t know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer. And it’s time we unveil some of its functionalities with a simple example.
Example: Batch Transform the Data
There might be a time when you need to apply a function to all your attributes. Say you wish to log-transform their values, as it is common in gene expression data. In theory, you could do this with Feature Constructor, where you would log-transform every attribute individually. Sounds laborious? It’s because it is. Why else we have computers if not to reduce manual labor for certain tasks? Let’s do it the fast way – with Python Script.
First, open File widget and load geo-gds360.tab from Browse documentation data sets. This data set has 9485 features, so imagine having to transform each feature individually.
Instead, we will connect Python Script to File and use a simple script to apply the same transformation to all attributes.
import numpy as np
from Orange.data import Table
new_X = np.log(in_data.X)
out_data = Table(in_data.domain, new_X, in_data.Y, in_data.metas)
This is really simple. Use in_data.X, which accesses all features in the data set, to transform the data with np.log (or any other numpy function). Set out_data to new_X and, voila, the transformed data is on the output. In a few lines we have instantly handled all 9485 features.
You can inspect the data before and after transformation in a Data Table widget.
This is it. Now we can do our standard analysis on the transformed data. Even better! We can save our script and use it in Python Script widget any time we want.
For your convenience I have already added the
Log Attributes Script, so you can download and use it instantly!
Have a more interesting example with Python Script? We’d love to hear about it!