This weekend we were in Lisbon, Portugal, at the Why the World Needs Anthropologists conference, an event that focuses on applied anthropology, design, and how soft skills can greatly benefit the industry. I was there to hold a workshop on Data Ethnography, an approach that tries to combine methods from data science and anthropology into a fruitful interdisciplinary mix!
Data ethnography is a novel methodological approach that views social phenomena from two different points of view: qualitative and quantitative. The quantitative approach applies data mining and machine learning methods to anthropological data (say from sensors, wearables, social media, online fora, field notes and so on) to find interesting patterns and novel information. The qualitative approach uses ethnography to substantiate the analytical findings with context, motivations, values, and other external data to provide a complete account of the studied phenomenon.
At the workshop, I presented a couple of approaches I use in my own research, namely text mining, clustering, visualization of patterns, image analytics, and predictive modeling. Data ethnography can be used not only in its native field of computational anthropology but also in museology, digital anthropology, medical anthropology, and folkloristics (the list is probably not exhaustive). There are so many options just waiting for the researchers to dig in!
However, having data- and tech-savvy anthropologists not only benefits research but also opens a platform for discussing the ethics of data science, human relationships with technology, and overcoming model bias. Hopefully, the workshop inspired some of the participants to join me on a journey through the amazing expanses of data science.
In the past couple of weeks we have been working hard on introducing better language support for the Text add-on. Until recently, Orange supported only a limited number of languages, mostly English and some bigger languages, such as Spanish, German, Arabic, Russian… Language support was most evident in the list of stopwords, normalization and POS tagging.
Stopwords come from the NLTK library, so we can only offer whatever is available there. However, TF-IDF already implicitly accounts for stopwords, since words that occur in nearly every document receive a low weight, so this functionality is in effect already available. For POS tagging, we would rely on the Stanford POS tagger, which already has pre-trained models available.
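To see why TF-IDF handles stopwords implicitly, here is a minimal sketch of the textbook formula (term frequency times the logarithm of inverse document frequency). The function name tf_idf and the three toy documents are made up for illustration, and the Bag of Words widget may use a different variant, for example with smoothing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, invented for this example.
docs = [
    "the fox ran to the forest".split(),
    "the king sat in the castle".split(),
    "the wolf ate the goat".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"], scores[0]["fox"])
```

The word the occurs in every document, so its inverse document frequency is log(1) = 0 and its weight vanishes, while fox, which occurs in a single document, keeps a positive weight.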
The main issue was with normalization. While English can do without lemmatization and stemming for simple tasks, morphologically rich languages, such as Slovenian, perform poorly on un-normalized tokens. Cases and declensions present a problem for natural language processing, so we wanted to provide a tool for normalization in many different languages. Luckily, we found UDPipe, a Czech initiative that offers trained lemmatization models for 50 languages. UDPipe is actually a preprocessing pipeline and we are already thinking about how to bring all of its functionality to Orange, but let us talk a bit about the recent improvements for normalization.
Let us load a simple corpus from the Corpus widget, say grimm-tales-selected.tab, which contains 44 tales from the Grimm Brothers. Now, pass them through Preprocess Text and keep just the defaults, namely lowercase transformation, tokenization by words, and removal of stopwords. Here we see that came is quite a frequent word and come a slightly less frequent one. But semantically, they are the same word, both forms of the verb to come. Shouldn’t we consider them as one word?
We can. This is what normalization does: it transforms all words into their lemmas or basic grammatical form. Came and come will become come, sons and son will become son, pretty and prettier will become pretty. This will result in fewer tokens that capture the text better, semantically speaking.
We can see that came became come with 435 counts. Went became go. Said became say. And so on. As we said, this doesn’t work only on verbs, but on all word forms.
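The effect of normalization on word counts can be sketched in a few lines. The lemma table below is hand-written and hypothetical, a stand-in for a trained model such as the ones UDPipe provides, and the sentence is made up.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {
    "came": "come",
    "went": "go",
    "said": "say",
    "sons": "son",
    "prettier": "pretty",
}


def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token with its lemma; unknown tokens stay unchanged."""
    return [lemmas.get(token, token) for token in tokens]


tokens = "the sons came and the son said come".split()
raw_counts = Counter(tokens)
normalized_counts = Counter(lemmatize(tokens))

print(raw_counts.most_common())
print(normalized_counts.most_common())
```

After normalization, came and come are counted together, as are sons and son, so the corpus is described by fewer distinct tokens.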
One thing to note here. UDPipe has an internal tokenizer that works with sentences instead of tokens. You can enable it by selecting the UDPipe tokenizer option. What is the difference? A quicker approach would be to tokenize all the words and just look up their lemmas. But sometimes this can be wrong. Consider the sentence:
I am wearing a tie to work.
Now the word tie is obviously a piece of clothing, which is indicated by the word wearing before it. But tie alone can also be the verb to tie. So the UDPipe tokenizer will consider the entire sentence and correctly lemmatize this word, while lemmatization on regular tokens might not. While UDPipe works better, it is also slower, so you might want to work with regular tokenization to speed up the analysis.
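The difference between the two strategies can be illustrated with a toy sketch. It uses the word saw rather than tie, because the lemma of saw visibly changes with its role (the verb see versus the noun saw). The lemma table and the determiner rule are hypothetical simplifications and say nothing about how UDPipe itself analyzes a sentence.

```python
# Hypothetical context-free lemma table: every form maps to one lemma.
LOOKUP = {"saw": "see", "am": "be", "wearing": "wear"}

# Words that typically precede a noun (a toy stand-in for sentence analysis).
DETERMINERS = {"a", "an", "the", "my", "this"}


def lemmatize_lookup(tokens):
    """Look up every token on its own, ignoring its neighbors."""
    return [LOOKUP.get(token, token) for token in tokens]


def lemmatize_in_context(tokens):
    """Use the previous token as a crude hint about the word's role."""
    lemmas = []
    for i, token in enumerate(tokens):
        after_determiner = i > 0 and tokens[i - 1] in DETERMINERS
        # After a determiner the word is most likely a noun, so keep it as is.
        lemmas.append(token if after_determiner else LOOKUP.get(token, token))
    return lemmas


sentence = "i cut the wood with a saw".split()
print(lemmatize_lookup(sentence))
print(lemmatize_in_context(sentence))
```

The plain lookup turns the tool into the verb see, while the context-aware version leaves it alone and still lemmatizes saw as see in a sentence such as "i saw the king".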
Finally, UDPipe does not remove punctuation, so you might end up with words like rose. and away., with the full stop at the end. You can fix this by using regular tokenization and by selecting the Regex option in Filtering, which will remove pure punctuation.
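Here is a small sketch of both fixes in plain Python. The function names and the regular expressions are our own choices for illustration and are not necessarily the patterns the widget uses.

```python
import re
import string


def tokenize_words(text):
    """Regular word tokenization: keep runs of word characters only."""
    return re.findall(r"\w+", text.lower())


def drop_pure_punctuation(tokens):
    """Remove tokens that consist of nothing but non-word characters."""
    return [token for token in tokens if not re.fullmatch(r"\W+", token)]


def strip_edge_punctuation(tokens):
    """Trim punctuation from both ends of each token, dropping empty results."""
    stripped = (token.strip(string.punctuation) for token in tokens)
    return [token for token in stripped if token]


tokens = ["she", "picked", "a", "rose.", "and", "ran", "away.", "."]
print(drop_pure_punctuation(tokens))
print(strip_edge_punctuation(tokens))
print(tokenize_words("She picked a rose. And ran away."))
```

Filtering pure punctuation removes the lone full stop but leaves rose. untouched, which is why regular tokenization (or trimming the token edges) is needed as well.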
This is it. UDPipe contains lemmatization models for 50 languages and only when you click on a particular language in the Language option, will the resource be loaded, so your computer won’t be flooded with models for languages you won’t ever use. The installation of UDPipe could also be a little tricky, but after some initial obstacles, we have managed to prepare packages for both pip (OSX and Linux) and conda (Windows).
We hope you enjoy the new possibilities of a freshly multilingual Orange!
In the past month, we had two workshops that focused on text mining. The first one, Faksi v praksi, was organized by the University of Ljubljana Career Centers, where high school students learned about what we do at the Faculty of Computer and Information Science. We taught them what text mining is and how to group a collection of documents in Orange. The second one struck a more serious note, as public sector employees joined us for the third set of workshops from the Ministry of Public Affairs. This time, we not only clustered documents but also built predictive models, explored predictions in a Nomogram, plotted documents on a map and discovered how to find the emotion in a tweet.
These workshops gave us a lot of incentive to improve the Text add-on. We really wanted to support more languages and add extra functionalities to widgets. In the upcoming week, we will release the 0.5.0 version, which introduces support for Slovenian in Sentiment Analysis widget, adds concordance output option to Concordances and, most importantly, implements UDPipe lemmatization, which means Orange will now support about 50 languages! Well, at least for normalization. 😇
Today, we will briefly introduce sentiment analysis for Slovenian. We have added the KKS 1.001 opinion corpus of Slovene web commentaries, which is a part of the CLARIN infrastructure. You can access it in the Corpus widget. Go to Browse documentation corpora and look for slo-opinion-corpus.tab. Let’s have a quick view in a Corpus Viewer.
The data comes from comment sections of Slovenian online media and contains fairly expressive language. Let us observe whether a post is negative or positive. We will use the Sentiment Analysis widget and select the Liu Hu method for Slovenian. This is a dictionary-based method, where the algorithm counts the positive words in a post and subtracts the count of negative words. This gives the final score of the post.
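The idea behind such a dictionary-based score fits in a few lines. The two word lists below are tiny, made-up English examples and not the actual Liu Hu lexicon, and the sketch returns the raw difference, whereas the widget may additionally normalize the score by document length.

```python
# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"good", "happy", "cheaper", "win"}
NEGATIVE = {"bad", "sad", "expensive", "lose"}


def lexicon_score(tokens):
    """Count positive words and subtract the count of negative words."""
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive - negative


print(lexicon_score("petrol is cheaper and we are happy".split()))
print(lexicon_score("what a bad and sad day".split()))
print(lexicon_score("the fox crossed the road".split()))
```

A post with more positive than negative words gets a positive score, the opposite gives a negative one, and a post with no lexicon words at all scores zero.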
We will have to adjust the attributes for a nicer view in a Select Columns widget. Remove all attributes other than sentiment.
Finally, we can observe the results in a Heat Map. The blue lines are the negative posts, while the yellow ones are positive. Let us select the most positive posts and see what they are about.
Looks like Slovenians are happy when petrol gets cheaper and sports(wo)men are winning. We can relate.
Of course, there are some drawbacks of lexicon-based methods. Namely, they don’t work well with phrases, they often don’t consider modern language (see ‘Jupiiiiiii’ or ‘Hooooooraaaaay!’, where the more the letters, the more expressive the word is) and they fail with sarcasm. Nevertheless, even such crude methods give us a nice glimpse into the corpus and enable us to extract interesting documents.
Stay tuned for the information on the release date and the upcoming post on UDPipe infrastructure!
How do you explain text mining in 3 hours? Is it even possible? Can someone be ready to build predictive models and perform clustering in a single afternoon?
It seems so, especially when Orange is involved.
Yesterday, on August 7, we held a 3-hour workshop on text mining and text analysis for a large crowd of esteemed researchers at Digital Humanities 2017 in Montreal, Canada. Surely, after 3 hours everyone was exhausted, both the audience and the lecturers. But at the same time, everyone was also excited: the audience about the possibilities Orange offers for their future projects, and the lecturers about the fantastic participants, who were already experimenting with their own data during the workshop.
The biggest challenge was presenting the inner workings of algorithms to a predominantly non-computer science crowd. Luckily, we had Tree Viewer and Nomogram to help us explain Classification Tree and Logistic Regression! Everything is much easier with visualizations.
At the end, we were experimenting with explorative data analysis, where we had Hierarchical Clustering, Corpus Viewer, Image Viewer and Geo Map opened at the same time. This is how a researcher can interactively explore the dendrogram, read the documents from selected clusters, observe the corresponding images and locate them on a map.
The workshop was a nice kick-off to an exciting week full of interesting lectures and presentations at Digital Humanities 2017 conference. So much to learn and see!
As always, we’ve been working hard to bring you new functionalities and improvements. Recently, we’ve released Orange version 3.4.5 and Orange3-Text version 0.2.5. We focused on the Text add-on since we are lately holding a lot of text mining workshops. The next one will be at Digital Humanities 2017 in Montreal, QC, Canada in a couple of days and we simply could not resist introducing some sexy new features.
First, Orange 3.4.5 offers better support for Text add-on. What do we mean by this? Now, every core Orange widget works with Text smoothly so you can mix-and-match the widgets as you like. Before, one could not pass the output of Select Columns (data table) to Preprocess Text (corpus), but now this is no longer a problem.
Of course, one still needs to keep in mind that Corpus is a sparse data format, which does not work with some widgets by design. For example, Manifold Learning supports only t-SNE projection.
Second, we’ve introduced two new widgets, which have been long overdue. One is Sentiment Analysis, which enables basic sentiment analysis of corpora. So far it works for English and uses two nltk-supported techniques – Liu Hu and Vader. Both techniques are lexicon-based. Liu Hu computes a single normalized score of sentiment in the text (negative score for negative sentiment, positive for positive, 0 is neutral), while Vader outputs scores for each category (positive, negative, neutral) and appends a total sentiment score called a compound.
Try it with Heat Map to visualize the scores.
The second widget we’ve introduced is Import Documents. This widget enables you to import your own documents into Orange and outputs a corpus on which you can perform the analysis. The widget supports .txt, .docx, .odt, .pdf and .xml files and loads an entire folder. If the folder contains subfolders, they will be considered as class values. Here’s an example.
This is the structure of my Kennedy folder. I will load the folder with Import Documents. Observe how Orange creates a class variable category with post-1962 and pre-1962 as class values.
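The folder-to-class logic can be sketched with the standard library alone. The function load_folder and the file names are hypothetical, the sketch reads only .txt files (the other formats would need extra libraries), and it is an illustration of the idea rather than the widget's actual code.

```python
import tempfile
from pathlib import Path


def load_folder(root, extensions=(".txt",)):
    """Read text files under root; the first subfolder becomes the category."""
    root = Path(root)
    corpus = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            relative = path.relative_to(root)
            # Files directly in the root have no subfolder, hence no category.
            category = relative.parts[0] if len(relative.parts) > 1 else None
            corpus.append({
                "name": path.stem,
                "category": category,
                "text": path.read_text(encoding="utf-8"),
            })
    return corpus


# Build a throwaway folder with made-up file names to try the function out.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("pre-1962", "speech_a"), ("post-1962", "speech_b")]:
        (Path(tmp) / folder).mkdir()
        (Path(tmp) / folder / f"{name}.txt").write_text("placeholder text", encoding="utf-8")
    corpus = load_folder(tmp)

categories = sorted({doc["category"] for doc in corpus})
print(categories)
```

Each document ends up with a category taken from the name of the subfolder it was found in, which mirrors the class variable described above.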
Now you can perform your analysis as usual.
Finally, some widgets have cool new updates. Topic Modelling, for example, colors words by their weights – positive weights are colored green and negative red. Coloring only works with LSI, since it’s the only method that outputs both positive and negative weights.
Take some time, explore these improvements and let us know if you are happy with the changes! You can also submit new feature requests to our issue tracker.
So what does preprocessing do? Let’s have a look at an example. Place the Corpus widget from the Text add-on on the canvas. Open it and load Grimm-tales-selected. As always, first have a quick glance at the data in Corpus Viewer. This data set contains 44 selected Grimms’ tales.
Now, let us see the most frequent words of this corpus in a Word Cloud.
Ugh, what a mess! The most frequent words in these texts are conjunctions (‘and’, ‘or’) and prepositions (‘in’, ‘of’), but that is true of almost every English text in the world. We need to remove these frequent and uninteresting words to get to the interesting part. We remove the punctuation by defining our tokens. The Regexp pattern \w+ will keep full words and omit everything else. Next, we filter out the uninteresting words with a list of stopwords. The list is pre-set by the nltk package and contains frequently occurring conjunctions, prepositions, pronouns, adverbs and so on.
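The same two steps can be sketched in plain Python. The stopword set here is a tiny hand-picked subset for illustration, not the full nltk list, and the sentence is made up.

```python
import re
from collections import Counter

# A tiny illustrative subset; the real stopword list is much longer.
STOPWORDS = {"and", "or", "in", "of", "the", "a", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, tokenize with the \\w+ pattern, then drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


text = "The king and the fox walked in the forest, and the fox laughed."
tokens = preprocess(text)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

The pattern \w+ drops the comma and the full stop, and the stopword filter removes the, and, and in, leaving only the content words to count.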
Ok, we did some essential preprocessing. Now let us observe the results.
This does look much better than before! Still, we could be a bit more precise. How about removing the words could, would, should and perhaps even said, since it doesn’t say much about the content of the tale? A custom list of stopwords would come in handy!
Open a plain text editor, such as Notepad++ or Sublime, and place each word you wish to filter on a separate line.
Save the file and load it next to the pre-set stopword list.
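For those who prefer a script, the file format is simple enough to handle by hand. In this sketch the file name custom_stopwords.txt and the small pre-set list are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, ignoring blank lines and letter case."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


# Write a throwaway stopword file, one word per line, and read it back.
with tempfile.TemporaryDirectory() as tmp:
    stopword_file = Path(tmp) / "custom_stopwords.txt"
    stopword_file.write_text("could\nwould\nshould\nsaid\n", encoding="utf-8")
    custom = load_stopwords(stopword_file)

# Hypothetical stand-in for the pre-set list; the custom words are added to it.
preset = {"and", "or", "in", "of", "the"}
all_stopwords = preset | custom

tokens = ["the", "king", "said", "he", "would", "come"]
kept = [token for token in tokens if token not in all_stopwords]
print(kept)
```

The custom words are combined with the pre-set list, so both kinds of uninteresting words are filtered in one pass.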
One final check in the Word Cloud should reveal we did a nice job preparing our data. We can now see the tales talk about kings, mothers, fathers, foxes and something that is little. Much more informative!
Yesterday was no ordinary day at the Faculty of Computer and Information Science, University of Ljubljana – there was an unusually high proportion of Social Sciences students, researchers and other professionals in our classrooms. It was all because of a Text Analysis for Social Scientists workshop.
Text mining is becoming a popular method across sciences and it was time to showcase what it (and Orange) can do. In this 5-hour hands-on workshop we explained text preprocessing, clustering, and predictive models, and applied them in the analysis of selected Grimm’s Tales. We discovered that predictive models can nicely distinguish between animal tales and tales of magic and that foxes and kings play a particularly important role in separating between the two types.
The second part of the workshop was dedicated to the analysis of tweets – we learned how to work with thousands of tweets on a personal computer, we plotted them on a map by geolocation, and used Instagram images for image clustering.
Five hours was very little time to cover all the interesting topics in text analytics. But Orange came to the rescue once again. Interactive visualization and the possibility of close reading in Corpus Viewer were such a great help! Instead of reading 6400 tweets ‘by hand’, now the workshop participants can cluster them in interesting groups, find important words in each cluster and plot them in a 2D visualization.
Here, we’d like to thank NumFocus for providing financial support for the course. This enabled us to bring in students from a wide variety of fields (linguists, geographers, marketers) and prove (once again) that you don’t have to be a computer scientist to do machine learning!
Being a political scientist, I had not even heard of data mining before I joined Biolab. And naturally, as with all good things, data mining started to grow on me. Give me some data, connect a bunch of widgets and see the magic happen!
But hold on! There are still many social scientists out there who haven’t yet heard about the wonderful world of data mining, text mining and machine learning. So I’ve made it my mission to spread the word. And that was the spirit that led me back to my former university – School of Political Sciences, University of Bologna.
University of Bologna is the oldest university in the world and has one of the best departments for political sciences in Europe. I held a lecture Digital Research – Data Mining for Political Scientists for MIREES students, who are specializing in research and studies in Central and Eastern Europe.
The main goal of the lecture was to lay out the possibilities that contemporary technology offers to researchers and to showcase a few simple text mining tasks in Orange. We analysed Trump’s and Clinton’s Twitter timeline and discovered that their tweets are highly distinct from one another and that you can easily find significant words they’re using in their tweets. Moreover, we’ve discovered that Trump is much better at social media than Clinton, creating highly likable and shareable content and inventing his own hashtags. Could that be a tell-tale sign of his recent victory?
Perhaps. Our future, data-mining savvy political scientists will decide. Below, you can see some examples of the workflows presented at the workshop.
Orange3-Text has just recently been polished, updated and enhanced! Our GSoC student Alexey has helped us greatly to achieve another milestone in Orange development and release the latest 0.2.0 version of our text mining add-on. The new release, which is already available on PyPI, includes Wikipedia and SimHash widgets and an overhaul of Bag of Words, Topic Modeling and Corpus Viewer.
Wikipedia widget allows retrieving sources from Wikipedia API and can handle multiple queries. It serves as an easy data gathering source and it’s great for exploring text mining techniques. Here we’ve simply queried Wikipedia for articles on Slovenia and Germany and displayed them in Corpus Viewer.
Similarity Hashing widget computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism or textual borrowing in the corpus. Here’s an example from Wikipedia, which has a pre-defined structure of articles, making our corpus quite similar. We’ve used Wikipedia widget and retrieved 10 articles for the query ‘Slovenia’. Then we’ve used Similarity Hashing to compute hashes for our text. What we got on the output is a table of 64 binary features (predefined in the SimHash widget), which denote a 64-bit hash size. Then we computed similarities in text by sending Similarity Hashing to Distances. Here we’ve selected cosine row distances and sent the output to Hierarchical Clustering. We can see that we have some similar documents, so we can select and inspect them in Corpus Viewer.
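For the curious, here is a minimal sketch of the general SimHash idea followed by a cosine distance on the resulting bit vectors. Using MD5 as the per-token hash and weighting every token equally are our assumptions for the example, so the widget's own implementation may differ in these details.

```python
import hashlib
import math
import re


def simhash_bits(text, n_bits=64):
    """Return the SimHash of a text as a list of n_bits binary features."""
    totals = [0] * n_bits
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: MD5 as the token hash, truncated to n_bits.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: n_bits // 8], "big")
        for i in range(n_bits):
            # Each token votes +1 or -1 on every bit position.
            totals[i] += 1 if (value >> i) & 1 else -1
    return [1 if total > 0 else 0 for total in totals]


def cosine_distance(a, b):
    """One minus the cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


first = simhash_bits("Slovenia is a country in Central Europe")
second = simhash_bits("Slovenia is a small country in Central Europe")
print(len(first), cosine_distance(first, second))
```

Because every token votes on each bit, texts that share most of their tokens tend to agree on most bits, which is what makes the distances between hashes useful for spotting near-duplicates.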
We can inspect distances between the topics with Distances (cosine) and Hierarchical Clustering. Seems like topics are not extremely author specific, since Hierarchical Clustering often puts Trump and Clinton in the same cluster. We’ve used Average linkage, but you can play around with different linkages and see if you can get better results.
Now we connect Corpus Viewer to Preprocess Text. This is nothing new, but Corpus Viewer now also displays tokens and POS tags. Enable POS Tagger in Preprocess Text. Now open Corpus Viewer and tick the checkbox Show Tokens & Tags. This will display tagged tokens at the bottom of each document.
This is just a brief overview of what one can do with the new Orange text mining functionalities. Of course, these are just exemplary workflows. If you did textual analysis with great results using any of these widgets, feel free to share it with us! 🙂