Text Analysis: New Features

As always, we’ve been working hard to bring you new functionalities and improvements. Recently, we’ve released Orange version 3.4.5 and Orange3-Text version 0.2.5. We focused on the Text add-on since we are lately holding a lot of text mining workshops. The next one will be at Digital Humanities 2017 in Montreal, QC, Canada in a couple of days and we simply could not¬†resist introducing some sexy new features.

Related: Text Preprocessing

Related: Rehaul of Text Mining Add-On

First, Orange 3.4.5 offers better support for Text add-on. What do we mean by this? Now, every core Orange widget works with Text smoothly so you can mix-and-match the widgets as you like. Before, one could not pass the output of Select Columns (data table) to Preprocess Text (corpus), but now this is no longer a problem.

Of course, one still needs to keep in mind that Corpus is a sparse data format, which does not work with some widgets by design. For example, Manifold Learning supports only t-SNE projection.

 

Second, we’ve introduced two new widgets, which have been long overdue. One is Sentiment Analysis, which enables basic sentiment analysis of corpora. So far it works for English and uses two nltk-supported techniques – Liu Hu and Vader. Both techniques are lexicon-based. Liu Hu computes a single normalized score of sentiment in the text (negative score for negative sentiment, positive for positive, 0 is neutral), while Vader outputs scores for each category (positive, negative, neutral) and appends a total sentiment score called a compound.

Liu Hu score.
Vader scores.

 

Try it with Heat Map to visualize the scores.

Yellow represent a high, positive score, while blue represent a low, negative score. Seems like Animal Tales are generally much more negative than Tales of Magic.

 

The second widget we’ve introduced is Import Documents. This widget enables you to import your own documents into Orange and outputs a corpus on which you can perform the analysis. The widget supports .txt, .docx, .odt, .pdf and .xml files and loads an entire folder. If the folder contains subfolders, they will be considered as class values. Here’s an example.

This is the structure of my Kennedy folder. I will load the folder with Import Documents. Observe, how Orange creates a class variable category with post-1962 and pre-1962 as class values.

Subfolders are considered as class in the category column.

 

Now you can perform your analysis as usual.

 

Finally, some widgets have cool new updates. Topic Modelling, for example, colors words by their weights – positive weights are colored green and negative red. Coloring only works with LSI, since it’s the only method that outputs both positive and negative weights.

If there are many kings in the text and no birds, then the text belongs to Topic 2. If there are many children and no foxes, then it belongs to Topic 3.

 

Take some time, explore these improvements and let us know if you are happy with the changes! You can also submit new feature requests to our issue tracker.

 

Thank you for working with Orange! ūüćä

Data Mining for Political Scientists

Being a political scientist, I did not even hear about data mining before I’ve joined Biolab. And naturally, as with all good things, data mining started to grow on me. Give me some data, connect a bunch of widgets and see the magic happen!

But hold on! There are still many social scientists out there who haven’t yet heard about the wonderful world of data mining, text mining and machine learning. So I’ve made it my mission to spread the word. And that was the spirit that led me back to my former university – School of Political Sciences, University¬†of Bologna.

University of Bologna is the oldest university in the world and has one of the best departments for political sciences in Europe. I held a lecture Digital Research РData Mining for Political Scientists for MIREES students, who are specializing in research and studies in Central and Eastern Europe.

Lecture at University of Bologna
Lecture at University of Bologna

The main goal of the lecture was to lay out the possibilities that contemporary technology offers to researchers and to showcase a few simple text mining tasks in Orange. We analysed Trump’s and Clinton’s Twitter timeline and discovered that their tweets are highly distinct from one another and that you can easily find significant words they’re using in their tweets. Moreover, we’ve discovered that Trump is much better at social media than Clinton, creating highly likable and shareable content and inventing his own hashtags. Could that be a tell-tale sign of his recent victory?

Perhaps. Our future, data-mining savvy political scientists will decide. Below, you can see some examples of the workflows presented at the workshop.

bologna-workflow1
Author predictions from Tweet content. Logistic Regression reports on 92% classification accuracy and AUC score. Confusion Matrix can output misclassified tweets to Corpus Viewer, where we can inspect these tweets further.

 

bologna-wordcloud
Word Cloud from preprocessed tweets. We removed stopwords and punctuation to find frequencies for meaningful words only.

 

bologna-enrichment
Word Enrichment by Author. First we find Donald’s tweets with Select Rows and then compare them to the entire corpus in Word Enrichment. The widget outputs a ranked list of significant words for the provided subset. We do the same for Hillary’s tweets.

 

bologna-topicmodelling
Finding potential topics with LDA.

 

bologna-emotions
Finally, we offered a sneak peek of our recent Tweet Profiler widget. Tweet Profiler is intended for sentiment analysis of tweets and can output classes. probabilities and embeddings. The widget is not yet officially available, but will be included in the upcoming release.