Text Analysis: New Features

As always, we’ve been working hard to bring you new functionality and improvements. Recently, we’ve released Orange version 3.4.5 and Orange3-Text version 0.2.5. We focused on the Text add-on since we have been holding a lot of text mining workshops lately. The next one will be at Digital Humanities 2017 in Montreal, QC, Canada in a couple of days and we simply could not resist introducing some sexy new features.

Related: Text Preprocessing

Related: Rehaul of Text Mining Add-On

First, Orange 3.4.5 offers better support for the Text add-on. What do we mean by this? Every core Orange widget now works smoothly with Text, so you can mix and match the widgets as you like. Before, one could not pass the output of Select Columns (data table) to Preprocess Text (corpus), but this is no longer a problem.

Of course, one still needs to keep in mind that Corpus is a sparse data format, which does not work with some widgets by design. For example, Manifold Learning supports only t-SNE projection.

 

Second, we’ve introduced two new widgets, which have been long overdue. One is Sentiment Analysis, which enables basic sentiment analysis of corpora. So far it works for English and uses two NLTK-supported techniques: Liu Hu and Vader. Both techniques are lexicon-based. Liu Hu computes a single normalized sentiment score for the text (negative for negative sentiment, positive for positive, 0 for neutral), while Vader outputs a score for each category (positive, negative, neutral) and appends a total sentiment score called compound.

Liu Hu score.
Vader scores.
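To give a feeling for what "lexicon-based" means, here is a minimal sketch of a Liu Hu-style score. It is not the widget's actual code: the two tiny word lists and the function name `lexicon_score` are hypothetical, and we assume the score is simply the number of positive words minus the number of negative words, divided by the number of tokens.

```python
# Toy lexicons: hypothetical stand-ins for the much larger opinion lexicon
# that a real lexicon-based method would use.
POSITIVE = {"good", "happy", "kind", "beautiful", "clever"}
NEGATIVE = {"bad", "wicked", "cruel", "afraid", "dead"}


def lexicon_score(text):
    """Return (positive - negative) / token count, or 0.0 for empty text."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


print(lexicon_score("The kind princess was happy."))        # above zero
print(lexicon_score("The wicked wolf was cruel and bad."))  # below zero
print(lexicon_score("The fox crossed the road."))           # neutral
```

Vader works along similar lines, but its lexicon stores a strength for each word and it applies extra rules (for negation, for example), which is why it can report separate positive, negative and neutral scores.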

 

Try it with Heat Map to visualize the scores.

Yellow represents a high, positive score, while blue represents a low, negative score. It seems that Animal Tales are generally much more negative than Tales of Magic.

 

The second widget we’ve introduced is Import Documents. This widget enables you to import your own documents into Orange and outputs a corpus on which you can perform the analysis. The widget supports .txt, .docx, .odt, .pdf and .xml files and loads an entire folder. If the folder contains subfolders, they will be considered as class values. Here’s an example.

This is the structure of my Kennedy folder. I will load the folder with Import Documents. Observe how Orange creates a class variable category with post-1962 and pre-1962 as class values.

Subfolders are considered as class in the category column.
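The subfolder-to-class idea can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions, not how Import Documents is implemented: the function `load_folder` and the file names are hypothetical, and the sketch reads only .txt files, while the widget also handles the other formats listed above.

```python
from pathlib import Path
import tempfile


def load_folder(root):
    """Return (category, file name, text) for every .txt file under root."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*.txt")):
        relative = path.relative_to(root)
        # Files directly in the root have no subfolder, hence no category.
        category = relative.parts[0] if len(relative.parts) > 1 else None
        rows.append((category, path.name, path.read_text(encoding="utf-8")))
    return rows


# Build a tiny throwaway folder that mimics the Kennedy example.
# The file names and contents are made up for the illustration.
with tempfile.TemporaryDirectory() as tmp:
    for sub, name in [("pre-1962", "speech_a.txt"), ("post-1962", "speech_b.txt")]:
        folder = Path(tmp) / sub
        folder.mkdir()
        (folder / name).write_text("placeholder text", encoding="utf-8")
    corpus = load_folder(tmp)

for category, name, _ in corpus:
    print(category, name)
```

Each document ends up with the name of its top-level subfolder as its category, which mirrors the category column shown above.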

 

Now you can perform your analysis as usual.

 

Finally, some widgets have cool new updates. Topic Modelling, for example, colors words by their weights – positive weights are colored green and negative red. Coloring only works with LSI, since it’s the only method that outputs both positive and negative weights.

If there are many kings in the text and no birds, then the text belongs to Topic 2. If there are many children and no foxes, then it belongs to Topic 3.

 

Take some time, explore these improvements and let us know if you are happy with the changes! You can also submit new feature requests to our issue tracker.

 

Thank you for working with Orange! 🍊

Workshop: Text Analysis for Social Scientists

Yesterday was no ordinary day at the Faculty of Computer and Information Science, University of Ljubljana – there was an unusually high proportion of Social Sciences students, researchers and other professionals in our classrooms. It was all because of a Text Analysis for Social Scientists workshop.

Related: Data Mining for Political Scientists

Text mining is becoming a popular method across sciences and it was time to showcase what it (and Orange) can do. In this 5-hour hands-on workshop we explained text preprocessing, clustering, and predictive models, and applied them in the analysis of selected Grimm’s Tales. We discovered that predictive models can nicely distinguish between animal tales and tales of magic and that foxes and kings play a particularly important role in separating between the two types.

The nomogram displays the 6 most important words (attributes) as defined by Logistic Regression. It seems that the occurrence of the word ‘fox’ can tell us a lot about whether the text is an animal tale or a tale of magic.

Related: Nomogram

The second part of the workshop was dedicated to the analysis of tweets – we learned how to work with thousands of tweets on a personal computer, we plotted them on a map by geolocation, and used Instagram images for image clustering.

Related: Image Analytics: Clustering

Five hours was very little time to cover all the interesting topics in text analytics. But Orange came to the rescue once again. Interactive visualization and the possibility of close reading in Corpus Viewer were such a great help! Instead of reading 6400 tweets ‘by hand’, now the workshop participants can cluster them in interesting groups, find important words in each cluster and plot them in a 2D visualization.

Participants at work.

Here, we’d like to thank NumFocus for providing financial support for the course. This enabled us to bring in students from a wide variety of fields (linguists, geographers, marketers) and prove (once again) that you don’t have to be a computer scientist to do machine learning!

 

Text Mining: version 0.2.0

Orange3-Text has just been polished, updated and enhanced! Our GSoC student Alexey has helped us greatly to achieve another milestone in Orange development and release the latest 0.2.0 version of our text mining add-on. The new release, which is already available on PyPI, includes Wikipedia and SimHash widgets and a rehaul of Bag of Words, Topic Modeling and Corpus Viewer.

 

The Wikipedia widget retrieves articles through the Wikipedia API and can handle multiple queries. It serves as an easy data source and it’s great for exploring text mining techniques. Here we’ve simply queried Wikipedia for articles on Slovenia and Germany and displayed them in Corpus Viewer.

Query Wikipedia by entering your query word list in the widget. Put each query on a separate line and run Search.

 

The Similarity Hashing widget computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism or textual borrowing in the corpus. Here’s an example from Wikipedia, whose articles follow a pre-defined structure, which makes the documents in our corpus quite similar. We used the Wikipedia widget to retrieve 10 articles for the query ‘Slovenia’. Then we used Similarity Hashing to compute hashes for our text. The output is a table of 64 binary features, one for each bit of the 64-bit hash (the hash size is predefined in the SimHash widget). Then we computed similarities between the texts by sending the output of Similarity Hashing to Distances. Here we selected cosine row distances and sent the output to Hierarchical Clustering. We can see that some documents are similar, so we can select and inspect them in Corpus Viewer.

Output of Similarity Hashing widget.
We’ve selected the two most similar documents in Hierarchical Clustering and displayed them in Corpus Viewer.
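For the curious, the idea behind a similarity hash can be sketched in plain Python. This is a generic SimHash illustration under our own assumptions, not the widget's code: we hash whitespace-separated tokens with MD5, keep 64 bits, and the names `simhash_bits` and `cosine_distance` are ours. Each bit of the result is a majority vote over the corresponding bit of all token hashes, so texts that share most of their tokens tend to agree on most bits.

```python
import hashlib
import math

N_BITS = 64  # matches the 64 binary features described above


def simhash_bits(text):
    """Return a list of N_BITS 0/1 features for the text."""
    counts = [0] * N_BITS
    for token in text.lower().split():
        # Use the first 8 bytes of an MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(N_BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when more token hashes voted 1 than 0.
    return [1 if c > 0 else 0 for c in counts]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


doc_a = "Slovenia is a country in Central Europe"
doc_b = "Slovenia is a small country in Central Europe"
doc_c = "Orange is a data mining toolbox"

bits_a, bits_b, bits_c = (simhash_bits(d) for d in (doc_a, doc_b, doc_c))
print(cosine_distance(bits_a, bits_b))  # near-duplicate pair
print(cosine_distance(bits_a, bits_c))  # unrelated pair
```

The rows of 0s and 1s play the role of the table that Similarity Hashing outputs, and the distance function stands in for the cosine option in Distances.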

 

Topic Modeling now includes three modeling algorithms, namely Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Process (HDP). Let’s query Twitter for the latest tweets from Hillary Clinton and Donald Trump. First we preprocess the data and send the output to Topic Modeling. The widget suggests 10 topics, with the most significant words denoting each topic, and outputs topic probabilities for each document.

We can inspect distances between the topics with Distances (cosine) and Hierarchical Clustering. Seems like topics are not extremely author specific, since Hierarchical Clustering often puts Trump and Clinton in the same cluster. We’ve used Average linkage, but you can play around with different linkages and see if you can get better results.

Example of comparing text by topics.
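If you wonder what happens when topic probabilities go through Distances (cosine) and Hierarchical Clustering with Average linkage, here is a small sketch in plain Python. The topic probabilities are made up for the illustration, the function names are ours, and the simple merge loop is only meant to show the principle, not to reproduce Orange's implementation.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def mean_distance(pair):
        # Average linkage: mean distance over all cross-cluster pairs.
        left, right = clusters[pair[0]], clusters[pair[1]]
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in left for j in right)
        return total / (len(left) * len(right))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2), key=mean_distance)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical topic probabilities for four tweets over three topics.
topic_probabilities = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
print(average_linkage(topic_probabilities, 2))
```

Tweets that load on the same topic end up in the same cluster regardless of who wrote them, which is the effect we observed with Trump and Clinton above.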

 

Now we connect Corpus Viewer to Preprocess Text. This is nothing new, but Corpus Viewer now also displays tokens and POS tags. Enable POS Tagger in Preprocess Text. Now open Corpus Viewer and tick the checkbox Show Tokens & Tags. This will display tagged tokens at the bottom of each document.

Corpus Viewer can now display tokens and POS tags below each document.

 

This is just a brief overview of what one can do with the new Orange text mining functionalities. Of course, these are just example workflows. If you did textual analysis with great results using any of these widgets, feel free to share it with us! 🙂