Text Preprocessing

In data mining, preprocessing is key. And in text mining, it is the key and the door. In other words, it’s the most vital step in the analysis.

Related: Text Mining add-on

So what does preprocessing do? Let’s have a look at an example. Place the Corpus widget from the Text add-on on the canvas. Open it and load Grimm-tales-selected. As always, first take a quick look at the data in Corpus Viewer. This data set contains 44 selected Grimms’ tales.

Now, let us see the most frequent words of this corpus in a Word Cloud.

Ugh, what a mess! The most frequent words in these texts are conjunctions (‘and’, ‘or’) and prepositions (‘in’, ‘of’), but that is true of almost every English text in the world. We need to remove these frequent and uninteresting words to get to the interesting part. First, we remove the punctuation by defining our tokens: the regexp \w+ keeps full words and omits everything else. Next, we filter out the uninteresting words with a list of stopwords. The list is pre-set by the nltk package and contains frequently occurring conjunctions, prepositions, pronouns, adverbs and so on.
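The two steps can also be sketched in plain Python to see what they do to a sentence. This is only an illustration, not how the widget is implemented: the tiny stopword set and the sample sentence are made up for the example (the pre-set nltk list is much longer), and lowercasing the text first is an assumption.

```python
import re
from collections import Counter

# Hypothetical mini stopword list for illustration only;
# the pre-set nltk list is much longer.
STOPWORDS = {"the", "and", "or", "in", "of", "a", "to", "was"}


def preprocess(text):
    """Tokenize with the regexp \\w+ and drop stopwords."""
    # Lowercasing is an assumption, so that 'The' and 'the' match.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# Made-up sample sentence, not taken from the corpus.
sample = "The king and the queen lived in a castle, and the king was old."
tokens = preprocess(sample)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```

Punctuation disappears because \w+ only matches runs of word characters, and the remaining counts are roughly what a word cloud would size its words by.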

OK, we did some essential preprocessing. Now let us observe the results.

This does look much better than before! Still, we could be a bit more precise. How about removing the words ‘could’, ‘would’, ‘should’ and perhaps even ‘said’, since they don’t say much about the content of the tale? A custom list of stopwords would come in handy!

Open a plain text editor, such as Notepad++ or Sublime, and place each word you wish to filter on a separate line.

Save the file and load it next to the pre-set stopword list.
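For reference, a file in this one-word-per-line format is easy to read and apply in Python as well. The sketch below is a minimal illustration under stated assumptions: the file name custom_stopwords.txt, the helper load_stopwords and the sample tokens are all hypothetical, and the widget's own loading code may differ.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


# Write a small example file; the name and location are made up.
path = os.path.join(tempfile.mkdtemp(), "custom_stopwords.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("could\nwould\nshould\nsaid\n")

custom = load_stopwords(path)

# Made-up tokens standing in for already tokenized text.
tokens = ["king", "said", "fox", "could", "run"]
filtered = [t for t in tokens if t not in custom]
print(filtered)
```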

One final check in the Word Cloud should reveal we did a nice job preparing our data. We can now see the tales talk about kings, mothers, fathers, foxes and something that is little. Much more informative!

Related: Workshop: Text Analysis for Social Scientists

  • Rui Liu

    Hi! I’m wondering how to activate the normalization function, N-gram range, etc.? (It is marked “disabled”.)

  • Tasty minerals

    Is there a way to retrieve the word counts (the weight / word table on the left) that the Word Cloud widget computes?
    Also, how can I retrieve the cleaned-up and normalized text from the “Text Preprocess” widget?

  • Ioannis Thibaios

    Hello, I loaded Election-2016-Tweets.tab and made the following flow:
    Corpus(Election-2016-Tweets.tab) –> Preprocessing –> Bag of Words
    To build a predictive model I used the ‘SVM’ model and the ‘Test-Score’ evaluation method, and then there is an error.
    With ‘Naive Bayes’ method there is no problem.
    Could this error be caused by the large number of instances to be processed(?)
    If anyone has an idea about this, please help me.
    Thanks in advance!

    • Ajda Pretnar

      I think the issue might be a large sparse matrix, but we need to take a deeper look into it. Please report the error to our issue tracker: https://github.com/biolab/orange3/issues and we’ll try to figure it out. Thanks! 🙂

  • Tasty minerals

    First, where is this Grim-tales-selected.tab dataset (not found in datasets/ dir)?
    Second, how do I create my own dataset, and what format should it have in order to be processed via orange-canvas?

    • Ajda Pretnar

      It definitely should be in the datasets dir. If it’s not, you probably need to update Orange3-Text to the latest version.
      For the second part, see my answer to Andreas.

      • Ajda Pretnar

        Also, the dataset is directly available in the Corpus widget.

      • Tasty minerals

        My mistake, I found it via “Corpus” widget. It appears that text datasets are under “.local/lib/python3.6/site-packages/orangecontrib/text/datasets”.

  • Andreas Kellerhals

    That’s fine while working with Grimms’ tales. However, I have not found any indication of how to create a corpus with my own set of texts. Some help would be appreciated.

    Andreas

    • Ajda Pretnar

      Orange works with any tab-delimited, .csv and Excel files, which you can load into the Corpus widget as you would in the File widget. However, we are preparing a new widget for text import and there will be a video tutorial along with it, so stay tuned!