Text Preprocessing

In data mining, preprocessing is key. And in text mining, it is the key and the door. In other words, it’s the most vital step in the analysis.


So what does preprocessing do? Let’s have a look at an example. Place the Corpus widget from the Text add-on on the canvas. Open it and load Grimm-tales-selected. As always, first take a quick look at the data in Corpus Viewer. This data set contains 44 selected Grimms’ tales.

Now, let us see the most frequent words of this corpus in a Word Cloud.

Ugh, what a mess! The most frequent words in these texts are conjunctions (‘and’, ‘or’) and prepositions (‘in’, ‘of’), but that is true of almost every English text in the world. We need to remove these frequent and uninteresting words to get to the interesting part, and we do this in the Preprocess Text widget. First, we remove the punctuation by defining our tokens: the regexp \w+ keeps full words and omits everything else. Next, we filter out the uninteresting words with a list of stopwords. The list is pre-set by the nltk package and contains frequently occurring conjunctions, prepositions, pronouns, adverbs and so on.
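For readers who like to see the idea outside the widget, here is a minimal Python sketch of the same two steps: tokenizing with \w+ and then dropping stopwords. It is not how Orange implements them. The tiny STOPWORDS set, the sample sentence and the lowercasing step are assumptions made for illustration; the real nltk list is much longer.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for the pre-set stopword list (assumption).
STOPWORDS = {"and", "or", "in", "of", "the", "a", "to", "was"}

def tokenize(text):
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # and whitespace never end up in a token. Lowercasing is an extra
    # assumption so that 'The' and 'the' count as the same word.
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens that are not on the stopword list.
    return [t for t in tokens if t not in stopwords]

sample = "The fox and the king went to the forest, and the fox was clever."
tokens = tokenize(sample)
filtered = remove_stopwords(tokens)
counts = Counter(filtered)

print(tokens)
print(filtered)
print(counts.most_common(3))
```

Before filtering, ‘the’ is the most frequent token in the sample; after filtering, ‘fox’ takes its place, which is the effect we are after in the Word Cloud.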

Ok, we did some essential preprocessing. Now let us observe the results.

This does look much better than before! Still, we could be a bit more precise. How about removing the words ‘could’, ‘would’, ‘should’ and perhaps even ‘said’, since they don’t say much about the content of the tale? A custom list of stopwords would come in handy!

Open a plain text editor, such as Notepad++ or Sublime, and place each word you wish to filter on a separate line.

Save the file and load it next to the pre-set stopword list.
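If you want to try the same thing in code, the sketch below reads such a one-word-per-line file and combines it with a pre-set list. The file name custom_stopwords.txt, the load_stopwords helper and the small preset set are hypothetical and used only for illustration; the widget handles all of this for you.

```python
from pathlib import Path
import tempfile

def load_stopwords(path):
    # Read one stopword per line, skipping blank lines and
    # trimming surrounding whitespace.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

# Write a small example file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    custom_file = Path(tmp) / "custom_stopwords.txt"  # hypothetical name
    custom_file.write_text("could\nwould\nshould\nsaid\n", encoding="utf-8")
    custom = load_stopwords(custom_file)

# Hypothetical stand-in for the pre-set list (assumption).
preset = {"and", "or", "in", "of", "the"}

# Both lists apply together, just like loading the file
# next to the pre-set list.
all_stopwords = preset | custom

tokens = ["the", "king", "said", "he", "would", "go"]
kept = [t for t in tokens if t not in all_stopwords]
print(kept)  # ['king', 'he', 'go']
```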

One final check in the Word Cloud should reveal we did a nice job preparing our data. We can now see the tales talk about kings, mothers, fathers, foxes and something that is little. Much more informative!


19 thoughts on “Text Preprocessing”

    1. You can save it with the Save Data widget. Make sure to connect the Word Counts output to Save Data. Then save the output as .tab or .csv, both of which you can open in a plain text editor.

  1. Is it possible to create a list of synonyms and use it in Preprocess Text? For example, car, bus and metro would all become a single token, such as vehicle.

  2. I have an issue with “Preprocess Text”; maybe some of you could help me out. I do not get any lists of stop words in the widget. There is an option to add a list manually, but there are no pre-installed lists as in this tutorial. Thank you for any advice!

  3. Is there a way to retrieve the word counts (the weight / word table on the left) that the Word Cloud widget computes?
    Also, how can I retrieve the cleaned-up and normalized text from the “Text Preprocess” widget?

    1. 1. A new output for word counts has been added recently and will be available in the new release.
      2. Neither Text Preprocess nor Word Cloud outputs modified data, because the data itself is not modified. What is modified is the text property ‘tokens’, which you can access with in_data.tokens in the Python Script widget, should you want to output it.

  4. Hello, I loaded Election-2016-Tweets.tab and built the following workflow:
    Corpus (Election-2016-Tweets.tab) –> Preprocessing –> Bag of Words
    To build a predictive model I used the ‘SVM’ model and the ‘Test-Score’ evaluation method, and then I get an error.
    With the ‘Naive Bayes’ method there is no problem.
    Perhaps this error occurs because there are a lot of instances to be processed(?)
    If anyone has an idea about this, please help me.
    Thanks in advance!

  5. First, where is this Grimm-tales-selected.tab dataset (not found in the datasets/ dir)?
    Second, how do I create my own dataset, and what format should it have in order to be processed via orange-canvas?

    1. It definitely should be in the datasets dir. If it’s not, you probably need to update Orange3-Text to the latest version.
      For the second part see my answer to Anderas.

      1. My mistake, I found it via “Corpus” widget. It appears that text datasets are under “.local/lib/python3.6/site-packages/orangecontrib/text/datasets”.

  6. That’s fine when working with Grimms’ tales. However, I have not found any indication of how to create a corpus with my own set of texts. Some help would be appreciated.


    1. Orange works with any tab-delimited, .csv and Excel files, which you can load into the Corpus widget as you would in the File widget. However, we are preparing a new widget for text import and there will be a video tutorial along with it, so stay tuned!
