Rehaul of Text Mining Add-On

Google Summer of Code is progressing nicely and some major improvements are already live! Our students have been working hard, and today we're thanking Alexey for his work on the Text Mining add-on. Two major tasks before the midterms were to introduce the Twitter widget and to rehaul Preprocess Text. The Twitter widget was designed to be a part of our summer school program, and it worked beautifully. We introduced youngsters to the world of data mining through social networks, and one of the most exciting things was to see whether we could predict the author from the tweet content.

The Twitter widget offers many functionalities. Since we wanted to get tweets from specific authors, we entered their Twitter handles as queries and set 'Search by Author'. We only included Author, Content and Date in the query parameters, as we wanted to predict the author solely on the basis of text. The numbered options below correspond to the widget's interface; a scripted sketch of a comparable query follows the list.

(Figure: the Twitter widget, with options numbered as below.)

  1. Provide API keys.
  2. Insert queries, one per line.
  3. Search by content, author or both.
  4. Set the date range (the tweepy module imposes a one-week limit).
  5. Select the language you want your tweets to be in.
  6. If ‘Max tweets’ is checked, you can set the maximum number of tweets you want to query. Otherwise the widget will provide all tweets matching the query.
  7. If ‘Accumulate results’ is checked, new queries will be appended to the old ones.
  8. Select what kind of data you want to retrieve.
  9. Tweet count.
  10. Press ‘Search’ to start your query.
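
For a peek behind the scenes: the widget retrieves tweets through the tweepy module. Below is a minimal sketch of a comparable query in tweepy 3.x; the credentials and handles are placeholders, and the widget's actual calls may differ.

    import tweepy

    # Placeholder credentials -- create your own Twitter app keys (step 1).
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    api = tweepy.API(auth)

    # One query per author handle (steps 2-3: search by author).
    for handle in ['example_handle_one', 'example_handle_two']:
        # Step 6: cap the number of tweets; Twitter's search API only
        # reaches back about a week (step 4).
        for tweet in tweepy.Cursor(api.user_timeline, screen_name=handle).items(100):
            print(tweet.author.screen_name, tweet.created_at, tweet.text)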

We got 208 tweets on the output. Not bad. Now we need to preprocess them before we make any predictions. We transformed all the words to lowercase and split (tokenized) the text into words. We didn't use any normalization (turned on in the screenshot just as an example) and applied simple stopword removal. A scripted sketch of these steps follows the list.

(Figure: the Preprocess Text widget, with options numbered as below.)

  1. Information on the input and output.
  2. Transformation applies basic modifications to the text (e.g. transforming to lowercase).
  3. Tokenization splits the corpus into tokens according to the selected method (regexp is set to extract only words by default).
  4. Normalization lemmatizes words (do, did, done –> do).
  5. Filtering extracts only desired tokens (without stopwords, including only specified words, or by frequency).
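
Outside Orange, the same pipeline can be sketched with NLTK, which the add-on builds on; the tweet text below is a made-up stand-in.

    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords          # needs nltk.download('stopwords')
    # from nltk.stem import WordNetLemmatizer  # normalization -- off in our setup

    tokenizer = RegexpTokenizer(r'\w+')        # extract only words (step 3)
    stops = set(stopwords.words('english'))    # simple stopword list (step 5)

    def preprocess(text):
        tokens = tokenizer.tokenize(text.lower())      # lowercase (2) + tokenize (3)
        return [t for t in tokens if t not in stops]   # stopword removal (5)

    # A stand-in tweet, just to show the effect.
    print(preprocess("We introduced the youngsters to data mining!"))
    # -> ['introduced', 'youngsters', 'data', 'mining']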

Then we passed the tokens through a Bag of Words and observed the results in a Word Cloud.

(Figure: word cloud of the preprocessed tweets.)
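
For intuition, Bag of Words turns the token lists into a document-term matrix of counts, and those counts are exactly the weights a word cloud draws from. A rough stand-in using scikit-learn's CountVectorizer on a made-up corpus:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["data mining is fun", "text mining is data mining"]  # made-up corpus
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(docs)       # sparse document-term count matrix

    # Total count of each word across documents -- the word cloud weights.
    totals = bow.sum(axis=0).A1
    for word, idx in sorted(vectorizer.vocabulary_.items()):
        print(word, totals[idx])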

Then we simply connected Bag of Words to Test & Score and used several classifiers to see which one works best. We chose Classification Tree and Nearest Neighbors, since they are easy to explain even to teenagers. Classification Tree in particular offers a nice visualization in Classification Tree Viewer that makes the idea behind the algorithm easy to understand. Moreover, we could observe the most distinctive words in the tree.

(Figure: the classification tree in Classification Tree Viewer.)

Do these make any sense? You be the judge. 🙂
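
In script form, the Test & Score step amounts to cross-validating the classifiers on the bag-of-words matrix, and Confusion Matrix to tabulating the misclassifications. A rough scikit-learn equivalent, with random stand-ins for the 208-tweet matrix and the author labels:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score, cross_val_predict
    from sklearn.metrics import confusion_matrix

    X = np.random.rand(208, 50)                        # stand-in for the BoW matrix
    y = np.random.choice(['author1', 'author2'], 208)  # stand-in author labels

    for name, model in [('Classification Tree', DecisionTreeClassifier()),
                        ('kNN', KNeighborsClassifier())]:
        print(name, cross_val_score(model, X, y, cv=10).mean())

    # Misclassification counts, as in the Confusion Matrix widget.
    pred = cross_val_predict(KNeighborsClassifier(), X, y, cv=10)
    print(confusion_matrix(y, pred))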

We checked the classification results in Test & Score, counted the misclassifications in Confusion Matrix and finally observed them in Corpus Viewer. k-NN seems to perform moderately well, while Classification Tree fails miserably. Still, this was trained on barely 200 tweets. Perhaps accumulating results over time would give us much better results. You can certainly try it on your own! Update your Orange3-Text add-on or install it via ‘pip install Orange3-Text’!

(Figure: the final workflow.)

Above is the final workflow: preprocessing on the left, testing and scoring at the bottom right, and construction of the classification tree at the top right.

  • Lorenzo Perone

    Hi,
    I have a table of donors in a crowdfunding campaign. I'd like to draw a sort of tag wall using the name and the amount donated as the weight. Can I use Orange for this?
    Thanks.
    Lorenzo

  • okl

    Hi,
    Sorry to ask, but the aim of this model is to predict the author on the basis of the text, am I right?

    • Ajda Pretnar

      Yes.

  • Ahmad Turmudi

    I have a problem adding the add-on Orange3-Text. The error dialog says:

    Error
    An error occurred while running a subprocess
    Command failed: python -m pip install Orange3-Text exited with non zero status.

  • intriguing

    I think the Text Mining add-on is great, but there is just one problem: if you start with an MS Word document and wish to use Orange to produce (say) a word cloud, how do you convert your Word document into a .tab file that can be read by the Corpus widget?

    • Ajda Pretnar

      One way of doing it is to copy-paste your text into Excel and read it from there; Orange essentially deals with tables, which is why Excel is needed. With Word, the question is how you would read the text. Do you want each line as a separate instance? Or would that be one paragraph? A page, or the entire text? For text mining purposes, you generally have to create a corpus, where you define what one document is for you and what its characteristics are (e.g. author, date created, language, etc.).
      All that said, we're going to think about creating a text reader widget, where you could define your fields. For now, probably the easiest way of batch transforming your textual data is with a script that outputs it as a .csv, which Orange supports (a sketch below).
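      Purely as an illustration (the paths and column names are invented for the example), such a batch script could look like this:

        import csv, glob, os

        with open('corpus.csv', 'w', newline='', encoding='utf-8') as out:
            writer = csv.writer(out)
            writer.writerow(['name', 'text'])          # one row per document
            for path in glob.glob('documents/*.txt'):  # each file = one document
                with open(path, encoding='utf-8') as f:
                    writer.writerow([os.path.basename(path), f.read()])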

  • Rodrigo

    Ajda, for the definition files for Stop Words and Lexicon, what is the format (fields, structure, etc.) necessary for Orange to consume them? Would you be able to make a sample file available, please?
    Kindest Regards,
    Rodrigo

    • Ajda Pretnar

      It is a simple plain-text file (.txt), with each word on a separate line. We're about to add documentation to the add-on, and the format will be explained there. Best, Ajda
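      For illustration, such a stopword file is nothing more than one word per line, e.g. (arbitrary example words):

        the
        and
        rt
        via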

      • Rodrigo

        Hi Ajda, would the Lexicon file format also be plain text, with the original word and its synonym? Or the other way around?
        Cheers,
        Rodrigo

        • Ajda Pretnar

          Hi, the lexicon filters positively, so it only outputs tokens that match the tokens in the lexicon. You simply list your one-grams in a .txt file, one token per line. We are still working on n-gram matching, so please bear with us.
          Thanks!
          Ajda

          • Rodrigo

            Thanks Ajda, that's plenty already! I'm already working with it and I am VERY impressed. Great job!
            Cheers,