Text Mining: version 0.2.0

Orange3-Text has just been polished, updated and enhanced! Our GSoC student Alexey has helped us greatly in reaching another milestone in Orange development: the release of version 0.2.0 of our text mining add-on. The new release, which is already available on PyPI, includes the new Wikipedia and SimHash widgets and an overhaul of Bag of Words, Topic Modeling and Corpus Viewer.

 

The Wikipedia widget retrieves articles through the Wikipedia API and can handle multiple queries. It is an easy way to gather data and great for exploring text mining techniques. Here we’ve simply queried Wikipedia for articles on Slovenia and Germany and displayed them in Corpus Viewer.

Query Wikipedia by entering your query word list in the widget. Put each query on a separate line and run Search.
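
For readers curious what a multi-query lookup involves outside the widget, here is a minimal sketch. It only builds request URLs for the public MediaWiki search endpoint and sends nothing over the network. The helper names `parse_queries` and `search_url` are made up for this example, and this is not necessarily how the widget itself talks to Wikipedia.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def parse_queries(text):
    """Split the query box contents into one query per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def search_url(query, limit=10):
    """Build a MediaWiki full-text search URL for a single query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Two queries, one per line, as in the example above.
queries = parse_queries("Slovenia\nGermany\n")
urls = [search_url(q) for q in queries]
for url in urls:
    print(url)
```

Each URL could then be fetched with any HTTP client; the results come back as JSON.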

 

The Similarity Hashing widget computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism or textual borrowing. Here’s an example from Wikipedia, whose articles follow a pre-defined structure, which makes the documents in our corpus quite similar. We’ve used the Wikipedia widget to retrieve 10 articles for the query ‘Slovenia’ and then used Similarity Hashing to compute hashes for the text. The output is a table of 64 binary features, one per bit of the 64-bit hash (the hash size is predefined in the SimHash widget). We then computed similarities between the texts by connecting Similarity Hashing to Distances, where we selected cosine distance between rows, and sent the output to Hierarchical Clustering. We can see that some documents are similar, so we can select them and inspect them in Corpus Viewer.

Output of Similarity Hashing widget.
We’ve selected the two most similar documents in Hierarchical Clustering and displayed them in Corpus Viewer.
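
To give a feel for where the 64 binary features come from, here is a rough pure-Python sketch of the general SimHash idea. It is an illustration under assumptions, not the widget's code: the per-token hash (first 8 bytes of MD5), the whitespace tokenization and the names `token_hash`, `simhash_bits` and `cosine_distance` are all choices made for this example.

```python
import hashlib
import math

HASH_BITS = 64


def token_hash(token):
    """Stable 64-bit integer hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_bits(tokens):
    """Return the SimHash of a token list as a list of 64 bits (0 or 1)."""
    votes = [0] * HASH_BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(HASH_BITS):
            # Each token votes +1 or -1 on every bit position.
            votes[i] += 1 if (h >> i) & 1 else -1
    return [1 if v > 0 else 0 for v in votes]


def cosine_distance(a, b):
    """Cosine distance between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


doc_a = "slovenia is a country in central europe".split()
doc_b = "slovenia is a small country in central europe".split()
bits_a = simhash_bits(doc_a)
bits_b = simhash_bits(doc_b)
print(cosine_distance(bits_a, bits_b))
```

Documents that share most of their tokens tend to agree on most bits, which is why a distance over these bit vectors can point to near-duplicates.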

 

Topic Modeling now includes three modeling algorithms: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Process (HDP). Let’s query Twitter for the latest tweets from Hillary Clinton and Donald Trump. First we preprocess the data and send the output to Topic Modeling. The widget suggests 10 topics, with the most significant words denoting each topic, and outputs topic probabilities for each document.

We can inspect distances between the topics with Distances (cosine) and Hierarchical Clustering. It seems that topics are not strongly author-specific, since Hierarchical Clustering often puts Trump and Clinton in the same cluster. We’ve used Average linkage, but you can play around with different linkages and see if you can get better results.

Example of comparing text by topics.
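
The combination of cosine distances and average linkage can be sketched in a few lines of plain Python. The topic probabilities below are invented for illustration, and `average_linkage` is a naive, hypothetical helper written for this post; it is not what the Hierarchical Clustering widget runs internally.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(vectors):
    """Naive agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""

    def linkage(pair):
        # Average linkage: mean distance over all cross-cluster pairs.
        a, b = pair
        dists = [cosine_distance(vectors[i], vectors[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up topic probabilities for four documents and three topics.
topic_probabilities = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
]
merges = average_linkage(topic_probabilities)
for a, b, dist in merges:
    print(a, b, round(dist, 3))
```

Swapping the mean in `linkage` for `min` or `max` would give single or complete linkage, which is one way to experiment with the different linkages mentioned above.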

 

Now we connect Corpus Viewer to Preprocess Text. This is nothing new, but Corpus Viewer now also displays tokens and POS tags. Enable POS Tagger in Preprocess Text. Now open Corpus Viewer and tick the checkbox Show Tokens & Tags. This will display the tagged tokens at the bottom of each document.

Corpus Viewer can now display tokens and POS tags below each document.

 

This is just a brief overview of what one can do with the new Orange text mining functionalities. Of course, these are just example workflows. If you did textual analysis with great results using any of these widgets, feel free to share them with us! 🙂

  • Tom Novak

    I have a question about the Similarity Hashing widget. Is the analysis based upon patterns of individual single characters (including blanks) in the text string, or is the analysis based upon patterns of words/tokens in the text string as defined by the preprocessing widget? Thanks!

    • Ajda Pretnar

      SimHash takes tokens and computes hashes based on the provided shingle length. So yes, it does take (preprocessed) tokens into account. If tokens have not yet been created, it runs the default WordPunct tokenizer from nltk.
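
As a rough illustration of the two steps mentioned here, the sketch below tokenizes text with the regular expression that NLTK's WordPunctTokenizer is built on and then groups tokens into shingles. It assumes that a shingle is a run of consecutive tokens; the helper names `tokenize` and `shingles` are made up for this example and the widget's internals may differ.

```python
import re

# The pattern behind NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into alphanumeric and punctuation tokens."""
    return WORD_PUNCT.findall(text)


def shingles(tokens, length):
    """All overlapping runs of `length` consecutive tokens."""
    return [tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)]


tokens = tokenize("Hello, world!")
pairs = shingles(tokens, 2)
print(tokens)
print(pairs)
```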

  • Tom Novak

    A question about Orange 3 Text Mining. Should the following two analyses produce the same results? In both cases I begin by preprocessing text to select the 100 most frequent tokens from 20,000 short documents.

    Analysis 1: Topic Modelling Widget (LDA 20 topics)

    Analysis 2: Bag of Words Widget (Term Frequency = Count, Document Frequency = None, Regularization = None) and then Topic Modeling Widget (LDA 20 topics)

    The two analyses are producing different results. The topic keywords for the 20 topics are somewhat similar but not the same, and the 20 weights for each document are different as well.

    Is there a way to select settings for the Bag of Words options that will produce identical topic model results to an analysis that does not begin with using Bag of Words?

    • Ajda Pretnar

      Dear Tom,
      Gensim’s LDA is not deterministic: it starts from a random seed and then generates topics, so I also get different results each time I run LDA.
      It’s a Gensim thing.
      So yeah, it has nothing to do with BoW. 🙂
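
A toy illustration of the point about seeds, using only the standard library: any procedure that starts from a random initialisation gives different results from run to run unless the seed is fixed. The function `random_topic_init` is a made-up stand-in for a model's initialisation step and is not Gensim code. When scripting with Gensim directly, `LdaModel` accepts a `random_state` argument for the same purpose.

```python
import random


def random_topic_init(documents, num_topics, seed=None):
    """Stand-in for a random initialisation: give every token a random topic."""
    rng = random.Random(seed)
    return [[rng.randrange(num_topics) for _ in doc] for doc in documents]


docs = [["tax", "jobs", "wall"], ["health", "jobs", "vote"]]

# The same seed reproduces the same starting point.
run_a = random_topic_init(docs, num_topics=20, seed=1)
run_b = random_topic_init(docs, num_topics=20, seed=1)
# Without a seed, the starting point changes from run to run.
run_c = random_topic_init(docs, num_topics=20)

print(run_a == run_b)
print(run_c)
```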

  • Tom Novak

    Hi, when I use the Topic Modelling widget (LDA) and then the Save Data widget, the data is saved in .pickle format. Is there a way to save the data in .csv format instead?

    When I use Corpus Viewer to view the results of the Topic Modelling widget, I can see the weights for each of the 20 topics I have generated, for each of the 20,000 rows of my data. I am trying to produce a 20,000 row x 20 topic column .csv file of these weights, but I get a .pickle file instead. Is there a way to get a .csv file using Orange widgets?

    • Ajda Pretnar

      Dear Tom, Orange does not export to .csv. It can export to tab-delimited files (.tab), but not for sparse data, which is the case in Text.

      • Tom Novak

        Tab delimited might be nice to have as an option, even for sparse data. Does anyone have a script to convert the unpickled sparse data to a standard flat file?

        • Ajda Pretnar

          You can pass the data to Python Script widget and use:
          out_data = in_data.copy()
          out_data.X = out_data.X.toarray()
          Then you should be able to save the output in any format available in Orange.
          (Be careful to use the right input-output name though! out_data != out_object)
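
For anyone doing the conversion outside Orange, the idea can be sketched with the standard library alone. The example assumes, purely for illustration, that each sparse row is stored as a `{column_index: value}` dictionary; `to_dense` and `write_table` are hypothetical helpers, not Orange functions.

```python
import csv
import io


def to_dense(sparse_rows, n_columns):
    """Expand rows stored as {column_index: value} into full lists, filling zeros."""
    return [[row.get(j, 0.0) for j in range(n_columns)] for row in sparse_rows]


def write_table(rows, header, delimiter="\t"):
    """Serialise rows as delimited text and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Two documents, three topics; missing entries are implicit zeros.
sparse_rows = [{0: 0.9, 2: 0.1}, {1: 1.0}]
dense = to_dense(sparse_rows, n_columns=3)
text = write_table(dense, ["Topic 1", "Topic 2", "Topic 3"], delimiter=",")
print(text)
```

Leaving `delimiter` at its default gives a tab-delimited table instead of a comma-separated one.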

          • Tom Novak

            Thanks Ajda – your suggestion works perfectly and gives me exactly what I need!

  • Bunyamin Ozaydin

    Hello,
    Is there a way to manually introduce sets of synonyms in text mining? What I am looking for is a straightforward way to tell Orange what the synonyms of some content-specific concepts are, the way I can easily set the stop words or lexicon.
    Thanks in advance for your guidance.

    • Ajda Pretnar

      Dear Bunyamin,
      Unfortunately, Orange doesn’t have this functionality yet, but it is a great idea! If you know any coding, we’d be happy to review and accept your pull request!

      • Bunyamin Ozaydin

        Thanks for the reply Ajda. Unfortunately, I do not have much coding experience to lead this effort.

  • Saahil Agrrwal

    Hello, I am working on Text analysis, where my project is to analyze news broadcast on specific topics and then to interpret results. Please guide me.

  • José Mora

    Hello, is there a way to predict topics once I have trained the LDA model?

    • Ajda Pretnar

      I’m not sure I understand. Topic Modeling already predicts topics with LDA. You can always bring in new data to predict topic probabilities (connect Topic Modeling with Data Table). Or did you want to predict discrete topics?

      • José (http://www.observatoriofiscal.cl)

        First of all, thanks for the quick reply! That’s right: once I have found the topics in a set of documents, I want to use them as labels to predict whether any of the topics I found previously is present in new documents.
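
One simple way to turn topic probabilities into discrete labels, sketched here under the assumption that each document should get exactly one label, is to take the most probable topic. The probabilities are invented and `dominant_topic` is a hypothetical helper, not an Orange function.

```python
def dominant_topic(probabilities):
    """Index of the most probable topic for one document."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Made-up topic probabilities for two documents and three topics.
doc_topics = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [dominant_topic(p) for p in doc_topics]
print(labels)
```

Labels obtained this way could then serve as the class variable when training a classifier on the original documents.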

  • Francois

    Tried to do a word cloud using the Twitter widget, but all the word counts come out as 1, so the word cloud never scales words. Is there a way to concatenate the Twitter messages into a single corpus instead of having them as separate records?

    • Ajda Pretnar

      So the problem no longer persists? I ask because Word Cloud works great for me (scaling included). Also, all inputs are treated as one corpus, even though the separate records are displayed as documents. I hope everything works OK for you now. 🙂

  • Metodolog (http://www.metodolog.pl)

    I have the same problem installing the Text widget now.

  • joe

    Anaconda 4.1, Python 3.5, 64bit, W8.1

    My installation failed part way through with the message:
    Error running subprocess
    Command failed: python -m pip install
    Orange 3 exited with non zero status.

    error: Unable to find vcvarsall.bat

    The install seems to be trying to install a surprisingly large number of modules

    log at:

    https://www.dropbox.com/s/yfby9rb7bvgr63u/Document.rtf?dl=0

    • Ajda Pretnar

      The issue is compiling some C libraries that the Text add-on requires. It’s mainly a Windows thing. What you need to do is run the command: >>python setup.py build_ext -i --compiler=msvc install<<. You also need a Visual Studio compiler, which you can get here: http://landinghub.visualstudio.com/visual-cpp-build-tools.
      This should work. Btw, who wrote this terrible documentation for Text??? 🙂

      • joe

        Thank you.
        I installed VS C++ standalone v 14 for Python 3.5.

        When I tried to run setup.py from the python prompt it did not recognise setup.py.
        My Anaconda 4.1.1 seems to have some problems finding things if you have had older versions of Python3 and Python 2.7 installed. I do not know if that is the problem here.

        Should there be "<<" after "msvc install"?

    • astaric

      The installation fails when it tries to compile the biopython module. You can try manually installing it with anaconda by running

      conda install biopython

      in the environment you have Orange installed in. After it completes, you should try installing the add-on again.

      • joe

        Thank you – this installed OK once I had installed biopython.

        There are just 2 add-ons that still fail – spark and infrared.
        Do I need to install something else first again?

        ERROR MESSAGES:

        Collecting Orange3-spark
        Using cached Orange3-spark-0.2.6.zip
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "C:\Users\ROBERT~1\AppData\Local\Temp\pip-build-00_yromb\Orange3-spark\setup.py", line 41, in <module>
        LONG_DESCRIPTION = open('README.md').read()
        FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

        ----------------------------------------
        Command "python setup.py egg_info" failed with error code 1 in C:\Users\ROBERT~1\AppData\Local\Temp\pip-build-00_yromb\Orange3-spark

        Collecting Orange-Infrared
        Using cached Orange-Infrared-0.0.6.tar.gz
        Requirement already satisfied (use --upgrade to upgrade): Orange3 in c:\anaconda3\lib\site-packages (from Orange-Infrared)
        Requirement already satisfied (use --upgrade to upgrade): scipy>=0.14.0 in c:\anaconda3\lib\site-packages (from Orange-Infrared)
        Collecting spectral>=0.18 (from Orange-Infrared)
        Using cached spectral-0.18.zip
        Collecting opusFC>=1.0.0b1 (from Orange-Infrared)
        Could not find a version that satisfies the requirement opusFC>=1.0.0b1 (from Orange-Infrared) (from versions: )
        No matching distribution found for opusFC>=1.0.0b1 (from Orange-Infrared)

    • Appreciate_Orange

      Just install biopython from here: http://biopython.org/wiki/Download
      Then, run the installer for the text module again.

      Had the same problem with vcvarsall.bat…

  • maa tiger

    Please help: I can’t set up text mining in Orange (2.7 or 3.3). Is there a correct way to install the text mining add-on? I have spent more than 7 hours searching for ways and trying to install it, but unfortunately I couldn’t do it.

    • Ajda Pretnar

      This is the tutorial for 3.3, so these are your options. First, you can open Orange, go to Options – Add-ons and install from there. It should work out of the box. If not, I suggest you uninstall and re-install Orange entirely.
      Another option is to open the terminal, activate your Orange virtual environment and type: pip install Orange3-Text.
      The third option is to clone the repository from GitHub (https://github.com/biolab/orange3/) and follow the instructions from there.
      As a last resort, write to us using the contact form in the footer and we’ll try to help you.
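
Whichever route is used, a quick way to check whether the add-on ended up in the active Python environment is to ask Python whether it can locate the package. This sketch assumes the add-on is importable as `orangecontrib.text`; `addon_installed` is a helper name invented for this example.

```python
import importlib.util


def addon_installed(module_name="orangecontrib.text"):
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (here: orangecontrib) is missing.
        return False


print("Text add-on found:", addon_installed())
```

Run it with the same Python interpreter that Orange uses; a result of False from a different environment says nothing about the Orange installation.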

      • SueB

        Hi Ajda. I tried pip install but got an error message: ‘running setup.py install for bottleneck ... error’, saying it needs Visual C++ 2014 Build Tools. I have VS 2015 installed, so it does not let me download them. Any suggestions? Thanks, SueB

        • SueB

          Follow-up: please disregard this message. I was able to fix it. A Conda Orange3 update added/fixed the bottleneck:1.20-np111py35_0 conda forge. This enabled a subsequent pip install of orange3-text without biopython. Visual C++ Build Tools were not needed. The subsequent Text add-on installation was successful.