Text Analysis Workshop at Digital Humanities 2017

How do you explain text mining in 3 hours? Is it even possible? Can someone be ready to build predictive models and perform clustering in a single afternoon?

It seems so, especially when Orange is involved.

Yesterday, on August 7, we held a 3-hour workshop on text mining and text analysis for a large crowd of esteemed researchers at Digital Humanities 2017 in Montreal, Canada. After 3 hours, everyone was exhausted, both the audience and the lecturers. But at the same time, everyone was also excited: the audience about the possibilities Orange offers for their future projects, and the lecturers about the fantastic participants, who were already experimenting with their own data during the workshop.

The biggest challenge was presenting the inner workings of algorithms to a predominantly non-computer-science crowd. Luckily, we had Tree Viewer and Nomogram to help us explain Classification Tree and Logistic Regression! Everything is much easier with visualizations.

 

Classification Tree splits first by the word ‘came’, since it results in the purest split. Next it splits by ‘strange’. Since we still don’t have pure nodes, it continues to ‘bench’, which gives a satisfying result. Trees are easy to explain, but can quickly overfit the data.
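
To make "purest split" concrete, here is a minimal sketch in plain Python. It is not Orange's implementation: it assumes Gini impurity as the purity measure (Orange may use a different criterion), and the toy corpus, the class name "Other" and all function names are made up for illustration. Each candidate word splits the documents into "contains the word" and "does not contain it", and the word with the lowest weighted impurity wins.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(docs, labels, word):
    """Weighted Gini impurity after splitting on the presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    total = len(labels)
    return (len(with_word) / total) * gini(with_word) + (
        len(without_word) / total
    ) * gini(without_word)


def best_split(docs, labels):
    """Return the word whose presence/absence split is the purest."""
    vocabulary = sorted(set().union(*docs))
    return min(vocabulary, key=lambda w: split_impurity(docs, labels, w))


# Hypothetical toy corpus: each document is reduced to a set of words.
docs = [
    {"fox", "came", "forest"},
    {"wolf", "came", "bench"},
    {"king", "strange", "castle"},
    {"princess", "strange", "bench"},
]
labels = ["Animal Tales", "Animal Tales", "Other", "Other"]

best_word = best_split(docs, labels)
print(best_word, split_impurity(docs, labels, best_word))
```

In this toy data, 'bench' appears in one document of each class, so splitting on it leaves both groups mixed. A real tree repeats the same search inside each impure node, which is also why it can overfit: with enough words, it can always find some split that separates the training documents.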

 

Logistic Regression transforms word counts to points. The sum of points directly corresponds to class probability. Here, if you see 29 foxes in a text, you get a high probability of Animal Tales. If you don’t see any, then you get a high probability of the opposite class.
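
The idea of "points" can be sketched in a few lines of Python. This is an illustration under assumptions, not what Nomogram computes internally: the weight for 'fox' and the intercept are invented numbers, and the function name is hypothetical. Each word contributes its weight times its count, the contributions are summed, and the logistic function maps the sum to a probability.

```python
import math


def class_probability(word_counts, weights, intercept=0.0):
    """Sum per-word points (weight * count) and map the sum to a probability."""
    points = intercept + sum(
        weights.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-points))


# Hypothetical coefficients, chosen only to illustrate the mechanism.
weights = {"fox": 0.3}
intercept = -2.0

p_many = class_probability({"fox": 29}, weights, intercept)
p_none = class_probability({"fox": 0}, weights, intercept)
print(f"29 foxes: {p_many:.3f}, no foxes: {p_none:.3f}")
```

With these made-up numbers, many foxes push the probability of Animal Tales close to 1, while a text without any fox falls back to the intercept, which here favors the opposite class.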

 

At the end, we experimented with exploratory data analysis, where we had Hierarchical Clustering, Corpus Viewer, Image Viewer and Geo Map open at the same time. This way, a researcher can interactively explore the dendrogram, read the documents from selected clusters, observe the corresponding images and locate them on a map.

Hierarchical Clustering, Image Viewer, Geo Map and Corpus Viewer opened at the same time create an interactive data browser.

 

The workshop was a nice kick-off to an exciting week full of interesting lectures and presentations at the Digital Humanities 2017 conference. So much to learn and see!

 

 

  • Helal Mobasher

    Dear Orange Team,
    I am greatly enjoying learning and using Orange to analyze my company’s healthcare data. Here are a couple of questions regarding text analysis. I am currently performing text analysis on medical notes and need to analyze certain words and phrases in thousands of records. The main data are in Excel, and only a couple of columns include medical and clinical notes/summaries. Here are my two questions: 1) What is the process for getting the Excel file into text processing, and how do I filter for only those two columns? Keep in mind that ultimately I need to go back to the entire data set and do further analysis. I learned through a few videos that I need to insert Corpus and link the data first, followed by text preprocessing, then create a list of stop words to exclude and do further analysis. This leads me to my second question, on stop words. 2) The medical notes column includes thousands of words and I only want to look at a few (15 to 20 max). Rather than creating a huge stop word list, can I create a list of these 15 to 20 words, do preprocessing on this list, and exclude all others? In other words, I just need to look for delusion, hallucination, catatonic, etc. Can I just make a list of these few and start preprocessing?

    Thank you so much for your help with the questions.

    • Ajda Pretnar

      Dear Helal, as for the first question, you could use File to load the Excel table (and edit columns), then use Select Columns if you already have some attributes in your table that you wish to keep, and finally pass the data to Corpus, where you’d define the text column(s). Like this:
      https://uploads.disquscdn.com/images/e157bc00a91f35f0fa59c6a73e4a661c5c4117269d43481f2905b62c3c0a76c1.png
      As for the second question, in Preprocess Text there’s an option called Lexicon that does exactly what you want: it keeps only the words from the dictionary rather than removing them. Keep in mind you have to define the words in a plain text editor, one word per line, and save them as a .txt file.
      If you have any other questions, please write to info@biolab.si.
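
What keeping only lexicon words amounts to can be sketched in plain Python. This is not Orange's API, just an illustration of the filtering step: the function names, the simple letters-only tokenizer and the example note are assumptions, while the three lexicon words come from the question above. The lexicon is read as one word per line, as it would be in the .txt file.

```python
import re


def load_lexicon(text):
    """Parse lexicon text with one word per line; blank lines are ignored."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def keep_lexicon_words(note, lexicon):
    """Lowercase the note, split it into words, and keep only lexicon words."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return [token for token in tokens if token in lexicon]


# Contents of a hypothetical lexicon file, one word per line.
lexicon = load_lexicon("delusion\nhallucination\ncatatonic\n")

# Made-up example note.
note = "Client reports a hallucination; no delusion observed. Not catatonic."
kept = keep_lexicon_words(note, lexicon)
print(kept)
```

Note that exact matching keeps 'delusion' but not 'delusional', so related word forms either need their own lexicon entries or a normalization step before filtering.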

      • Helal Mobasher

        Thank you so much, Ajda, for the quick reply. I will let you know if I have further questions. This app is great.

        • Helal Mobasher

          Ajda, another question on text analysis is related to the context in which terms or phrases are used. Let me explain: if I am looking for the word(s)/term(s) delusion or delusional in a medical note, many times the clinicians state these words on behalf of their clients. For example, clinician X indicates in the note that client Y claims that he/she is delusional. This is in contrast to clinician X stating that his/her diagnosis of client Y is delusional. Simply put, context matters, not the frequency of the words itself. My question is how to account for the context. I need to be able to distinguish between the two scenarios. Any thoughts?
          Thank you,
          Helal

          • Ajda Pretnar

            Dear Helal, there is no easy way to do this in Orange. You could try feature construction, using Python and regular expressions to find the two different meanings. You could also use a lexicon in preprocessing, even though this might be even more cumbersome. Otherwise, you would need embeddings, which we are in the process of porting to Orange (but we can’t promise when they will be in).
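
A rough sketch of the regular-expression idea, using only Python's standard `re` module. The cue words (claims, states, reports, says, diagnosis, assessment) and the labels are assumptions chosen for illustration; real clinical notes would need a much richer set of patterns, and simple rules like these will misclassify many sentences.

```python
import re

# The term of interest and its simple variants (delusion, delusions, delusional).
TERM = r"delusion(?:al|s)?"

# Hypothetical cue: a reporting verb earlier in the same sentence.
REPORTED = re.compile(
    r"\b(?:claims?|states?|reports?|says?)\b[^.]*\b" + TERM + r"\b",
    re.IGNORECASE,
)
# Hypothetical cue: a diagnosis/assessment word earlier in the same sentence.
DIAGNOSED = re.compile(
    r"\b(?:diagnos\w*|assess\w*)\b[^.]*\b" + TERM + r"\b",
    re.IGNORECASE,
)
ANY_MENTION = re.compile(r"\b" + TERM + r"\b", re.IGNORECASE)


def mention_context(sentence):
    """Label how a sentence mentions the term, based on the cue patterns."""
    if REPORTED.search(sentence):
        return "client-reported"
    if DIAGNOSED.search(sentence):
        return "clinician-assessed"
    if ANY_MENTION.search(sentence):
        return "unclassified"
    return "no mention"


# Made-up example sentences.
examples = [
    "Client claims that he is delusional.",
    "My diagnosis of the client is delusional disorder.",
    "Delusional thinking noted.",
]
for sentence in examples:
    print(mention_context(sentence), "<-", sentence)
```

The resulting label could be added as a new column next to each note, so that the two scenarios can be counted or filtered separately.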

  • Zeeshan Xafar

    Hi Orange Team,
    Whenever I try to load a big dataset in excel format Orange stops responding.
    https://uploads.disquscdn.com/images/5e403f8e819b91b05b97b23deb9e70b48a012f166f27749cc53a4ff843c6bc3e.png

    Could you please give me a solution to this problem?
    Thanks
    Zeeshan Zafar

    • Ajda Pretnar

      Dear Zeeshan,
      There’s no solution other than upgrading your RAM. Orange can only handle so much data.

      • Zeeshan Xafar

        Okay Thanks 🙂

  • Darius

    Dear Orange team,

    I would like to ask, do you have any videos recorded from this or any other workshop on text mining with Orange?

    cheers
    Darius

    • Ajda Pretnar

      Dear Darius,
      Unfortunately we did not have any resources to record the workshop. You can have a look at our YT channel, where we have some basic text analytics tutorials: https://www.youtube.com/playlist?list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-8Fy&disable_polymer=true

      • Darius

        Dear Ajda,

        Thank you very much for the reply. Yep, I watched them all 🙂 It would be great to see a longer video or just slides, in order to get a better grasp of what Orange can do on a bigger scale. Do you perhaps have a list of upcoming workshops on text analysis in Europe?