Data Mining for Political Scientists

Being a political scientist, I did not even hear about data mining before I’ve joined Biolab. And naturally, as with all good things, data mining started to grow on me. Give me some data, connect a bunch of widgets and see the magic happen!

But hold on! There are still many social scientists out there who haven’t yet heard about the wonderful world of data mining, text mining and machine learning. So I’ve made it my mission to spread the word. And that was the spirit that led me back to my former university – School of Political Sciences, University of Bologna.

University of Bologna is the oldest university in the world and has one of the best departments for political sciences in Europe. I held a lecture Digital Research – Data Mining for Political Scientists for MIREES students, who are specializing in research and studies in Central and Eastern Europe.

Lecture at University of Bologna
Lecture at University of Bologna

The main goal of the lecture was to lay out the possibilities that contemporary technology offers to researchers and to showcase a few simple text mining tasks in Orange. We analysed Trump’s and Clinton’s Twitter timeline and discovered that their tweets are highly distinct from one another and that you can easily find significant words they’re using in their tweets. Moreover, we’ve discovered that Trump is much better at social media than Clinton, creating highly likable and shareable content and inventing his own hashtags. Could that be a tell-tale sign of his recent victory?

Perhaps. Our future, data-mining savvy political scientists will decide. Below, you can see some examples of the workflows presented at the workshop.

bologna-workflow1
Author predictions from Tweet content. Logistic Regression reports on 92% classification accuracy and AUC score. Confusion Matrix can output misclassified tweets to Corpus Viewer, where we can inspect these tweets further.

 

bologna-wordcloud
Word Cloud from preprocessed tweets. We removed stopwords and punctuation to find frequencies for meaningful words only.

 

bologna-enrichment
Word Enrichment by Author. First we find Donald’s tweets with Select Rows and then compare them to the entire corpus in Word Enrichment. The widget outputs a ranked list of significant words for the provided subset. We do the same for Hillary’s tweets.

 

bologna-topicmodelling
Finding potential topics with LDA.

 

bologna-emotions
Finally, we offered a sneak peek of our recent Tweet Profiler widget. Tweet Profiler is intended for sentiment analysis of tweets and can output classes. probabilities and embeddings. The widget is not yet officially available, but will be included in the upcoming release.

Celebrity Lookalike or How to Make Students Love Machine Learning

Recently we’ve been participating at Days of Computer Science, organized by the Museum of Post and Telecommunications and the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. The project brought together pupils and students from around the country and hopefully showed them what computer science is mostly about. Most children would think programming is just typing lines of code. But it’s more than that. It’s a way of thinking, a way to solve problems creatively and efficiently. And even better, computer science can be used for solving a great variety of problems.

Related: On teaching data science with Orange

Orange team has prepared a small demo project called Celebrity Lookalike. We found 65 celebrity photos online and loaded them in Orange. Next we cropped photos to faces and turned them black and white, to avoid bias in background and color. Next we inferred embeddings with ImageNet widget and got 2048 features, which are the penultimate result of the ImageNet neural network.

We find faces in photos and turn them to black and white. This eliminates the effect of background and distinct colors for embeddings.
We find faces in photos and turn them to black and white. This eliminates the effect of the background and distinct colors for embeddings.

 

Still, we needed a reference photo to find the celebrity lookalike for. Students could take a selfie and similarly extracted black and white face out of it. Embeddings were computed and sent to Neighbors widget. Neighbors finds n closest neighbors based on the defined distance measure to the provided reference. We decided to output 10 closest neighbors by cosine distance.

workflow111
Celebrity Lookalike workflow. We load photos, find faces and compute embeddings. We do the same for our Webcam Capture. Then we find 10 closest neighbors and observe the results in Lookalike widget.

 

Finally, we used Lookalike widget to display the result. Students found it hilarious when curly boys were the Queen of England and girls with glasses Steve Jobs. They were actively trying to discover how the algorithm works by taking photo of a statue, person with or without glasses, with hats on or by making a funny face.

lookalike6181683

Hopefully this inspires a new generation of students to become scientists, researchers and to actively find solutions to their problems. Coding or not. 🙂

dsc_4982

Note: Most widgets we have designed for this projects (like Face Detector, Webcam Capture, and Lookalike) are available in Orange3-Prototypes and are not actively maintained. They can, however, be used for personal projects and sheer fun. Orange does not own the copyright of the images.

Top 100 Changemakers in Central and Eastern Europe

Recently Orange and one of its inventors, Blaž Zupan, have been recognized as one of the top 100 changemakers in the region. A 2016 New Europe 100 is an annual list of innovators and entrepreneurs in Central and Eastern Europe highlighting novel approaches to pressing problems.

Orange has been recognized for making data more approachable, which has been our goal from the get-go. The tool is continually being developed with the end user in mind – someone who wants to analyze his/her data quickly, visually, interactively, and efficiently. We’re always thinking hard how to expose valuable information in the data, how to improve the user experience, which defaults are the most appropriate for the method, and, finally, how to intuitively teach people about data mining.

This nomination is a great validation of our efforts and it only makes us work harder. Because every research should be fruitful and fun!

adv_data_mining-02

Orange at Eurostat’s Big Data Workshop

Eurostat’s Big Data Workshop recently took place in Ljubljana. In a presentation we have showcased Orange as a tool to teach data science.

The meeting was organised by Statistical Office of Slovenia and by Eurostat, a Statistical Office of the European Union, and was a primary gathering of representatives from national statistical institutes joined within European Statistical System. The meeting discussed possibilities that big data offers to modern statistics and the role it could play in statistical offices around the world. Say, can one use twitter data to measure costumer satisfaction? Or predict employment rates? Or use traffic information to predict GDP?

During the meeting, Philippe Nieuwbourg from Canada pointed out that the stack of tools for big data analysis, and actually the tool stack for data science, are rather big and are growing larger each day. There is no way that data owners can master data bases, warehouses, Python, R, web development stacks, and similar. Are we alienating the owners and users from their own sources of information?

Of course not. We were invited to the workshop to show that there are data science tools that can actually connect users and data, and empower the users to explore the data in the ways they have never dreamed before. We claimed that these tools should

  • spark the intuition,
  • offer powerful and interactive visualizations,
  • and offer flexibility in design of analysis workflows, say, through visual programming.

Related: Teaching data science with Orange

We claimed that with such tools, it takes only a few days to train users to master basic and intermediate concepts of data science. And we claimed that this could be done without diving into complex mathematics and statistics.

ess-big-data-for-owners

Part of our presentation was a demo in Orange that showed few tricks we use in such training. The presentation  included:

  • a case study of interactive data exploration by building and visualizing classification tree and forests, and mapping parts of the model to the projection in a scatter plot,
  • a demo how fun it is to draw a data set and then use it to teach about clustering,
  • a presentation how trained deep model can be used to explore and cluster images.

Related: Data Mining Course at Baylor College of Medicine in Houston

ess-big-data-cow

The Eurostat meeting was very interesting and packed with new ideas. Our thanks to Boro Nikić for inviting us, and thanks to attendees of our session for the many questions and requests we have received during presentation and after the meeting.