Data Mining for Political Scientists

Being a political scientist, I did not even hear about data mining before I’ve joined Biolab. And naturally, as with all good things, data mining started to grow on me. Give me some data, connect a bunch of widgets and see the magic happen!

 

But hold on! There are still many social scientists out there who haven’t yet heard about the wonderful world of data mining, text mining and machine learning. So I’ve made it my mission to spread the word. And that was the spirit that led me back to my former university – School of Political Sciences, University of Bologna.

 

University of Bologna is the oldest university in the world and has one of the best departments for political sciences in Europe. I held a lecture Digital Research – Data Mining for Political Scientists for MIREES students, who are specializing in research and studies in Central and Eastern Europe.

Lecture at University of Bologna
Lecture at University of Bologna

 

The main goal of the lecture was to lay out the possibilities that contemporary technology offers to researchers and to showcase a few simple text mining tasks in Orange. We analysed Trump’s and Clinton’s Twitter timeline and discovered that their tweets are highly distinct from one another and that you can easily find significant words they’re using in their tweets. Moreover, we’ve discovered that Trump is much better at social media than Clinton, creating highly likable and shareable content and inventing his own hashtags. Could that be a tell-tale sign of his recent victory?

 

Perhaps. Our future, data-mining savvy political scientists will decide. Below, you can see some examples of the workflows presented at the workshop.

bologna-workflow1
Author predictions from Tweet content. Logistic Regression reports on 92% classification accuracy and AUC score. Confusion Matrix can output misclassified tweets to Corpus Viewer, where we can inspect these tweets further.

 

bologna-wordcloud
Word Cloud from preprocessed tweets. We removed stopwords and punctuation to find frequencies for meaningful words only.

 

bologna-enrichment
Word Enrichment by Author. First we find Donald’s tweets with Select Rows and then compare them to the entire corpus in Word Enrichment. The widget outputs a ranked list of significant words for the provided subset. We do the same for Hillary’s tweets.

 

bologna-topicmodelling
Finding potential topics with LDA.

 

bologna-emotions
Finally, we offered a sneak peek of our recent Tweet Profiler widget. Tweet Profiler is intended for sentiment analysis of tweets and can output classes. probabilities and embeddings. The widget is not yet officially available, but will be included in the upcoming release.

Celebrity Lookalike or How to Make Students Love Machine Learning

Recently we’ve been participating at Days of Computer Science, organized by the Museum of Post and Telecommunications and the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. The project brought together pupils and students from around the country and hopefully showed them what computer science is mostly about. Most children would think programming is just typing lines of code. But it’s more than that. It’s a way of thinking, a way to solve problems creatively and efficiently. And even better, computer science can be used for solving a great variety of problems.

Related: On teaching data science with Orange

Orange team has prepared a small demo project called Celebrity Lookalike. We found 65 celebrity photos online and loaded them in Orange. Next we cropped photos to faces and turned them black and white, to avoid bias in background and color. Next we inferred embeddings with ImageNet widget and got 2048 features, which are the penultimate result of the ImageNet neural network.

We find faces in photos and turn them to black and white. This eliminates the effect of background and distinct colors for embeddings.
We find faces in photos and turn them to black and white. This eliminates the effect of the background and distinct colors for embeddings.

 

Still, we needed a reference photo to find the celebrity lookalike for. Students could take a selfie and similarly extracted black and white face out of it. Embeddings were computed and sent to Neighbors widget. Neighbors finds n closest neighbors based on the defined distance measure to the provided reference. We decided to output 10 closest neighbors by cosine distance.

workflow111
Celebrity Lookalike workflow. We load photos, find faces and compute embeddings. We do the same for our Webcam Capture. Then we find 10 closest neighbors and observe the results in Lookalike widget.

 

Finally, we used Lookalike widget to display the result. Students found it hilarious when curly boys were the Queen of England and girls with glasses Steve Jobs. They were actively trying to discover how the algorithm works by taking photo of a statue, person with or without glasses, with hats on or by making a funny face.

lookalike6181683

Hopefully this inspires a new generation of students to become scientists, researchers and to actively find solutions to their problems. Coding or not. 🙂

dsc_4982

Note: Most widgets we have designed for this projects (like Face Detector, Webcam Capture, and Lookalike) are available in Orange3-Prototypes and are not actively maintained. They can, however, be used for personal projects and sheer fun. Orange does not own the copyright of the images.

Top 100 Changemakers in Central and Eastern Europe

Recently Orange and one of its inventors, Blaž Zupan, have been recognized as one of the top 100 changemakers in the region. A 2016 New Europe 100 is an annual list of innovators and entrepreneurs in Central and Eastern Europe highlighting novel approaches to pressing problems.

Orange has been recognized for making data more approachable, which has been our goal from the get-go. The tool is continually being developed with the end user in mind – someone who wants to analyze his/her data quickly, visually, interactively, and efficiently. We’re always thinking hard how to expose valuable information in the data, how to improve the user experience, which defaults are the most appropriate for the method, and, finally, how to intuitively teach people about data mining.

This nomination is a great validation of our efforts and it only makes us work harder. Because every research should be fruitful and fun!

adv_data_mining-02

Orange at Eurostat’s Big Data Workshop

Eurostat’s Big Data Workshop recently took place in Ljubljana. In a presentation we have showcased Orange as a tool to teach data science.

The meeting was organised by Statistical Office of Slovenia and by Eurostat, a Statistical Office of the European Union, and was a primary gathering of representatives from national statistical institutes joined within European Statistical System. The meeting discussed possibilities that big data offers to modern statistics and the role it could play in statistical offices around the world. Say, can one use twitter data to measure costumer satisfaction? Or predict employment rates? Or use traffic information to predict GDP?

During the meeting, Philippe Nieuwbourg from Canada pointed out that the stack of tools for big data analysis, and actually the tool stack for data science, are rather big and are growing larger each day. There is no way that data owners can master data bases, warehouses, Python, R, web development stacks, and similar. Are we alienating the owners and users from their own sources of information?

Of course not. We were invited to the workshop to show that there are data science tools that can actually connect users and data, and empower the users to explore the data in the ways they have never dreamed before. We claimed that these tools should

  • spark the intuition,
  • offer powerful and interactive visualizations,
  • and offer flexibility in design of analysis workflows, say, through visual programming.

Related: Teaching data science with Orange

We claimed that with such tools, it takes only a few days to train users to master basic and intermediate concepts of data science. And we claimed that this could be done without diving into complex mathematics and statistics.

ess-big-data-for-owners

Part of our presentation was a demo in Orange that showed few tricks we use in such training. The presentation  included:

  • a case study of interactive data exploration by building and visualizing classification tree and forests, and mapping parts of the model to the projection in a scatter plot,
  • a demo how fun it is to draw a data set and then use it to teach about clustering,
  • a presentation how trained deep model can be used to explore and cluster images.

Related: Data Mining Course at Baylor College of Medicine in Houston

ess-big-data-cow

The Eurostat meeting was very interesting and packed with new ideas. Our thanks to Boro Nikić for inviting us, and thanks to attendees of our session for the many questions and requests we have received during presentation and after the meeting.

10 Tips and Tricks for Using Orange

TIP #1: Follow tutorials and example workflows to get started.

It’s difficult to start using new software. Where does one start, especially a total novice in data mining? For this exact reason we’ve prepared Getting Started With Orange – YouTube tutorials for complete beginners. Example workflows on the other hand can be accessed via Help – Examples.

 

TIP #2: Make use of Orange documentation.

You can access it in three ways:

  1. Press F1 when the widget is selected. This will open help screen.
  2. Select Widget – Help when the widget is selected. It works the same as above.
  3. Visit online documentation.

 

TIP #3: Embed your help screen.

Drag and drop help screen to the side of your Orange canvas. It will become embedded in the canvas. You can also make it narrower, allowing for a full-size analysis while exploring the docs.

embed-help

 

TIP #4: Use right-click.

Right-click on the canvas and a widget menu will appear. Start typing the widget you’re looking for and press Enter when the widget becomes the top widget. This will place the widget onto the canvas immediately. You can also navigate the menu with up and down.

 

TIP #5: Turn off channel names.

Sometimes it is annoying to see channel names above widget links. If you’re already comfortable using Orange, you can turn them off in Options – Settings. Turn off ‘Show channel names between widgets’.

 

TIP #6: Hide control pane.

Once you’ve set the parameters, you’d probably want to focus just on visualizations. There’s a simple way to do this in Orange. Click on the split between the control pane and visualization pane – you should see a hand appearing instead of a cursor. Click and observe how the control pane gets hidden away neatly. To make it reappear, click the split again.

panel1

panel2

 

TIP #7: Label your data.

So you’ve plotted your data, but have no idea what you’re seeing. Use annotation! In some widgets you will see a drop-down menu called Annotation, while in others it will be called a Label. This will mark your data points with suitable labels, making your MDS plots and Scatter Plots much more informative. Scatter Plot also enables you to label only selected points for better clarity.

 

TIP #8: Find your plot.

Scrolled around and lost the plot? Zoomed in too much? To re-position the plot click ‘Reset zoom’ and the visualization will jump snugly into the visualization pane. Comes in handy when browsing the subsets and trying to see the bigger picture every now and then.

zoom-pan

 

 

TIP #9: Reset widget settings.

Orange is geared to remember your last settings, thus assisting you in a rapid analysis. However, sometimes you need to start anew. Go to Options – Reset widget settings… and restart Orange. This will return Orange to its original state.

 

TIP #10: Use Educational add-on.

To learn about how some algorithms work, use Orange3-Educational add-on. It contains 4 widgets that will help you get behind the scenes of some famous algorithms. And since they’re interactive, they’re also a lot of fun!

educational

 

 

 

 

The Story of Shadow and Orange

This is a long story. I remember when started my PhD in Italy. There I met a researcher and he said to me: »You should do some simulations on x-ray optics beamline.« »Yes, but how should I do that?« He gave me a big tape, it was 1986. I soon realized it was all code. But it was a code called Shadow.

I started to look at the code, to play with it, do some simulations… Soon my boss told me:

»You should do a simulation with asymmetric crystals for monochromators.«

»But asymmetric crystals are not foreseen in this code.«

»Yes, think about how to do it. You should contact Franco Cerrina, he’s the author of Shadow.«

I indeed contacted prof. Cerrina and at that time this was not easy, because there was no direct e-mail. What we had was called a digital deck net, Digital Computers Network. I had to go to another laboratory just to send him an e-mail. Soon, he replied: »Come to see me.« I managed to get some funding to go to the US and for the next two years I spent a good amount of time in Madison, Wisconsin.

I started to work with prof. Cerrina and it was thanks to my work on Shadow that I was called by the European Synchrotron Facility and they offered me a position. But soon I stopped working on Shadow, because I was getting busy with other things.

It was only in 2009 that I contacted prof. Cerrina again. We needed to upgrade our software, so I went back to the US two or three times and started working on what is now Shadow3.

 

In 2010 I organized a trip to go visit again with my family for the summer. We booked the house, we booked the trip… And it was ten days before the departure that I learned that Cerrina died. And since everything was already organized, we decided to visit the US anyway.

There, I went to Cerrina’s laboratory and met his PhD student, who was keeping his possessions. I said to her:

»Tell me everything you were doing recently and I will try to recover what I can.«

And at that moment, she said many things were on this big old Mac. So I proposed to buy this Mac from her, but my home institution wasn’t happy, they saw no reason to buy a second-hand Mac. Even though it contained some important things Cerrina was working on!

Luckily, I managed to get it and I was able to recover many things from it. Moreover, I kept maintaining the Shadow code, because it is a standard software in the community. At the very beginning, the source was not public. Then it was eventually published, but the code was very complicated and nobody managed to recompile that. Thus I decided to clean the code and finally we completed the new version of Shadow in 2011.

 

Three years ago it was time to update Shadow again, especially the interface. One day I discovered Orange and I thought ‘it looked nice’. In that exact time I met Luca [Rebuffi] in Trieste. He got so excited about Orange that his PhD project became redesigning Shadow’s interface with Orange! And now we have OASYS, which is an adaptation of Orange for optical physics. So I hope that in the future, we will have many more users and also many more developers helping us bring simple tools to the scientific community.

 

— Manuel Sanchez del Rio

Intro to Data Mining for Life Scientists

RNA Club Munich has organized Molecular Life of Stem Cells Conference in Ljubljana this past Thursday, Friday and Saturday. They asked us to organize a four-hour workshop on data mining. And here we were: four of us, Ajda, Anze, Marko and myself (Blaz) run a workshop for 25 students with molecular biology and biochemistry background.

img_20160929_133840

We have covered some basic data visualization, modeling (classification) and model scoring, hierarchical clustering and data projection, and finished with a touch of deep-learning by diving into image analysis by deep learning-based embedding.

Related: Data Mining Course at Baylor College of Medicine in Houston

It’s not easy to pack so many new things on data analytics within four hours, but working with Orange helps. This was a hands-on workshop. Students brought their own laptops with Orange and several of its add-ons for bioinformatics and image analytics. We also showed how to prepare one’s own data using Google Forms and designed a questionary, augment it in a class, run it with students and then analyze the questionary with Orange.

pano_20160929_113352

img_0355

img_0353

The hard part of any short course that includes machine learning is how to explain overfitting. The concept is not trivial for data science newcomers, but it is so important it simply cannot be left out. Luckily, Orange has some cool widgets to help us understanding the overfitting. Below is a workflow we have used. We read some data (this time it was a yeast gene expression data set called brown-selected that comes with Orange), “destroyed the data” by randomly permuting the column with class values, trained a classification tree, and observed near perfect results when the model was checked on the training data.

yeast-overfitting-distributions

Sure this works, you are probably saying. The models should have been scored on a separate test set! Exactly, and this is what we have done next with Data Sampler, which lead us to cross-validation and Test & Score widget.

This was a great and interesting short course and we were happy to contribute to the success of the student-run MLSC-2016 conference.

Text Mining: version 0.2.0

Orange3-Text has just recently been polished, updated and enhanced! Our GSoC student Alexey has helped us greatly to achieve another milestone in Orange development and release the latest 0.2.0 version of our text mining add-on. The new release, which is already available on PyPi, includes Wikipedia and SimHash widgets and a rehaul of Bag of Words, Topic Modeling and Corpus Viewer.

 

Wikipedia widget allows retrieving sources from Wikipedia API and can handle multiple queries. It serves as an easy data gathering source and it’s great for exploring text mining techniques. Here we’ve simply queried Wikipedia for articles on Slovenia and Germany and displayed them in Corpus Viewer.

wiki1
Query Wikipedia by entering your query word list in the widget. Put each query on a separate line and run Search.

 

Similarity Hashing widget computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism or textual borrowing in the corpus. Here’s an example from Wikipedia, which has a pre-defined structure of articles, making our corpus quite similar. We’ve used Wikipedia widget and retrieved 10 articles for the query ‘Slovenia’. Then we’ve used Similarity Hashing to compute hashes for our text. What we got on the output is a table of 64 binary features (predefined in the SimHash widget), which denote a 64-bit hash size. Then we computed similarities in text by sending Similarity Hashing to Distances. Here we’ve selected cosine row distances and sent the output to Hierarchical Clustering. We can see that we have some similar documents, so we can select and inspect them in Corpus Viewer.

simhash1
Output of Similarity Hashing widget.
simhash
We’ve selected the two most similar documents in Hierarchical Clustering and displayed them in Corpus Viewer.

 

Topic Modeling now includes three modeling algorithms, namely Latent Semantic Indexing (LSP), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Process (HDP). Let’s query Twitter for the latest tweets from Hillary Clinton and Donald Trump. First we preprocess the data and send the output to Topic Modeling. The widget suggests 10 topics, with the most significant words denoting each topic, and outputs topic probabilities for each document.

We can inspect distances between the topics with Distances (cosine) and Hierarchical Clustering. Seems like topics are not extremely author specific, since Hierarchical Clustering often puts Trump and Clinton in the same cluster. We’ve used Average linkage, but you can play around with different linkages and see if you can get better results.

topic-modelling
Example of comparing text by topics.

 

Now we connect Corpus Viewer to Preprocess Text. This is nothing new, but Corpus Viewer now displays also tokens and POS tags. Enable POS Tagger in Preprocess Text. Now open Corpus Viewer and tick the checkbox Show Tokens & Tags. This will display tagged token at the bottom of each document.

corpusviewer
Corpus Viewer can now display tokens and POS tags below each document.

 

This is just a brief overview of what one can do with the new Orange text mining functionalities. Of course, these are just exemplary workflows. If you did textual analysis with great results using any of these widgets, feel free to share it with us! 🙂

Data Mining Course in Houston #2

This was already the second installment of Introduction to Data Mining Course at Baylor College of Medicine in Houston, Texas. Just like the last year, the course was packed. About 50 graduate students, post-docs and a few faculty attended, making the course one of the largest elective PhD courses from over a hundred offered at this prestigious medical school.

houston-class-2016

The course was designed for students with little or no experience in data science. It consisted of seven two-hour lectures, each followed by a homework assignment. We (Blaz and Janez) lectured on data visualization, classification, regression, clustering, data projection and image analytics. We paid special attention to the problems of overfitting, use of regularization, and proper ways of testing and scoring of modeling methods.

The course was hands-on. The lectures were practical. They typically started with some data set and explained data mining techniques through designing data analysis workflows in Orange. Besides some standard machine learning and bioinformatics data sets, we have also painted the data to explore, say, the benefits of different classification techniques or design data sets where k-means clustering would fail.

This year, the course benefited from several new Orange widgets. The recently published interactive k-means widget was used to explain the inner working of this clustering algorithm, and polynomial classification widget was helpful in discussion of decision boundaries of classification algorithms. Silhouette plot was used to show how to evaluate and explore the results of clustering. And finally, we explained concepts from deep learning using image embedding to show how already trained networks can be used for clustering and classification of images.

Image Analytics

Visualizing Gradient Descent

This is a guest blog from the Google Summer of Code project.

 

Gradient Descent was implemented as a part of my Google Summer of Code project and it is available in the Orange3-Educational add-on. It simulates gradient descent for either Logistic or Linear regression, depending on the type of the input data. Gradient descent is iterative approach to optimize model parameters that minimize the cost function. In machine learning, the cost function corresponds to prediction error when the model is used on the training data set.

Gradient Descent widget takes data on input and outputs the model and its coefficients.

gradient-descent-flow

The widget displays the value of the cost function given two parameters of the model. For linear regression, we consider feature from the training set with the parameters being the intercept and the slope. For logistic regression, the widget considers two feature and their associated multiplicative parameters, setting the intercept to zero. Screenshot bellow shows gradient descent on a Iris data set, where we consider petal length and sepal width on the input and predict the probability that iris comes from the family of Iris versicolor.

gradient-descent1-stamped

  1. The type of the model used (either Logistic regression or Linear regression)
  2. Input features (one for X and one for Y axis) and the target class
  3. Learning rate is the step size of the gradient descent
  4. In a single iteration step, stochastic approach considers only a single data instance (instead of entire training set). Convergence in terms of iterations steps is slower, and we can instruct the widget to display the progress of optimization only after given number of steps (Step size)
  5. Step through the algorithm (steps can be reverted with step back button)
  6. Run optimization until convergence

 

Following shows gradient descent for linear regression using The Boston Housing Data Set when trying to predict the median value of a house given its age.

gradient-descent-age

On the left we use the regular and on the right the stochastic gradient descent. While the regular descent goes straight to the target, the path of stochastic is not as smooth.

We can use the widget to simulate some dangerous, unwanted behavior of gradient descent. The following screenshots show two extreme cases with too high learning rate where optimization function never converges, and a low learning rate where convergence is painfully slow.

gradient-descent-extrems

The two problems as illustrated above are the reason that many implementations of numerical optimization use adaptive learning rates. We can simulate this in the widget by modifying the learning rate for each step of the optimization.