Preparing Scraped Data

One of the key questions of every data analysis is how to get the data and put it in the right form(at). In this post I’ll show you how to easily get the data from the web and transfer it to a file Orange can read.

Related: Creating a new data table in Orange through Python


First, we’ll have to do some scripting. We’ll use a couple of Python libraries: urllib.request for fetching the data, BeautifulSoup for reading it, csv for writing it and regular expressions (re) for extracting the right data.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

Ok, we’ve imported all the libraries we’ll need. Now we will scrape the data from our own blog to see how many posts we’ve written throughout the years.

html = urlopen('')
soup = BeautifulSoup(html.read(), 'lxml')

The first line opens the address of the site we want to scrape. In our case this is our blog. The second line reads the HTML response from the site, which is our raw text, and parses it. The raw text looks like this:

<aside id="archives-2" class="widget widget_archive"><h3 class="widget-title">Archives</h3>
   <li><a href=''>January 2017</a>&nbsp;(1)</li>
   <li><a href=''>December 2016</a>&nbsp;(3)</li>
   <li><a href=''>November 2016</a>&nbsp;(4)</li>
   <li><a href=''>October 2016</a>&nbsp;(3)</li>
   <li><a href=''>September 2016</a>&nbsp;(2)</li>
   <li><a href=''>August 2016</a>&nbsp;(5)</li>
   <li><a href=''>July 2016</a>&nbsp;(3)</li>....

Ok, HTML is nice, but we can’t really do data analysis with this. We will have to transform this output into something sensible. How about .csv, a simple comma-delimited format Orange can recognize?

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')

We created a new file called ‘scraped.csv’ to which we will write our content (the ‘w’ parameter means write). Then we defined the writer and set the delimiter to comma.

Now we need to add the header row, so Orange will know what the column names are. We add this right after the line that defines csvwriter.

csvwriter.writerow(["Date", "No_of_Blogs"])

Now we have two columns, one named ‘Date’ and the other ‘No_of_Blogs’. The final step is to extract the data. We have a bunch of lines in HTML, but the ones we’re interested in are in the ‘aside’ section with the id ‘archives-2’. We will first extract only this section (.find(id='archives-2')) and then get all the lines of the archive with the tag ‘li’ (.find_all('li')):

for item in soup.find(id="archives-2").find_all('li'):

This is the result of print(item).

<li><a href="">January 2017</a> (1)</li>

Now we need to get the actual content from each line. The first part we need is the date of the archived content. Orange can read dates, but they need to come in the right format. We will extract the date from the href part with item.a.get('href'). Then we need to extract only the digits from it, as we’re not interested in the rest of the link. We do this with a regex for finding digits:

date = re.findall(r'\d+', item.a.get('href'))

The findall function returns a list, in our case containing two items: the year and the month of the archived content. The second part of our data is the number of blogs archived in a particular month. We will again extract this with a regex digit search, but this time we will be extracting data from the text that follows the link, item.contents[1].

digits = re.findall(r'\d+', item.contents[1])

Finally, we need to write each line to a .csv file we created above.

csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])

Here, we formatted the date into an ISO-standard format Orange recognizes as time variable (“%s-%s-01” % (date[0], date[1])), while the second part is simply a count of our blog posts.
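To see what the two regex searches return, here is a minimal sketch on a single made-up archive entry. The href values are not shown in this post, so the URL below (and its /YYYY/MM/ shape) is purely an assumption for illustration.

```python
import re

# Hypothetical archive entry: this URL is an assumption, not the blog's real link.
href = "http://example.com/2017/01/"
# The text node after the link: a non-breaking space followed by the count.
label = "\xa0(1)"

date = re.findall(r'\d+', href)      # every run of digits in the link
digits = re.findall(r'\d+', label)   # every run of digits in the trailing text

# Same formatting as in the writerow call above.
row = ["%s-%s-01" % (date[0], date[1]), digits[0]]
print(date, digits, row)
```

Note that findall returns strings, not numbers, which is fine here because we write them straight to a text file.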

This is the entire code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

html = urlopen('')
soup = BeautifulSoup(html.read(), 'lxml')

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerow(["Date", "No_of_Blogs"])
    for item in soup.find(id="archives-2").find_all('li'):
        date = re.findall(r'\d+', item.a.get('href'))
        digits = re.findall(r'\d+', item.contents[1])
        csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])
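Before loading the file in Orange, it can be worth checking that every row really has a valid date and a numeric count. The sketch below is one possible way to do that with the standard library only. The helper name read_scraped is ours, and the sample text simply mimics the layout of scraped.csv using two counts from the archive listing shown earlier.

```python
import csv
import io
from datetime import datetime

def read_scraped(fileobj):
    """Parse rows in the scraped.csv layout into (date, count) pairs."""
    rows = []
    for record in csv.DictReader(fileobj):
        # strptime raises ValueError if a date is not in YYYY-MM-DD form.
        day = datetime.strptime(record["Date"], "%Y-%m-%d").date()
        rows.append((day, int(record["No_of_Blogs"])))
    return rows

# In-memory stand-in for scraped.csv, so the sketch runs without the file.
sample = io.StringIO("Date,No_of_Blogs\n2017-01-01,1\n2016-12-01,3\n")
print(read_scraped(sample))
```

With the real file you would pass an open file instead, for example: with open('scraped.csv', newline='') as f: read_scraped(f).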

Related: Scripting with Time Variable


Now let’s load this in Orange. File widget can easily read .csv formats and also correctly identifies the two column types, datetime and numeric. A quick glance into the Data Table…

Everything looks ok. We can use the Timeseries add-on to inspect how many blogs we’ve written each month since 2010. Connect the As Timeseries widget to File. Orange will automatically suggest using Date as our time variable. Finally, we’ll plot the data with Line Chart. This is the curve of our blog activity.

The example is extremely simple. A somewhat proficient user can extract much more interesting data than a simple blog count, but one always needs to keep in mind the legal aspects of web scraping. Nevertheless, this is a popular and fruitful way to extract and explore the data!

Data Preparation for Machine Learning

We’ve said it numerous times and we’re going to say it again. Data preparation is crucial for any data analysis. If your data is messy, there’s no way you can make sense of it, let alone a computer. Computers are great at handling large, even enormous data sets, speedy computing and recognizing patterns. But they fail miserably if you give them the wrong input. Also, some classification methods work better with binary values, others with continuous ones, so it is important to know how to treat your data properly.

Orange is well equipped for such tasks.


Widget no. 1: Preprocess

Preprocess is there to handle a big share of your preprocessing tasks.


Original data.


  • It can normalize numerical data variables. Say we have a fictional data set of people employed in your company. We want to know which employees are more likely to go on holiday, based on the yearly income, years employed in your company and total years of experience in the industry. If you plot this in a heat map, you would see a bold yellow line at ‘yearly income’. This happens because yearly income has much higher values than years of experience or years employed by your company. You would naturally like the wage not to outweigh the rest of the feature set, so normalization is the way to go. Normalization will transform your values to relative terms, for example (depending on the type of normalization) to a scale from 0 to 1. Now Heat Map neatly shows that people who’ve been employed longer and have a higher wage more often go on holidays. (Yes, this is a totally fictional data set, but you see the point.)

  no normalization

   normalized data


  • It can impute missing values. Average or most frequent value imputation might seem overly simple, but it actually works most of the time. Also, all the learners that require imputation do it implicitly, so the user doesn’t have to configure yet another widget for that.
  • If you want to compare your results against a randomly mixed data set, select ‘Randomize’ or if you want to select relevant features, this is the widget for it.
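To make the normalization idea concrete, here is a minimal sketch of rescaling to the 0 to 1 range (often called min-max normalization). It is only one of several possible normalization types and is not Orange’s own implementation. The numbers are invented, in the spirit of the fictional employee data above.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Invented values: income is in the tens of thousands, years are below 30.
yearly_income = [24000, 36000, 60000]
years_employed = [2, 8, 20]

# After rescaling, both columns live on the same 0..1 scale.
print(min_max_normalize(yearly_income))
print(min_max_normalize(years_employed))
```

After this step neither column dominates just because of its units, which is what removes the bold yellow line from the heat map.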

Preprocessing needs to be used with caution and understanding of your data to avoid losing important information or, worse, overfitting the model. A good example is the case of paramedics, who usually don’t record pulse if it is normal. Missing values here thus cannot be imputed by an average value or a random number, but have to be treated as a distinct value (normal pulse). Domain knowledge is always crucial for data preparation.
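The difference between the two strategies is easy to show in a few lines. This is a sketch with invented pulse readings, where None stands for a value the paramedics did not record. The function names are ours.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def impute_as_category(values, label="normal"):
    """Keep missingness as its own value instead of guessing a number."""
    return [label if v is None else v for v in values]

# Invented readings: only abnormal pulses were written down.
pulse = [110, None, 130, None]

print(impute_mean(pulse))         # fills the gaps with the mean of abnormal readings
print(impute_as_category(pulse))  # marks the gaps as a distinct 'normal' value
```

Mean imputation would label the unrecorded (normal) patients with an abnormal pulse, which is exactly the mistake domain knowledge helps you avoid.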


Widget no. 2: Discretize

For certain tasks you might want to resort to binning, which is what Discretize does. It distributes your continuous values into a selected number of bins, thus making the variable discrete. You can either discretize all your data variables at once, using a selected discretization type, or select a particular discretization method for each attribute. The cool thing is that the transformation is already displayed in the widget, so you instantly know what you’re getting in the end. A good example of discretization would be a data set of your customers with their age recorded. It would make little sense to segment customers by each particular age, so binning them into 4 age groups (young, young-adult, middle-aged, senior) would be a great solution. Also, some visualizations require feature transformation. Sieve Diagram is currently one such widget. Mosaic Display, however, has the transformation already implemented internally.


original data


Discretized data with ‘years employed’ lower than or higher than/equal to 8 (same for ‘yearly income’ and ‘experience in the industry’).
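Binning with fixed cut points can be sketched in a few lines. The age thresholds below are our own assumption (the post only names the four groups), and the single cut at 8 mirrors the ‘years employed’ example above.

```python
from bisect import bisect_right

# Assumed cut points; the post does not say where the age groups begin.
AGE_CUTS = [25, 40, 60]
AGE_LABELS = ["young", "young-adult", "middle-aged", "senior"]

def discretize(value, cuts, labels):
    """Return the label of the bin that value falls into.

    A value equal to a cut point goes into the higher bin.
    """
    return labels[bisect_right(cuts, value)]

for age in [19, 25, 47, 71]:
    print(age, discretize(age, AGE_CUTS, AGE_LABELS))

# One threshold gives two bins: lower than 8, and higher than or equal to 8.
print(discretize(8, [8], ["< 8", ">= 8"]))
```

There always has to be one more label than there are cut points.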


Widget no. 3: Continuize

This widget essentially creates new attributes out of your discrete ones. If you have, for example, an attribute with people’s eye color, where values can be either blue, brown or green, you would probably want to have three separate attributes ‘blue’, ‘green’ and ‘brown’ with 0 or 1 if a person has that eye color. Some learners perform much better if data is transformed in such a way. You can also only have attributes where you would presume 0 is a normal condition and would only like to have deviations from the normal state recorded (‘target or first value as base’) or the normal state would be the most common value (‘most frequent value as base’). Continuize widget offers you a lot of room to play. Best thing is to select a small data set with discrete values, connect it to Continuize and then further to Data Table and change the parameters. This is how you can observe the transformations in real time. It is useful for projecting discrete data points in Linear Projection.


Original data.


Continuized data with two new columns – attribute ‘position’ was replaced by attributes ‘position=office worker’ and ‘position=technical staff’ (same for ‘gender’).
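The basic transformation is often called one-hot (indicator) encoding. Below is a minimal sketch of the idea, not the widget’s actual code. The drop_first option is our rough approximation of treating the first value as the base, and the column naming imitates the ‘attribute=value’ style seen above.

```python
def one_hot(name, values, drop_first=False):
    """Turn one discrete attribute into 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # The first category becomes the base and is encoded as all zeros.
        categories = categories[1:]
    header = ["%s=%s" % (name, c) for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return header, rows

eye_color = ["blue", "brown", "green", "brown"]

header, rows = one_hot("eye color", eye_color)
print(header)
for row in rows:
    print(row)

# With a base value we get one column fewer.
print(one_hot("eye color", eye_color, drop_first=True))
```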


Widget no. 4: Purge Domain

Get a broom and sort your data! That’s what Purge Domain does. If all of the values of some attributes are constant, it will remove these attributes. If you have unused (empty) attributes in your data, it will remove them. Effectively, you will get a nice and comprehensive data set in the end.

Original data.


Empty columns and columns with the same (constant) value were removed.
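The core of this cleanup can be sketched as follows. We represent a table as a dictionary of columns and use None for missing values. Both choices are ours for illustration, and the real widget does more than this.

```python
def purge_columns(table):
    """Drop columns that are empty or hold a single constant value.

    table maps a column name to a list of values; None marks a missing value.
    """
    kept = {}
    for name, column in table.items():
        distinct = {v for v in column if v is not None}
        # 0 distinct values: empty column; 1 distinct value: constant column.
        if len(distinct) > 1:
            kept[name] = column
    return kept

# Invented example table.
table = {
    "years employed": [2, 8, 20],
    "country": ["SI", "SI", "SI"],   # constant
    "notes": [None, None, None],     # empty
}
print(list(purge_columns(table)))
```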


Of course, don’t forget to include all these procedures into your report with the ‘Report’ button! 🙂

BDTN 2016 Workshop: Introduction to Data Science

Every year BEST Ljubljana organizes BEST Days of Technology and Sciences, an event hosting a broad variety of workshops, hackathons and lectures for the students of natural sciences and technology. Introduction to Data Science, organized by our own Laboratory for Bioinformatics, was this year one of them.

Related: Intro to Data Mining for Life Scientists

The task was to teach and explain basic data mining concepts and techniques in four hours. To complete beginners. Not daunting at all…

Luckily, we had Orange at hand. First, we showed how the program works and how to easily import data into the software. We created a poll using Google Forms on the fly and imported the results from Google Sheets into Orange.

To get the first impression of our data, we used Distributions and Scatter Plot. This was just to show how to approach the construction and simple visual exploration on any new data set. Then we delved deep into the workings of classification with Classification Tree and Tree Viewer and showed how easy it is to fall into the trap of overfitting (and how to avoid it). Another topic was clustering and how to relate similar data instances to one another. Finally, we had some fun with ImageAnalytics add-on and observed whether we can detect wrongly labelled microscopy images with machine learning.

Related: Data Mining Course in Houston #2

These workshops are not only fun, but an amazing learning opportunity for us as well, as they show how our users think and how to even further improve Orange.

Dimensionality Reduction by Manifold Learning

The new Orange release (v. 3.3.9) welcomed a few wonderful additions to its widget family, including Manifold Learning widget. The widget reduces the dimensionality of the high-dimensional data and is thus wonderful in combination with visualization widgets.

Manifold Learning widget has a simple interface with powerful features.


Manifold Learning widget offers five embedding techniques based on scikit-learn library: t-SNE, MDS, Isomap, Locally Linear Embedding and Spectral Embedding. They each handle the mapping differently and also have a specific set of parameters.

Related: Principal Component Analysis (video)

For example, a popular t-SNE requires only a metric (e.g. cosine distance). In the demonstration of this widget, we output 2 components, since they are the easiest to visualize and make sense of.

First, let’s load the data and open it in Scatter Plot. Not a very informative visualization, right? The dots form an unrecognizable square in 2D.

S-curve data in Scatter Plot. Data points form an uninformative square.


Let’s use embeddings to make things a bit more informative. This is what the data looks like with a t-SNE embedding. The data is starting to have a shape, and the data points colored according to the regression class reveal a beautiful gradient.

t-SNE embedding shows an S shape of the data.


Ok, how about MDS? This is beyond our expectations!



There’s a plethora of options with embeddings. You can play around with ImageNet embeddings and plot them in 2D or use any of your own high-dimensional data and discover interesting visualizations! Although t-SNE is nowadays probably the most popular dimensionality reduction technique used in combination with scatterplot visualization, do not underestimate the value of other manifold learning techniques. For one, we often find that MDS works fine as well.


Go, experiment!

Data Mining for Political Scientists

Being a political scientist, I had not even heard of data mining before I joined Biolab. And naturally, as with all good things, data mining started to grow on me. Give me some data, connect a bunch of widgets and see the magic happen!

But hold on! There are still many social scientists out there who haven’t yet heard about the wonderful world of data mining, text mining and machine learning. So I’ve made it my mission to spread the word. And that was the spirit that led me back to my former university – School of Political Sciences, University of Bologna.

University of Bologna is the oldest university in the world and has one of the best departments for political sciences in Europe. I held a lecture Digital Research – Data Mining for Political Scientists for MIREES students, who are specializing in research and studies in Central and Eastern Europe.

Lecture at University of Bologna

The main goal of the lecture was to lay out the possibilities that contemporary technology offers to researchers and to showcase a few simple text mining tasks in Orange. We analysed Trump’s and Clinton’s Twitter timeline and discovered that their tweets are highly distinct from one another and that you can easily find significant words they’re using in their tweets. Moreover, we’ve discovered that Trump is much better at social media than Clinton, creating highly likable and shareable content and inventing his own hashtags. Could that be a tell-tale sign of his recent victory?

Perhaps. Our future, data-mining savvy political scientists will decide. Below, you can see some examples of the workflows presented at the workshop.

Author predictions from Tweet content. Logistic Regression reports on 92% classification accuracy and AUC score. Confusion Matrix can output misclassified tweets to Corpus Viewer, where we can inspect these tweets further.


Word Cloud from preprocessed tweets. We removed stopwords and punctuation to find frequencies for meaningful words only.
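Behind a word cloud is a simple word count over cleaned-up text. Here is a rough sketch of that preprocessing using only the standard library. The stopword list is a tiny illustrative one and the two sentences are invented placeholders, not real tweets.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "and", "to", "of", "is", "in", "we", "will", "it"}

def word_frequencies(texts):
    """Count words after lowercasing, dropping punctuation and stopwords."""
    counts = Counter()
    for text in texts:
        # Keeping only runs of letters discards punctuation and digits.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

# Invented placeholder sentences.
texts = ["We will build the future!", "The future is bright, and we build it."]
print(word_frequencies(texts).most_common())
```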


Word Enrichment by Author. First we find Donald’s tweets with Select Rows and then compare them to the entire corpus in Word Enrichment. The widget outputs a ranked list of significant words for the provided subset. We do the same for Hillary’s tweets.


Finding potential topics with LDA.


Finally, we offered a sneak peek of our recent Tweet Profiler widget. Tweet Profiler is intended for sentiment analysis of tweets and can output classes, probabilities and embeddings. The widget is not yet officially available, but will be included in the upcoming release.

Celebrity Lookalike or How to Make Students Love Machine Learning

Recently we participated in Days of Computer Science, organized by the Museum of Post and Telecommunications and the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. The project brought together pupils and students from around the country and hopefully showed them what computer science is mostly about. Most children would think programming is just typing lines of code. But it’s more than that. It’s a way of thinking, a way to solve problems creatively and efficiently. And even better, computer science can be used for solving a great variety of problems.

Related: On teaching data science with Orange

Orange team has prepared a small demo project called Celebrity Lookalike. We found 65 celebrity photos online and loaded them in Orange. Next we cropped photos to faces and turned them black and white, to avoid bias in background and color. Next we inferred embeddings with ImageNet widget and got 2048 features, which are the penultimate result of the ImageNet neural network.

We find faces in photos and turn them to black and white. This eliminates the effect of the background and distinct colors for embeddings.


Still, we needed a reference photo to find the celebrity lookalike for. Students could take a selfie, from which we similarly extracted a black and white face. Embeddings were computed and sent to the Neighbors widget. Neighbors finds the n closest neighbors to the provided reference, based on the defined distance measure. We decided to output the 10 closest neighbors by cosine distance.
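The neighbor search itself is simple enough to sketch in plain Python. The three-dimensional vectors and the photo names below are made up to keep the example readable; the real embeddings have 2048 features.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(reference, candidates, n=10):
    """Return the names of the n candidates closest to the reference."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_distance(reference, candidates[name]),
    )
    return ranked[:n]

# Made-up toy embeddings standing in for the celebrity photos.
celebrities = {
    "photo_a": [2.0, 0.0, 0.0],
    "photo_b": [1.0, 1.0, 0.0],
    "photo_c": [0.0, 1.0, 0.0],
}
selfie = [1.0, 0.0, 0.0]
print(nearest(selfie, celebrities, n=2))
```

Cosine distance ignores vector length and looks only at direction, which is why photo_a counts as a perfect match here even though its values are twice as large.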

Celebrity Lookalike workflow. We load photos, find faces and compute embeddings. We do the same for our Webcam Capture. Then we find 10 closest neighbors and observe the results in Lookalike widget.


Finally, we used the Lookalike widget to display the result. Students found it hilarious when boys with curly hair turned out to be the Queen of England and girls with glasses Steve Jobs. They were actively trying to discover how the algorithm works by taking a photo of a statue, of a person with or without glasses, with a hat on or making a funny face.


Hopefully this inspires a new generation of students to become scientists, researchers and to actively find solutions to their problems. Coding or not. 🙂


Note: Most widgets we have designed for this project (like Face Detector, Webcam Capture, and Lookalike) are available in Orange3-Prototypes and are not actively maintained. They can, however, be used for personal projects and sheer fun. Orange does not own the copyright of the images.

Top 100 Changemakers in Central and Eastern Europe

Recently Orange and one of its inventors, Blaž Zupan, have been recognized among the top 100 changemakers in the region. New Europe 100 is an annual list of innovators and entrepreneurs in Central and Eastern Europe that highlights novel approaches to pressing problems, and we made the 2016 edition.

Orange has been recognized for making data more approachable, which has been our goal from the get-go. The tool is continually being developed with the end user in mind – someone who wants to analyze his/her data quickly, visually, interactively, and efficiently. We’re always thinking hard how to expose valuable information in the data, how to improve the user experience, which defaults are the most appropriate for the method, and, finally, how to intuitively teach people about data mining.

This nomination is a great validation of our efforts and it only makes us work harder. Because all research should be fruitful and fun!


10 Tips and Tricks for Using Orange

TIP #1: Follow tutorials and example workflows to get started.

It’s difficult to start using new software. Where does one start, especially a total novice in data mining? For this exact reason we’ve prepared Getting Started With Orange – YouTube tutorials for complete beginners. Example workflows on the other hand can be accessed via Help – Examples.


TIP #2: Make use of Orange documentation.

You can access it in three ways:

  1. Press F1 when the widget is selected. This will open the help screen.
  2. Select Widget – Help when the widget is selected. It works the same as above.
  3. Visit online documentation.


TIP #3: Embed your help screen.

Drag and drop help screen to the side of your Orange canvas. It will become embedded in the canvas. You can also make it narrower, allowing for a full-size analysis while exploring the docs.



TIP #4: Use right-click.

Right-click on the canvas and a widget menu will appear. Start typing the widget you’re looking for and press Enter when the widget becomes the top widget. This will place the widget onto the canvas immediately. You can also navigate the menu with up and down.


TIP #5: Turn off channel names.

Sometimes it is annoying to see channel names above widget links. If you’re already comfortable using Orange, you can turn them off in Options – Settings. Turn off ‘Show channel names between widgets’.


TIP #6: Hide control pane.

Once you’ve set the parameters, you’d probably want to focus just on visualizations. There’s a simple way to do this in Orange. Click on the split between the control pane and visualization pane – you should see a hand appearing instead of a cursor. Click and observe how the control pane gets hidden away neatly. To make it reappear, click the split again.




TIP #7: Label your data.

So you’ve plotted your data, but have no idea what you’re seeing. Use annotation! In some widgets you will see a drop-down menu called Annotation, while in others it will be called a Label. This will mark your data points with suitable labels, making your MDS plots and Scatter Plots much more informative. Scatter Plot also enables you to label only selected points for better clarity.


TIP #8: Find your plot.

Scrolled around and lost the plot? Zoomed in too much? To re-position the plot click ‘Reset zoom’ and the visualization will jump snugly into the visualization pane. Comes in handy when browsing the subsets and trying to see the bigger picture every now and then.




TIP #9: Reset widget settings.

Orange is geared to remember your last settings, thus assisting you in a rapid analysis. However, sometimes you need to start anew. Go to Options – Reset widget settings… and restart Orange. This will return Orange to its original state.


TIP #10: Use Educational add-on.

To learn about how some algorithms work, use Orange3-Educational add-on. It contains 4 widgets that will help you get behind the scenes of some famous algorithms. And since they’re interactive, they’re also a lot of fun!






The Story of Shadow and Orange

This is a long story. I remember when I started my PhD in Italy. There I met a researcher and he said to me: »You should do some simulations on an x-ray optics beamline.« »Yes, but how should I do that?« He gave me a big tape, it was 1986. I soon realized it was all code. But it was a code called Shadow.

I started to look at the code, to play with it, do some simulations… Soon my boss told me:

»You should do a simulation with asymmetric crystals for monochromators.«

»But asymmetric crystals are not foreseen in this code.«

»Yes, think about how to do it. You should contact Franco Cerrina, he’s the author of Shadow.«

I did contact prof. Cerrina, and at that time this was not easy, because there was no direct e-mail. What we had was called DECnet, the network of Digital computers. I had to go to another laboratory just to send him an e-mail. Soon, he replied: »Come to see me.« I managed to get some funding to go to the US and for the next two years I spent a good amount of time in Madison, Wisconsin.

I started to work with prof. Cerrina and it was thanks to my work on Shadow that I was called by the European Synchrotron Radiation Facility and they offered me a position. But soon I stopped working on Shadow, because I was getting busy with other things.

It was only in 2009 that I contacted prof. Cerrina again. We needed to upgrade our software, so I went back to the US two or three times and started working on what is now Shadow3.


In 2010 I organized a trip to go visit again with my family for the summer. We booked the house, we booked the trip… And it was ten days before the departure that I learned that Cerrina died. And since everything was already organized, we decided to visit the US anyway.

There, I went to Cerrina’s laboratory and met his PhD student, who was keeping his possessions. I said to her:

»Tell me everything you were doing recently and I will try to recover what I can.«

And at that moment, she said many things were on this big old Mac. So I proposed to buy this Mac from her, but my home institution wasn’t happy, they saw no reason to buy a second-hand Mac. Even though it contained some important things Cerrina was working on!

Luckily, I managed to get it and I was able to recover many things from it. Moreover, I kept maintaining the Shadow code, because it is standard software in the community. At the very beginning, the source was not public. Then it was eventually published, but the code was very complicated and nobody managed to recompile it. Thus I decided to clean the code and finally we completed the new version of Shadow in 2011.


Three years ago it was time to update Shadow again, especially the interface. One day I discovered Orange and I thought it looked nice. At that exact time I met Luca [Rebuffi] in Trieste. He got so excited about Orange that his PhD project became redesigning Shadow’s interface with Orange! And now we have OASYS, which is an adaptation of Orange for optical physics. So I hope that in the future, we will have many more users and also many more developers helping us bring simple tools to the scientific community.


— Manuel Sanchez del Rio

Text Mining: version 0.2.0

Orange3-Text has just recently been polished, updated and enhanced! Our GSoC student Alexey has helped us greatly to achieve another milestone in Orange development and release the latest 0.2.0 version of our text mining add-on. The new release, which is already available on PyPI, includes Wikipedia and SimHash widgets and an overhaul of Bag of Words, Topic Modeling and Corpus Viewer.


Wikipedia widget allows retrieving sources from Wikipedia API and can handle multiple queries. It serves as an easy data gathering source and it’s great for exploring text mining techniques. Here we’ve simply queried Wikipedia for articles on Slovenia and Germany and displayed them in Corpus Viewer.

Query Wikipedia by entering your query word list in the widget. Put each query on a separate line and run Search.


Similarity Hashing widget computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism or textual borrowing in the corpus. Here’s an example from Wikipedia, which has a pre-defined structure of articles, making our corpus quite similar. We’ve used Wikipedia widget and retrieved 10 articles for the query ‘Slovenia’. Then we’ve used Similarity Hashing to compute hashes for our text. What we got on the output is a table of 64 binary features (predefined in the SimHash widget), which denote a 64-bit hash size. Then we computed similarities in text by sending Similarity Hashing to Distances. Here we’ve selected cosine row distances and sent the output to Hierarchical Clustering. We can see that we have some similar documents, so we can select and inspect them in Corpus Viewer.

Output of Similarity Hashing widget.
We’ve selected the two most similar documents in Hierarchical Clustering and displayed them in Corpus Viewer.
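For the curious, here is a rough sketch of how a 64-bit similarity hash can be computed. It follows the classic SimHash idea and is not necessarily what the widget does internally. In particular, hashing each word with MD5 and splitting text on whitespace are our own assumptions.

```python
import hashlib

BITS = 64  # hash size, matching the 64 binary features described above

def simhash(text):
    """Compute a 64-bit SimHash fingerprint of a text."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # each bit is a majority vote over all tokens
            fingerprint |= 1 << i
    return fingerprint

def to_features(fingerprint):
    """Unpack a fingerprint into a list of 64 binary features."""
    return [(fingerprint >> i) & 1 for i in range(BITS)]

def hamming(a, b):
    """Number of bits in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc1 = simhash("Slovenia is a country in Central Europe")
doc2 = simhash("Slovenia is a small country in Central Europe")
print(hamming(doc1, doc2), sum(to_features(doc1)))
```

Similar texts share most of their tokens, so most bit votes come out the same and the fingerprints differ in only a few positions. That is what makes these binary features useful as input for Distances.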


Topic Modeling now includes three modeling algorithms, namely Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Process (HDP). Let’s query Twitter for the latest tweets from Hillary Clinton and Donald Trump. First we preprocess the data and send the output to Topic Modeling. The widget suggests 10 topics, with the most significant words denoting each topic, and outputs topic probabilities for each document.

We can inspect distances between the topics with Distances (cosine) and Hierarchical Clustering. Seems like topics are not extremely author specific, since Hierarchical Clustering often puts Trump and Clinton in the same cluster. We’ve used Average linkage, but you can play around with different linkages and see if you can get better results.

Example of comparing text by topics.


Now we connect Corpus Viewer to Preprocess Text. This is nothing new, but Corpus Viewer now also displays tokens and POS tags. Enable POS Tagger in Preprocess Text. Now open Corpus Viewer and tick the checkbox Show Tokens & Tags. This will display tagged tokens at the bottom of each document.

Corpus Viewer can now display tokens and POS tags below each document.


This is just a brief overview of what one can do with the new Orange text mining functionalities. Of course, these are just exemplary workflows. If you did textual analysis with great results using any of these widgets, feel free to share it with us! 🙂