Preparing Scraped Data

One of the key questions of every data analysis is how to get the data and put it in the right form(at). In this post I’ll show you how to easily get the data from the web and transfer it to a file Orange can read.

Related: Creating a new data table in Orange through Python

 

First, we’ll have to do some scripting. We’ll use a couple of Python libraries – urllib.request for fetching the data, BeautifulSoup for parsing it, csv for writing it and regular expressions for extracting the right data.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

Ok, we’ve imported all the libraries we’ll need. Now we will scrape the data from our own blog to see how many posts we’ve written over the years.

html = urlopen('http://blog.biolab.si')
soup = BeautifulSoup(html.read(), 'lxml')

The first line opens the address of the site we want to scrape – in our case, our blog. The second line reads the HTML response, which is our raw text, and parses it with BeautifulSoup. It looks like this:

<aside id="archives-2" class="widget widget_archive"><h3 class="widget-title">Archives</h3>
<ul>
   <li><a href='http://blog.biolab.si/2017/01/'>January 2017</a>&nbsp;(1)</li>
   <li><a href='http://blog.biolab.si/2016/12/'>December 2016</a>&nbsp;(3)</li>
   <li><a href='http://blog.biolab.si/2016/11/'>November 2016</a>&nbsp;(4)</li>
   <li><a href='http://blog.biolab.si/2016/10/'>October 2016</a>&nbsp;(3)</li>
   <li><a href='http://blog.biolab.si/2016/09/'>September 2016</a>&nbsp;(2)</li>
   <li><a href='http://blog.biolab.si/2016/08/'>August 2016</a>&nbsp;(5)</li>
   <li><a href='http://blog.biolab.si/2016/07/'>July 2016</a>&nbsp;(3)</li>....

Ok, HTML is nice, but we can’t really do data analysis with this. We will have to transform this output into something sensible. How about .csv, a simple comma-delimited format Orange can recognize?

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')

We created a new file called 'scraped.csv' to which we will write our content (the 'w' parameter means write; newline='' keeps the csv module from inserting extra blank lines on some platforms). Then we defined the writer and set the delimiter to a comma.

Now we need to add the header row, so Orange will know what the column names are. We add this just after the csvwriter line, inside the with block.

csvwriter.writerow(["Date", "No_of_Blogs"])

Now we have two columns, one named 'Date' and the other 'No_of_Blogs'. The final step is to extract the data. We have a bunch of lines of HTML, but the part we’re interested in is an 'aside' section with the id 'archives-2'. We will first extract only this section (.find(id='archives-2')) and then get all the lines of the archive with the tag 'li' (.find_all('li')):

for item in soup.find(id="archives-2").find_all('li'):

This is the result of print(item).

<li><a href="http://blog.biolab.si/2017/01/">January 2017</a> (1)</li>

Now we need to get the actual content from each line. The first part we need is the date of the archived content. Orange can read dates, but they need to come in the right format. We will extract the date from the href attribute with item.a.get('href'). Then we need to extract only the digits from it, as we’re not interested in the rest of the link. We do this with a regex for finding digits:

date = re.findall(r'\d+', item.a.get('href'))

Regex’s findall function returns a list, in our case containing two items – the year and the month of the archived content. The second part of our data is the number of blog posts archived in a particular month. We will again extract this with a regex digit search, but this time we will be extracting it from the actual content – item.contents[1].

digits = re.findall(r'\d+', item.contents[1])

Finally, we need to write each line to the .csv file we created above.

csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])

Here, we formatted the date into an ISO-standard format Orange recognizes as a time variable ("%s-%s-01" % (date[0], date[1])) – for January 2017 this yields 2017-01-01 – while the second part is simply the count of our blog posts for that month.

This is the entire code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

html = urlopen('http://blog.biolab.si')
soup = BeautifulSoup(html.read(), 'lxml')

with open('scraped.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerow(["Date", "No_of_Blogs"])
    for item in soup.find(id="archives-2").find_all('li'):
        date = re.findall(r'\d+', item.a.get('href'))
        digits = re.findall(r'\d+', item.contents[1])
        csvwriter.writerow(["%s-%s-01" % (date[0], date[1]), digits[0]])

Related: Scripting with Time Variable

 

Now let’s load this in Orange. The File widget easily reads the .csv format and also correctly identifies the two column types, datetime and numeric. A quick glance into the Data Table…

Everything looks ok. We can use the Timeseries add-on to inspect how many blog posts we’ve written each month since 2010. Connect the As Timeseries widget to File. Orange will automatically suggest using Date as our time variable. Finally, we’ll plot the data with Line Chart. This is the curve of our blog activity.
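If you prefer scripting, the same file can also be loaded from Python. Here is a minimal sketch, assuming Orange 3’s scripting interface is available in your environment:

import Orange

data = Orange.data.Table("scraped.csv")
print(data.domain)  # 'Date' should come out as a time variable, 'No_of_Blogs' as numeric
print(data[:5])     # one row per archived month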

The example is extremely simple. A somewhat proficient user can extract much more interesting data than a simple blog count, but one always needs to keep in mind the legal aspects of web scraping. Nevertheless, this is a popular and fruitful way to extract and explore the data!

Data Preparation for Machine Learning

We’ve said it numerous times and we’re going to say it again. Data preparation is crucial for any data analysis. If your data is messy, there’s no way you can make sense of it, let alone a computer. Computers are great at handling large, even enormous, data sets, speedy computing and recognizing patterns. But they fail miserably if you give them the wrong input. Also, some classification methods work better with binary values, others with continuous ones, so it is important to know how to treat your data properly.

Orange is well equipped for such tasks.

 

Widget no. 1: Preprocess

Preprocess is there to handle a big share of your preprocessing tasks.

 

[Figure: original data]

 

  • It can normalize numerical data variables. Say we have a fictional data set of people employed in your company. We want to know which employees are more likely to go on holiday, based on their yearly income, years employed in your company and total years of experience in the industry. If you plot this in a Heat Map, you will see a bold yellow line at ‘yearly income’. This happens simply because yearly income has much higher values than years of experience or years employed in your company. You would naturally like the wage not to outweigh the rest of the feature set, so normalization is the way to go. Normalization transforms your values into relative terms, that is, depending on the type of normalization, onto a scale from, say, 0 to 1. Now the Heat Map neatly shows that people who’ve been employed longer and have a higher wage go on holiday more often. (Yes, this is a totally fictional data set, but you see the point. A scripting sketch of this step follows the list below.)

[Figure: Heat Map of the data without normalization]

[Figure: Heat Map of the normalized data]

 

  • It can impute missing values. Average or most-frequent-value imputation might seem overly simple, but it actually works most of the time. Also, all the learners that require imputation do it implicitly, so the user doesn’t have to configure yet another widget for that.
  • If you want to compare your results against a randomly mixed data set, select ‘Randomize’; and if you want to select relevant features, this is the widget for that as well.
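To make the above a bit more concrete, here is a minimal scripting sketch of the normalization and imputation steps, assuming Orange 3’s Orange.preprocess module; the file name employees.tab is a made-up stand-in for the fictional employee data set:

import Orange
from Orange.preprocess import Normalize, Impute

# hypothetical employee data set with 'yearly income', 'years employed', ...
data = Orange.data.Table("employees.tab")

# rescale every numeric feature to a comparable range (e.g. 0 to 1),
# so 'yearly income' no longer dominates the Heat Map
normalize = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized = normalize(data)

# replace missing values with per-feature averages / most frequent values
impute = Impute()
imputed = impute(normalized)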

Preprocessing needs to be used with caution and an understanding of your data, to avoid losing important information or, worse, overfitting the model. A good example is the case of paramedics, who usually don’t record the pulse if it is normal. Missing values here thus cannot be imputed with an average value or a random number; they should rather be recorded as a distinct value (normal pulse). Domain knowledge is always crucial for data preparation.

 

Widget no. 2: Discretize

For certain tasks you might want to resort to binning, which is what Discretize does. It effectively distributes your continuous values into a selected number of bins, thus making the variable discrete-like. You can either discretize all your data variables at once, using a selected discretization type, or select a particular discretization method for each attribute. The cool thing is that the transformation is already displayed in the widget, so you instantly know what you’re getting in the end. A good example of discretization would be a data set of your customers with their age recorded. It would make little sense to segment customers by each particular age, so binning them into 4 age groups (young, young adult, middle-aged, senior) would be a great solution. Also, some visualizations require feature transformation – Sieve Diagram is currently one such widget; Mosaic Display, however, already has the transformation implemented internally.
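The same binning can be done in a script. A minimal sketch, again assuming Orange 3’s Orange.preprocess module; customers.tab and the four equal-frequency bins are illustrative assumptions:

import Orange
from Orange.preprocess import Discretize
from Orange.preprocess.discretize import EqualFreq

# hypothetical customer data set with a continuous 'age' column
data = Orange.data.Table("customers.tab")

# bin every continuous feature into 4 intervals with roughly equal counts
discretize = Discretize(method=EqualFreq(n=4))
binned = discretize(data)
print(binned.domain)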

 

[Figure: original data]

 

[Figure: discretized data, with ‘years employed’ lower than, or higher than or equal to, 8 (same for ‘yearly income’ and ‘experience in the industry’)]

 

Widget no. 3: Continuize

This widget essentially creates new attributes out of your discrete ones. If you have, for example, an attribute with people’s eye color, where the values can be blue, brown or green, you would probably want three separate attributes ‘blue’, ‘green’ and ‘brown’ with 0 or 1 depending on whether a person has that eye color. Some learners perform much better if the data is transformed in this way. You can also treat one value as the base (normal) state and record only deviations from it – either the target or first value (‘target or first value as base’) or the most common value (‘most frequent value as base’). The Continuize widget offers you a lot of room to play. The best thing is to select a small data set with discrete values, connect it to Continuize and then further to a Data Table, and change the parameters – this way you can observe the transformations in real time. It is also useful for projecting discrete data points in Linear Projection.
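A rough scripting counterpart, again assuming Orange 3’s Orange.preprocess module (people.tab and the eye-color attribute are made up for illustration):

import Orange
from Orange.preprocess import Continuize

# hypothetical data set with discrete attributes such as 'eye color'
data = Orange.data.Table("people.tab")

# turn each multi-valued discrete attribute into 0/1 indicator columns,
# e.g. 'eye color' becomes 'eye color=blue', 'eye color=brown', 'eye color=green'
continuize = Continuize(multinomial_treatment=Continuize.Indicators)
continuized = continuize(data)
print(continuized.domain)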

 

[Figure: original data]

 

[Figure: continuized data with two new columns – the attribute ‘position’ was replaced by ‘position=office worker’ and ‘position=technical staff’ (same for ‘gender’)]

 

Widget no. 4: Purge Domain

Get a broom and sort your data! That’s what Purge Domain does. If all of the values of an attribute are constant, it will remove that attribute. If you have unused (empty) attributes in your data, it will remove them too. Effectively, you will get a nice and compact data set in the end.
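For scripting, the closest counterpart I know of is the Remove preprocessor; the sketch below assumes Orange 3’s Orange.preprocess.Remove and its RemoveConstant / RemoveUnusedValues flags (treat the exact names as assumptions and check the Orange.preprocess documentation), with messy.tab as a made-up file name:

import Orange
from Orange.preprocess import Remove

# hypothetical data set containing empty and constant columns
data = Orange.data.Table("messy.tab")

# drop constant attributes and unused values of discrete attributes
remove = Remove(attr_flags=Remove.RemoveConstant | Remove.RemoveUnusedValues)
purged = remove(data)
print(purged.domain)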

[Figure: original data]

 

[Figure: empty columns and columns with the same (constant) value were removed]

 

Of course, don’t forget to include all these procedures in your report with the ‘Report’ button! 🙂