# Overfitting and Regularization

A week ago I used Orange to explain the effects of regularization. This was the second lecture in the Data Mining class; the first one was on linear regression. My introduction to the benefits of regularization used a simple data set with a single input attribute and a continuous class. I drew a data set in Orange, and then used the Polynomial Regression widget (from the Prototypes add-on) to plot the linear fit. This widget can also expand the data set by adding columns with powers of the original attribute x, thereby augmenting the training set with x^p, where p is an integer going from 2 to K. The polynomial expansion of the data set allows a linear regression model to fit the data nicely and, with higher K, to overfit it to the extreme, especially if the number of data points in the training set is low.
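For readers who prefer scripting, here is a minimal sketch of the same idea in plain numpy. It is not the code behind the Polynomial Regression widget; the helper names (`expand_polynomial`, `fit_linear`, `predict`) and the toy sine-shaped data are assumptions made for illustration.

```python
import numpy as np

def expand_polynomial(x, degree):
    """Return a matrix whose columns are x^1, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply a model returned by fit_linear to the expanded data X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

# Hypothetical toy data: 10 noisy points from a sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)

train_rmse = {}
for k in (1, 3, 9):
    X = expand_polynomial(x, k)
    coef = fit_linear(X, y)
    train_rmse[k] = float(np.sqrt(np.mean((predict(coef, X) - y) ** 2)))
    # Report the training error and the largest coefficient for each degree.
    print(k, train_rmse[k], float(np.abs(coef).max()))
```

With ten points and K = 9 the model has as many parameters as data points, so the training error drops to practically zero, which is exactly the overfitting the lecture demonstrates.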

We blogged about this experiment a while ago, showing that linear regression coefficients blow out of proportion with increasing K. This leads to the idea that linear regression should not only minimize the squared error when predicting the value of the dependent variable in the training set, but also keep model coefficients low, or better, penalize any high coefficient values. This procedure is called regularization. Based on the type of penalty (sum of absolute values of the coefficients or sum of their squares), the regularization is referred to as L1 or L2, or as lasso and ridge regression, respectively.
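To see the penalty at work outside of Orange, the sketch below fits L2-regularized (ridge) regression in closed form and prints the size of the coefficient vector for a few penalty strengths. The function `ridge_fit`, the toy data and the chosen values of `alpha` are assumptions for illustration, not what the Orange widgets run internally.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||y - b0 - X b||^2 + alpha * ||b||^2; the intercept b0 is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering takes care of the intercept
    b = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ b, b

# Hypothetical toy data: 10 noisy points, expanded to powers x^1 ... x^9.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)
X = np.column_stack([x ** p for p in range(1, 10)])

norms = {}
for alpha in (1e-6, 1e-3, 1.0):
    _, b = ridge_fit(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(b))
    print(f"alpha={alpha:g}  ||b||={norms[alpha]:.2f}")
```

The stronger the penalty, the smaller the coefficient vector becomes; the price is a somewhat larger error on the training data.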

It is quite easy to play with regularized models in Orange by attaching a Linear Regression widget to Polynomial Regression, thereby replacing the default model used in Polynomial Regression with the one configured in the Linear Regression widget. This makes different kinds of regularization available. The workflow can be used to show that regularized models overfit the data less, and that the degree of overfitting depends on the regularization coefficient, which governs how strongly large coefficients of the linear model are penalized.

I also use this workflow to show the difference between L1 and L2 regularization. The change in the type of regularization is most pronounced in the table of coefficients (Data Table widget), where it is clear that L1 regularization sets many of them to 0. Try this with a high degree of polynomial expansion and a data set with about 10 data points. Also, try changing the regularization strength (Linear Regression widget).
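The same contrast can be sketched in a script. Below, L1 regularization is fitted with a simple coordinate descent and L2 with its closed form, and we count the coefficients that end up exactly at zero. This is a didactic sketch, not the solver used by Orange; the function names, the toy data and the penalty strength `alpha = 0.1` are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_fit(X, y, alpha, n_sweeps=1000):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + alpha * ||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.copy()
    for _ in range(n_sweeps):
        for j in range(d):
            residual += X[:, j] * b[j]       # take feature j out of the fit
            rho = X[:, j] @ residual / n     # its correlation with what is left
            b[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * b[j]       # put the updated feature back
    return b

def ridge_fit(X, y, alpha):
    """Closed form for (1/2n) * ||y - X b||^2 + (alpha/2) * ||b||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + alpha * np.eye(d), X.T @ y / n)

# Hypothetical toy data: 10 points, polynomial expansion of degree 9.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)
X = np.column_stack([x ** p for p in range(1, 10)])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize so the penalty treats columns alike
y = y - y.mean()                             # centering replaces the intercept

b_l1 = lasso_fit(X, y, alpha=0.1)
b_l2 = ridge_fit(X, y, alpha=0.1)
n_zero_lasso = int(np.sum(b_l1 == 0))
n_zero_ridge = int(np.sum(b_l2 == 0))
print("zero coefficients  L1:", n_zero_lasso, " L2:", n_zero_ridge)
```

The soft-thresholding step is what produces exact zeros under L1; the L2 solution only shrinks coefficients and generally leaves all of them non-zero.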

While the effects of overfitting and regularization are nicely visible in the plot in the Polynomial Regression widget, machine learning models are really about predictions, and the quality of predictions should be estimated on an independent test set. So at this stage of the lecture I needed to introduce model scoring, that is, a measure that tells me how well a model inferred from the training set performs on the test set. For simplicity, I chose to introduce root mean squared error (RMSE) and then crafted the following workflow.

Here, I draw the data set (Paint Data, about 20 data instances), assign y as the target variable (Select Columns), split the data into training and test sets of approximately equal sizes (Data Sampler), and pass the training data, the test data and the linear model to the Test & Score widget. Then I can use linear regression with no regularization and observe how RMSE changes with the degree of the polynomial. I can alternate between Test on train data and Test on test data (Test & Score widget). In class I used the blackboard to record this dependency. For the data from the figure, I got the following table:

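| Poly K | RMSE Train | RMSE Test |
|--------|------------|-----------|
| 0      | 0.147      | 0.138     |
| 1      | 0.155      | 0.192     |
| 2      | 0.049      | 0.063     |
| 3      | 0.049      | 0.063     |
| 4      | 0.049      | 0.067     |
| 5      | 0.040      | 0.408     |
| 6      | 0.040      | 0.574     |
| 7      | 0.033      | 2.681     |
| 8      | 0.001      | 5.734     |
| 9      | 0.000      | 4.776     |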
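A table of this kind can also be produced in a script. The sketch below mimics the workflow (random split in half, unregularized fit, RMSE on both halves) with plain numpy. The data are synthetic and the numbers it prints will differ from those in the table above, which came from a hand-painted data set.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical stand-in for the painted data: 20 noisy points from a sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=20)

# Split into training and test halves, as Data Sampler would.
idx = rng.permutation(20)
train, test = idx[:10], idx[10:]

scores = []
for k in range(10):
    # Columns x^0 ... x^k; the x^0 column plays the role of the intercept.
    A_train = np.vander(x[train], k + 1, increasing=True)
    A_test = np.vander(x[test], k + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    scores.append((k, rmse(y[train], A_train @ coef), rmse(y[test], A_test @ coef)))

print("K  train  test")
for k, r_train, r_test in scores:
    print(f"{k}  {r_train:.3f}  {r_test:.3f}")
```

As in the class, the training error keeps falling with K while the test error eventually rises.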

That’s it. For a class of computer scientists, one may do all this in scripting, but for any other audience, or for any introductory lesson, explaining regularization with Orange widgets is a lot of fun.

# Model-Based Feature Scoring

Feature scoring and ranking can help in understanding the data in supervised settings. Orange includes a number of standard feature scoring procedures that one can access in the Rank widget. Moreover, a number of modeling techniques, like linear or logistic regression, can rank features explicitly through the assignment of weights. Trained models like random forests have their own methods for feature scoring. Models inferred by these modeling techniques depend on their parameters, like the type and level of regularization for logistic regression. The same holds for feature weights: any change in the parameters of a modeling technique changes the resulting feature scores.

It would thus be great if we could observe these changes and compare feature ranking provided by various machine learning methods. For this purpose, the Rank widget recently got a new input channel called scorer. We can attach any learner that can provide feature scores to the input of Rank, and then observe the ranking in the Rank table.

Say, for the famous voting data set (File widget, Browse documentation data sets), the last two feature score columns were obtained by random forest and logistic regression with L1 regularization (C=0.1). Try changing the regularization parameter and type to see changes in feature scores.

Feature weights for logistic and linear regression correspond to the absolute values of the coefficients of their linear models. To observe their untransformed values in the table, these widgets now also output a data table with feature weights. (At the time of writing, this feature has been implemented for linear regression; other classifiers and regressors that can estimate feature weights will be updated soon.)
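The idea of scoring features by the absolute value of linear model coefficients can be sketched in a few lines of numpy. This is an illustration only, not the code of the Rank widget; the synthetic data, the feature names and the choice to standardize the features first are assumptions.

```python
import numpy as np

# Hypothetical data: four features, of which only "a" and "c" influence the target.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.5, size=n)
names = ["a", "b", "c", "d"]

# Standardize the features so that coefficient magnitudes are comparable.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit linear regression with an intercept.
A = np.column_stack([np.ones(n), Xs])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

weights = coef[1:]                 # untransformed (signed) feature weights
scores = np.abs(weights)           # feature scores: absolute values of the weights
ranking = [names[j] for j in np.argsort(-scores)]

for name in ranking:
    j = names.index(name)
    print(f"{name}: score={scores[j]:.3f}, weight={weights[j]:+.3f}")
```

The signed weights carry the direction of the effect, while the scores only say how much a feature matters to the model, which is why it can be useful to look at both.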

# Data Mining Course in Houston

We have just completed Introduction to Data Mining, a graduate course at Baylor College of Medicine in Houston, Texas. The course was given in September and consisted of seven two-hour lectures, each followed by a homework assignment. The course was attended by about 40 students and some faculty and research staff.

This was a challenging course. The audience was new to data mining, and we decided to teach them with the newest, third version of Orange. We also experimented with two course instructors (Blaz and Janez), who, instead of splitting the course into two parts, taught simultaneously, one on the board and the other one helping the students with hands-on exercises. To check whether this worked fine, we ran a student survey at the end of the course. We used Google Sheets and then examined the results with students in the class. Using Orange, of course.

And the outcome? Looks like the students really enjoyed the course and the teaching style.

The course took advantage of several new widgets in Orange 3, including those for data preprocessing and polynomial regression. The core development team put a lot of effort over the summer into debugging and polishing this newest version of Orange. Thanks also to the financial support of the AXLE EU FP7 and CARE-MI EU FP7 grants and grants by the Slovene Research Agency, we were able to finish everything in time.

# Orange in Pavia, Italy

These days, we (Blaz Zupan and Marinka Zitnik, with the full background support of the entire Bioinformatics Lab) are running a three-day course on Data Mining in Python. Riccardo Bellazzi, a professor at the University of Pavia, a world-renowned researcher in biomedical informatics, and most of all, a great friend, has invited us to run the elective course for Pavia’s grad students. The enrollment was, he says, overwhelming: with over 50 students this is by far the best attended grad course at Pavia’s faculty of engineering in recent years.

We have opted for a hands-on course and are running it as a workshop. The lectures use a new, development version of Orange 3 and mix it with numpy, scikit-learn, matplotlib, networkx and a bunch of other libraries. Course themes are classification, clustering, data projection and network analysis.

# Towards Orange 3

We are rushing, full speed ahead, towards Orange 3. A complete revamp of Orange in Python 3 changes its data model to that of numpy, making Orange compatible with an array of Python-based data analytics. We are rewriting all the widgets for visual programming as well. We have two open fronts: the scripting part, and the widget part. So much to do, but it is going well: the closed tasks for widgets are those on the left of Anze (the board full of sticky notes), and those open, in minority, are on Anze’s right. Oh, by the way, it’s Anze who is managing the work and he looks quite happy.

By popular demand, we have just published a tutorial on how to load a data table into Orange. Besides its own .tab format, Orange can load any tab- or comma-delimited data set. The trick, though, is in writing the header rows that tell Orange about the type and domain of each attribute. The tutorial is a step-by-step description of how to do this and how to transfer the data from popular spreadsheet programs like Excel.
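As a rough illustration of such header rows, the sketch below writes a small tab-delimited file using only Python's standard library. It assumes the three-row header layout of Orange's .tab format (names, then types, then optional flags such as `class` or `meta`); the column names and values are made up, and the tutorial remains the reference for the exact keywords Orange accepts.

```python
import csv
import os
import tempfile

rows = [
    ["sepal_length", "species", "sample_id"],   # row 1: attribute names
    ["continuous", "discrete", "string"],       # row 2: attribute types (assumed keywords)
    ["", "class", "meta"],                      # row 3: flags; empty for an ordinary attribute
    ["5.1", "setosa", "s1"],                    # data rows follow
    ["6.3", "virginica", "s2"],
]

# Write the table as a tab-delimited file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.tab")
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)

# Read the three header rows back to check what was written.
with open(path, newline="") as f:
    header = [next(f).rstrip("\n").split("\t") for _ in range(3)]

for line in header:
    print(line)
```

A file written this way could then be opened in Orange like any other data file.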

# Hands-on Orange at Functional Genomics Workshop

Last week we co-organized a Functional Genomics Workshop. At the University of Ljubljana we hosted an inspiring pack of scientists from the Donnelly Centre for Cellular and Biomolecular Research in Toronto. Part of the event was a hands-on workshop, Data mining without programming, where we used Orange to analyze data from systems biology. Data included a subset of Charlie Boone’s famous yeast interaction data and data from chemical genomics. For the program, info about the speakers, and pancakes and šmorn, check out the workshop’s newspaper.

It is always a pleasure to see a packed lecture room with all laptops running Orange. Attendees were assisted by members of the Biolab in Ljubljana. The hands-on program followed a set of short lectures we had crafted for the intended audience, biologists. Everything ran smoothly. At the end, we got excited enough to promise a data import wizard for all those who have problems annotating the data with feature type tags. The deadline: two weeks from the end of the workshop.

# Brief History of Orange, Praise to Donald Michie

Informatica has recently published our paper on the history of Orange. The paper is a post-publication from the Conference on 100 Years of Alan Turing and 20 Years of Slovene AI Society, where Janez Demšar gave a talk on the topic.

The history of Orange goes all the way back to 1997, when the late Donald Michie had the idea that machine learning needs an open toolbox. To spark the development, we co-organized WebLab97 at beautiful Bled, Slovenia. The workshop’s name reflected Michie’s idea that the tool should be a web application where people can submit data mining code, procedures, testing scripts, and data and share them in a joint web workspace.

Donald Michie, a pioneer of Artificial Intelligence, was always ahead of his time. (Check out a great talk by Ivan Bratko on their friendship and adventures in chess and machine learning.) At WebLab97, Michie was actually very, very far ahead of his time. Despite the presence of IBM’s Java team, which could have guided us in developing the toolbox, the technology was not ripe, and the WebLab initiative faded as the conference ended. But the idea sparked the interest of Janez and myself, and the development of what is now Orange began shortly after.

Our paper gives a brief account of Orange’s history and its development since WebLab97. For reasons of brevity it does not mention that prior to Qt we experimented with other GUI platforms. Before Qt, we pinned our hopes on Pmw (Python megawidgets), a library that helped us construct the first Orange graphical user interface. The GUI part of Orange was called Orange*First. Its screenshot shows a tab for interactive discretisation, thanks to Noriaki Aoki, who at the time proposed that this kind of visualisation would be useful in medical data analysis:

PS: Somehow, I have lost the LaTeX file with the WebLab97 program. It should be on some backup tape, somewhere. The following scan of the first page (and a weblab97.pdf), left in some PPT presentation, is all that I can retrieve. The program of the second day is missing, with keynotes from Tom Mitchell, and much talk about R, by then already a success story.

# JMLR Publishes Article on Orange

Journal of Machine Learning Research has just published our paper on Orange. In the paper we focus on its Python scripting part. We have last reported on Orange scripting at ECML/PKDD 2004. The manuscript was well received (over 270 citations on Google Scholar), but it is now entirely outdated. This was also our only formal publication on Orange scripting. With publication in JMLR this is now a current description of Orange and will be, for a while :-), Orange’s primary reference.

Here’s a reference:

Demšar, J., Curk, T., & Erjavec, A. et al. Orange: Data Mining Toolbox in Python; Journal of Machine Learning Research 14(Aug):2349−2353, 2013.

and bibtex entry:

```bibtex
@article{JMLR:demsar13a,
  author  = {Janez Dem\v{s}ar and Toma\v{z} Curk and Ale\v{s} Erjavec and
             \v{C}rt Gorup and Toma\v{z} Ho\v{c}evar and Mitar Milutinovi\v{c} and
             Martin Mo\v{z}ina and Matija Polajnar and Marko Toplak and
             An\v{z}e Stari\v{c} and Miha \v{S}tajdohar and Lan Umek and
             Lan \v{Z}agar and Jure \v{Z}bontar and Marinka \v{Z}itnik and
             Bla\v{z} Zupan},
  title   = {Orange: Data Mining Toolbox in Python},
  journal = {Journal of Machine Learning Research},
  year    = {2013},
  volume  = {14},
  pages   = {2349-2353},
  url     = {http://jmlr.org/papers/v14/demsar13a.html}
}
```

# Orange 2.7

Orange 2.7 is out with a major update in the visual programming environment. Redesigned interface, new widgets, welcome screen with workflow browser. Text annotation and arrow lines in workspace. Preloaded workflows with annotations. Widget menu and search can now be activated through key press (open the Settings to make this option available). Extended or minimised widget tab. Improved widget browsing. Enjoy!