Every Friday, when the core team of Orange developers meets, we design new improvements to Orange’s graphical interface. This time it was the status bar. Well, actually, the status bar came up quite a while ago and required changes to the core widget library, but it is materializing these days and you will see the changes in the next release.
Consider the Neighbors widget. The widget takes the input data and the reference data items, and outputs the instances from the input data that are most similar to the references. For example, if the dolphin is a reference, we would like to know which three animals are the most similar to it. But this is not what I wanted to write about. I would only like to say that we are making a slight change in the user interface. Below is the Neighbors widget in the current release of Orange, and in the upcoming one.
See the difference? We are getting rid of the infobox at the top of the control area and moving it to the status bar. In the infobox, widgets typically display what is on their input and what is on their output after the data has been processed. Moving this information to the status bar makes widgets more compact and less cluttered. We will change the infoboxes in this way in all of the widgets.
In the past twenty years, molecular biologists have invented technologies that can collect abundant experimental data. One such technique is single-cell RNA-seq, which, put very simply, can measure the activity of genes in potentially large collections of cells. The interpretation of such data can tell us about the heterogeneity of cells, reveal cell types, or provide information on their development.
Typical analysis toolboxes for single-cell data are available in R and Python and, most notably, include Seurat and scanpy, but they lack the interactive visualizations and simplicity of Orange. Since the fall of 2017, we have been developing an extension of Orange, which is now (almost) ready. It has even been packaged into its own installer. The first real test of the software came in early 2018, through a one-day workshop at the Janelia Research Campus. On March 6, with a much more refined version of the software, we repeated the hands-on workshop at the University of Pavia.
The five-hour workshop covered both the essentials of data analysis and single cell analytics. The topics included data preprocessing, clustering, and two-dimensional embedding, as well as working with marker genes, differential expression analysis, and interpretation of clusters through gene ontology analysis.
I want to thank Prof. Dr. Riccardo Bellazzi and his team for the organization, and the Erasmus program for financial support. I have been a frequent guest in Pavia, and I learn something new every time I am there. Besides meeting the new students and colleagues who attended the workshop and hearing about their data analysis challenges, this time I also learned about a dish I had never had before in all my Italian travels. For one of the dinners (thank you, Michela) we had Pizzoccheri. Simply great!
Test & Score is surely one of the most used widgets in Orange. Fun fact: it is the fourth most popular, right after Data Table, File and Scatter Plot. So let us dive into the nuts and bolts of the Test & Score widget.
The widget generally accepts two inputs – Data and Learner. Data is the data set that we will be using for modeling, say iris.tab, which is already pre-loaded in the File widget. Learner is any kind of learning algorithm, for example Logistic Regression. You can only use learners that support your type of task: if you wish to do classification, you cannot use Linear Regression, and for regression you cannot use Logistic Regression. Most other learners support both tasks. You can connect more than one learner to Test & Score.
Test & Score will now use each connected Learner and the Data to build a predictive model. Models can be built in different ways. The most typical procedure is cross validation, which splits the data into k folds and uses k – 1 folds for training and the remaining fold for testing. This procedure is repeated so that each fold is used for testing exactly once. Test & Score then reports the average accuracy of the models.
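If you prefer scripting, roughly the same evaluation can be run from Python. A minimal sketch, assuming the Orange 3 scripting API of this era (newer releases construct the evaluator first, e.g. Orange.evaluation.CrossValidation(k=10)(data, [logreg])):

import Orange

data = Orange.data.Table("iris")
logreg = Orange.classification.LogisticRegressionLearner()
# 10-fold cross validation, as in the Test & Score widget
results = Orange.evaluation.CrossValidation(data, [logreg], k=10)
print("Classification accuracy:", Orange.evaluation.CA(results))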
You can also use Random Sampling, which will split the data into two sets with predefined proportions (e.g. 66% : 34%), build a model on the first set and test it on the second. This is similar to CV, except that each data instance can be used more than once for testing.
Leave one out is again very similar to the above two methods, but it takes only one data instance for testing each time. If you have 100 data instances, then 99 will be used for training and 1 for testing, and the procedure will be repeated 100 times until every data instance has been used exactly once for testing. As you can imagine, this is a very time-intensive procedure, and it is recommended for smaller data sets only.
Test on train data uses the whole data set for training and again the same data for testing. Because of overfitting, this will usually overestimate the performance! Test on test data requires an additional data input (Test Data) and allows the user to control both data sets (training and testing) used for evaluation.
Finally, you can also use cross validation by feature. Sometimes you have pre-defined folds for a procedure that you wish to replicate. Then you can use Cross validation by feature to ensure data instances are split into the same folds every time. Just make sure the feature you are using to define the folds is a categorical variable and is located among the meta attributes.
Another scenario is when you have several examples from the same object, for example several measurements of the same patient or several images of the same plant. Then you absolutely want to make sure that all data instances for a particular object are in the same fold. Otherwise, your model would probably report severely inflated (overfitted) scores.
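Orange handles this through the Cross validation by feature option; for readers who script, a rough scikit-learn analogue that keeps all measurements of one patient in the same fold is GroupKFold. The data below is made up purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))            # 12 measurements, 3 features
y = np.array([0, 1] * 6)                # class labels
patients = np.repeat([1, 2, 3, 4], 3)   # three measurements per patient

# all rows belonging to one patient always end up in the same fold
scores = cross_val_score(LogisticRegression(), X, y,
                         groups=patients, cv=GroupKFold(n_splits=4))
print(scores)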
In a parallel universe, not so far from ours, Orange’s Correlation widget looks like this.
Quite similar to ours, except that this one shows p-values instead of correlation coefficients. Which is actually better, isn’t it? I mean, we have all attended Statistics 101, and we know that you can never trust correlation coefficients without looking at p-values to check that these correlations are real, right? So why on Earth doesn’t Orange show them?
First a side note. It was Christmas not long ago. Let’s call a ceasefire on the frequentist vs. Bayesian war. Let us, for Christ’s sake, pretend, pardon, agree that null-hypothesis testing is not wrong per se.
The mantra of null-hypothesis significance testing goes like this:
1. Form hypothesis.
2. Collect data.
3. Test hypothesis.
In contrast, the parallel-universe Correlation widget is (ab)used like this:
1. Collect data.
2. Test all possible hypotheses.
3. Cherry pick those that are confirmed.
This is like the Texas sharpshooter who fires first and then draws targets around the shots. You should never formulate a hypothesis based on some data and then use the same data to prove it. Because it usually (surprise!) works.
Back to the above snapshot. It shows correlations between 100 vegetables based on 100 different measurements (Ca and Mg content, their consumption in Finland, the number of mentions in the Star Trek DS9 series, the likelihood of finding them on Mars, and so forth). In other words, it’s all made up: just a 100×100 matrix of random numbers with column labels taken from the simple Wikipedia list of vegetables. Yet the similarity between mung beans and sunchokes surely cannot be dismissed (p < 0.001). Those who like bell pepper should try cilantro, too, because they are basically one and the same thing (p = 0.001). And I honestly can’t tell black beans from wasabi (p = 0.001).
Here are the p-values for the top 100 most correlated pairs.
import numpy as np
from scipy import stats
a = np.random.random((100, 100))
# p-values for each pair of columns, smallest (most significant) first
sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))[:100]
[0.0002774329730584203, 0.0004158786523819104, 0.0005008536192579852,
0.0007211022164265075, 0.0008268675086438253, 0.0010265740674904762,
(...91 values omitted to reduce the nonsense)
0.01844720610938738, 0.018465602922746942, 0.018662079618069056]
All of the first 100 correlations are highly significant.
To learn the lesson we may have failed to grasp in the NHST 101 class, consider that there are 100 × 99 / 2 = 4950 pairs. What is the significance of the pair at the 5th percentile?
pvalues = sorted(stats.pearsonr(a[i], a[j])[1] for i in range(100) for j in range(i))
npairs = 100 * 99 // 2
print(pvalues[int(npairs * 0.05)])
Roughly 0.05, which is exactly what should have happened.
This proves only that the p-values for the Pearson correlation coefficient are well calibrated (and that the Mersenne twister used to generate random numbers in numpy works well). In theory, the p-value for a certain statistic (like Pearson’s r) is the probability of getting such a value, or a more extreme one, if the null hypothesis (of no correlation, in our case) is actually true. Consequently, 5 % of random hypotheses should have a p-value below 0.05, 10 % below 0.1, and 23 % below 0.23.
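Continuing the snippet above, we can check this calibration directly on our random matrix:

for threshold in (0.05, 0.10, 0.23):
    print(threshold, np.mean([p < threshold for p in pvalues]))

The printed fractions come out very close to the thresholds themselves.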
Imagine what they can do with the Correlations widget in the parallel universe! They compute correlations between all pairs, print out the top 5 % of them and start writing a paper without bothering to look at the p-values at all. They know these will be statistically significant even if the data is random.
Which is precisely the reason why our widget must not compute p-values: people would use it for Texas sharpshooting. P-values make sense only in the context of the proper NHST procedure (still pretending, for the sake of the Christmas ceasefire). They cannot be computed from the same data on which the hypotheses were found.
If so, why do we have the Correlations widget at all, if its results are unpublishable? We can use it to find highly correlated pairs in a data sample. But we can’t just attach p-values to them and publish them. By finding these pairs (with the assistance of the Correlations widget) we merely formulate hypotheses. This is only step 1 of the enshrined NHST procedure. We can’t skip the other two: the next step is to collect some new data (the existing data won’t do!) and then use it to test the hypotheses (step 3).
Following this procedure doesn’t save us from data dredging. There are still plenty of ways to cheat. The most tempting is to select the 100 most correlated pairs (or, actually, any 100 pairs), (re)compute the correlations on some new data and publish the top 5 % of these pairs. The official solution for this is a patchwork of various corrections for multiple hypothesis testing, but… well, they don’t work, and we should say no more here. You know, Christmas ceasefire.
Scatter plots are surely one of the best-loved visualizations in Orange. Very often, when we teach, people go back to scatter plots over and over again to look at their data. We took people’s love for scatter plots to heart and redesigned the widget a bit to make it even more friendly.
Our favorite remains the Informative Projections button. This button helps you find interesting visualizations among all the combinations of your data variables. But what does interesting mean? Well, let us look at an example. Which of the two visualizations tells you more about the data?
We’d say it is the right one. Why? Because now we know that the combination of petal length and petal width nicely separates the classes!
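How does Orange decide what is informative? As a rough illustration of the idea (not Orange’s actual implementation), one could score every pair of features by how well a k-nearest-neighbors classifier separates the classes using only those two features:

from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
scores = []
for i, j in combinations(range(iris.data.shape[1]), 2):
    acc = cross_val_score(KNeighborsClassifier(),
                          iris.data[:, [i, j]], iris.target, cv=5).mean()
    scores.append((acc, iris.feature_names[i], iris.feature_names[j]))

# the best-scoring pair is the most informative projection
for acc, f1, f2 in sorted(scores, reverse=True):
    print("%.3f  %s / %s" % (acc, f1, f2))

For iris, the petal length / petal width pair indeed comes out on top.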
Of course, the Informative Projections button will only work when you have set a class (target) variable.
In the scatter plot, you can also set the color of the data points (the class is selected by default), as well as their size and shape. This means you can add three new layers of information to your data, but we warn you not to overuse them. The result is usually rather incomprehensible, even though it packs a lot of information.
You might notice that in the current version of Orange, you can no longer select discrete attributes in the Scatter Plot. This is entirely intentional. Scatter plots are best at showing the relationship between two numeric variables, such as in the two examples above. Categorical variables are much better represented with Box Plot, histograms (in Distributions) or Mosaic Display.
Above, we have presented the same information for the titanic data set in different visualizations that are particularly suitable for categorical variables.
Scatter Plot also enables some cool tricks. Just like in most visualizations in Orange, I can select a part of the data and observe the subset downstream. Or the other way around: I have a particular subset I wish to observe, and I can pass it to the Scatter Plot widget, which will highlight the selected data instances.
This is also true for all other point-based visualizations in Orange, namely t-SNE, MDS, Radviz, Freeviz, and Linear Projection.
As you can see, there are many great things you can do with Scatter Plot. Finally, we have added a nice touch to the visualization.
Yes, setting the point size by an attribute is now animated! 🙂
In the past few months, Orange has been getting smarter and sleeker.
Since version 3.15.0, Orange remembers which widgets users like to connect and adjusts the sorting in the widget search menu accordingly. Additionally, there is a new look for the Edit Links window coming soon.
Orange recently implemented a basic form of opt-in usage tracking, specifically targeting how users add widgets to the canvas.
The information is collected anonymously for the users that opted-in. We will use this data to improve the widget suggestion system. Furthermore, the data provides us the first insight into how users interact with Orange. Let’s see what we’ve found out from the data recorded in the past few weeks.
There are four different ways of adding a widget to the canvas:
clicking it in the sidebar,
dragging it from the sidebar,
searching for it by right-clicking on canvas,
extending the workflow by dragging the communication channel from a widget.
Among Orange users, the most popular way of adding a new widget is by dragging the communication line from the output widget – we think this is the most efficient way of using Orange too. However, the patterns vary among different widgets.
Users tend to add root nodes such as File via a click or drag from the sidebar, while adding leaf nodes such as Data Table via extension from another widget.
The widget popularity contest goes to: Data Table! Rightfully so, one should always check their data with Data Table.
52% of tracked sessions consisted of no widgets being added (the application was just opened and closed). While some people might really like watching the loading screen, most of these are likely due to the fact that usage is not tracked until the user explicitly opts in.
Each bit of collected data comes at a cost to the privacy of the user. Care was put into minimizing the intrusiveness of data collection methods, while maximizing the usefulness of the collected data.
Initially, widget addition events were planned to include a ‘time since application start’ value, in order to be able to plot a user’s actions as a function of time. While this would be cool, it was ultimately decided that its usefulness is outweighed by the privacy cost to users.
For the keen, data is gathered per canvas session, in the following structure:
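The exact schema is not reproduced here; purely as an illustration, a single session record could look roughly like the sketch below (all field names are hypothetical).

# hypothetical sketch only – the actual schema collected by Orange may differ
session = {
    "orange_version": "3.15.0",
    "widget_additions": [
        {"widget": "File", "how": "sidebar_click"},
        {"widget": "Data Table", "how": "extend_from_output"},
    ],
}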
This weekend we were in Lisbon, Portugal, at the Why the World Needs Anthropologists conference, an event that focuses on applied anthropology, design, and how soft skills can greatly benefit the industry. I was there to hold a workshop on Data Ethnography, an approach that tries to combine methods from data science and anthropology into a fruitful interdisciplinary mix!
Data ethnography is a novel methodological approach that tries to view social phenomena from two different points of view – qualitative and quantitative. The quantitative part uses data mining and machine learning methods on anthropological data (say, from sensors, wearables, social media, online fora, field notes and so on), trying to find interesting patterns and novel information. The qualitative part uses ethnography to substantiate the analytical findings with context, motivations, values, and other external data, providing a complete account of the studied phenomenon.
At the workshop, I presented a couple of approaches I use in my own research, namely text mining, clustering, visualization of patterns, image analytics, and predictive modeling. Data ethnography can be used, not only in its native field of computational anthropology, but also in museology, digital anthropology, medical anthropology, and folkloristics (the list is probably not exhaustive). There are so many options just waiting for the researchers to dig in!
However, having data- and tech-savvy anthropologists does not only benefit the research; it also opens a platform for discussing the ethics of data science, human relationships with technology, and ways of overcoming model bias. Hopefully, the workshop inspired some of the participants to join me on a journey through the amazing expanses of data science.
In the past couple of weeks we have been working hard on introducing better language support in the Text add-on. Until recently, Orange supported only a limited number of languages, mostly English and a few larger languages such as Spanish, German, Arabic, Russian… Language support was most evident in the list of stopwords, in normalization and in POS tagging.
Stopwords come from the NLTK library, so we can only offer whatever is available there. However, TF-IDF already implicitly downweights stopwords, so that functionality is largely covered in any case. For POS tagging, we rely on the Stanford POS tagger, which already has pre-trained models available.
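For instance, the languages with available stopword lists can be checked directly in NLTK (assuming the stopwords corpus has been downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                 # fetch the corpus if it is missing
print(stopwords.fileids())                 # languages with stopword lists
print(stopwords.words("spanish")[:10])     # a few Spanish stopwords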
The main issue was normalization. While English can do without lemmatization and stemming for simple tasks, morphologically rich languages such as Slovenian perform poorly on un-normalized tokens. Cases and declensions present a problem for natural language processing, so we wanted to provide a normalization tool for many different languages. Luckily, we found UDPipe, a Czech initiative that offers trained lemmatization models for 50 languages. UDPipe is actually a whole preprocessing pipeline, and we are already thinking about how to bring all of its functionality to Orange, but let us talk a bit about the recent improvements to normalization.
Let us load a simple corpus in the Corpus widget, say grimm-tales-selected.tab, which contains 44 tales from the Grimm Brothers. Now pass it through Preprocess Text and keep just the defaults, namely lowercase transformation, tokenization by words, and removal of stopwords. Here we see came as quite a frequent word and come as a bit less frequent one. But semantically, they are the same word, forms of the verb to come. Shouldn’t we count them as one word?
We can. This is what normalization does – it transforms all words into their lemmas, that is, their basic grammatical form. Came and come will become come, sons and son will become son, pretty and prettier will become pretty. This results in fewer tokens that capture the text better, semantically speaking.
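In Orange this is done with UDPipe models (more on that below); purely to illustrate the idea, here is the same kind of transformation for English with NLTK’s WordNet lemmatizer (it needs the wordnet corpus downloaded):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("came", pos="v"))   # come
print(lemmatizer.lemmatize("went", pos="v"))   # go
print(lemmatizer.lemmatize("sons"))            # son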
We can see that came became come with 435 counts. Went became go. Said became say. And so on. As we said, this doesn’t work only on verbs, but on all word forms.
One thing to note here: UDPipe has an internal tokenizer that works on whole sentences instead of individual tokens. You can enable it by selecting the UDPipe tokenizer option. What is the difference? The quicker approach is to tokenize all the words and simply look up the lemma of each token. But sometimes this can be wrong. Consider the sentence:
I am wearing a tie to work.
Here the word tie is obviously a piece of clothing, which is indicated by the word wearing before it. But tie alone can also be the verb to tie. The UDPipe tokenizer considers the entire sentence and lemmatizes this word correctly, while lemmatization of isolated tokens might not. While the UDPipe tokenizer works better, it is also slower, so you might want to stick with regular tokenization to speed up the analysis.
Finally, UDPipe does not remove punctuation, so you might end up with words like rose. and away., with the full stop at the end. You can fix this by using regular tokenization and selecting the Regex option in Filtering, which removes tokens that are pure punctuation.
This is it. UDPipe contains lemmatization models for 50 languages, and the resource is loaded only when you select a particular language in the Language option, so your computer won’t be flooded with models for languages you will never use. The installation of UDPipe can also be a little tricky, but after some initial obstacles, we have managed to prepare packages for both pip (OSX and Linux) and conda (Windows).
We hope you enjoy the new possibilities of a freshly multilingual Orange!
Did you know that Orange has already been to space? Rosario Brunetto (IAS-Orsay, France) has been working on the analysis of infrared images of the asteroid Ryugu as a member of the JAXA Hayabusa2 team. The Hayabusa2 asteroid sample-return mission aims to retrieve data and samples from the near-Earth asteroid Ryugu and analyze its composition. Hayabusa2 arrived at Ryugu on June 27 and, while the spacecraft will return to Earth with a sample only in late 2020, the mission has already started collecting and sending back data. And of course, part of the analysis of Hayabusa2’s data has been done in Orange!
Dr. Brunetto is currently working with the first part of the data, namely the macro hyperspectral images of the asteroid. Several tens of thousands of spectra over 70 spectral channels have already been acquired. The main goal of this initial exploration was to constrain the surface composition.
Once the data was preprocessed and cleaned in Python, separate surface regions were extracted in Orange with k-Means and PCA and plotted with the HyperSpectra widget, which comes as part of the Spectroscopy add-on. So why was Orange chosen over other tools? Dr. Brunetto says Orange is an easy and friendly tool for complicated things, such as exploring the compositional diversity of the asteroid at different scales. There are many clustering techniques he can use in Orange, and he likes that he can interactively change the number of clusters and immediately see the changes in the plot. This lets the researchers choose the level of granularity of the analysis, while they can also immediately inspect what each cluster looks like in a hyperspectra plot.
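For readers who script, a very rough analogue of such a workflow might look like the sketch below, with random numbers standing in for real spectra; the actual analysis was, of course, done interactively in Orange’s widgets.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.random((2000, 70))   # stand-in for thousands of spectra over 70 channels

components = PCA(n_components=10).fit_transform(spectra)            # reduce dimensionality
labels = KMeans(n_clusters=5, n_init=10).fit_predict(components)    # group into regions

# average spectrum of each cluster, similar to what the HyperSpectra widget plots
cluster_means = np.array([spectra[labels == k].mean(axis=0) for k in range(5)])
print(cluster_means.shape)         # (5, 70)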
Moreover, one can quickly test methods and visualize the effects and at the same time have a good overview of the workflow. Workflows can also be reused once the new data comes in or, if the pipeline is standard, used on a completely different data set!
We would of course love to show you the results of the asteroid analysis, but as the project is still ongoing, the data is not yet available to the public. Instead, we asked Zélia Dionnet, Dr. Brunetto’s PhD student, to share the results of her work on the organic and mineralogic heterogeneity of the Paris meteorite, which have already been published.
She analyzed the composition of the Paris meteorite, which was discovered in 2008 in a statue. The story of how the meteorite was found is quite interesting in itself, but we wanted to know more about how the sample was analyzed in Orange. Dionnet had a slightly larger data set, with 16,000 spectra and 1,600 wavenumbers. Just like Dr. Brunetto, she used k-Means to discover interesting regions in the sample and the HyperSpectra widget to plot the results.
At the top, you can see a 2D map of the meteorite sample showing the distribution of the clusters that were identified with k-Means. At the bottom, you see cluster averages for the spectra. The green region is the most interesting one and it shows crystalline minerals, which formed billions of years ago as the hydrothermal processes in the asteroid parent body of the meteorite turned amorphous silicates into phyllosilicates. The purple, on the contrary, shows different micro-sized minerals.
This is how to easily identify the compositional structure of samples with just a couple of widgets. Orange seems to love going to space and can’t wait to get its hands dirty with more astro-data!
In the past month, we held two workshops focused on text mining. The first one, Faksi v praksi, was organized by the University of Ljubljana Career Centers, where high school students learned about what we do at the Faculty of Computer and Information Science. We taught them what text mining is and how to group a collection of documents in Orange. The second one took a more serious note, as public sector employees joined us for the third set of workshops for the Ministry of Public Affairs. This time, we did not only cluster documents, but also built predictive models, explored predictions in a nomogram, plotted documents on a map and discovered how to find the emotion in a tweet.
These workshops gave us a lot of incentive to improve the Text add-on. We really wanted to support more languages and add extra functionality to the widgets. In the upcoming week, we will release version 0.5.0, which introduces support for Slovenian in the Sentiment Analysis widget, adds a concordance output option to Concordances and, most importantly, implements UDPipe lemmatization, which means Orange will now support about 50 languages! Well, at least for normalization. 😇
Today, we will briefly introduce sentiment analysis for Slovenian. We have added the KKS 1.001 opinion corpus of Slovene web commentaries, which is part of the CLARIN infrastructure. You can access it in the Corpus widget: go to Browse documentation corpora and look for slo-opinion-corpus.tab. Let’s have a quick look in the Corpus Viewer.
The data comes from the comment sections of Slovenian online media and contains fairly expressive language. Let us observe whether a post is negative or positive. We will use the Sentiment Analysis widget and select the Liu Hu method for Slovenian. This is a dictionary-based method: the algorithm counts the positive words in a post and subtracts the count of negative words, which gives the final score of the post.
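As a toy sketch of how such a lexicon score is computed (the tiny word lists below are made up for illustration; Orange uses the actual Liu Hu lexicon and its Slovenian adaptation):

positive = {"dober", "super", "bravo"}        # illustrative positive words
negative = {"slab", "grozno", "katastrofa"}   # illustrative negative words

def lexicon_score(post):
    # count positive words, subtract the count of negative words
    tokens = [t.strip(".,!?") for t in post.lower().split()]
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(lexicon_score("Bravo, to je res super novica!"))   # 2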
We will have to adjust the attributes for a nicer view in a Select Columns widget. Remove all attributes other than sentiment.
Finally, we can observe the results in a Heat Map. The blue lines are the negative posts, while the yellow ones are positive. Let us select the most positive posts and see what they are about.
Looks like Slovenians are happy when petrol gets cheaper and sports(wo)men are winning. We can relate.
Of course, lexicon-based methods have some drawbacks. Namely, they don’t work well with phrases, they often don’t handle modern language (see ‘Jupiiiiiii’ or ‘Hooooooraaaaay!’, where the more letters a word has, the more expressive it is), and they fail with sarcasm. Nevertheless, even such crude methods give us a nice glimpse into the corpus and enable us to extract interesting documents.
Stay tuned for the information on the release date and the upcoming post on UDPipe infrastructure!