Classification Trees are great, but how about when they overgrow even your 27” screen? Can we make the tree fit snugly onto the screen and still tell the whole story? Well, yes we can.
Pythagorean Tree widget will show you the same information as Classification Tree, but way more concisely. Pythagorean Trees represent nodes with squares whose size is proportionate to the number of covered training instances. Once the data is split into two subsets, the corresponding new squares form a right triangle on top of the parent square. Hence Pythagorean Tree. Every square has the color of the prevalent, with opacity indicating the relative proportion of the majority class in the subset. Details are shown in hover balloons.
When you hover over a square in Pythagorean Tree, a whole line of parent and child squares/nodes is highlighted. Clicking on a square/node outputs the selected subset, just like in Classification Tree.
Another amazing addition to Orange’s Visualization set is Pythagorean Forest, which is a visualization of Random Forest algorithm. Random Forest takes N samples from a data set with N instances, but with replacement. Then a tree is grown for each sample, which alleviates the Classification Tree’s tendency to overfit the data. Pythagorean Forest is a concise visualization of Random Forest, with each Pythagorean Tree plotted side by side.
This makes Pythagorean Forest a great tool to explain how Random Forest works or to further explore each tree in Pythagorean Tree widget.
Pythagorean trees are a new addition to Orange. Their implementation has been inspired by a recent paper on Generalized Pythagoras Trees for Visualizing Hierarchies by Fabian Beck, Michael Burch, Tanja Munz, Lorenzo Di Silvestro and Daniel Weiskopf that was presented in at the 5th International Conference on Information Visualization Theory and Applications in 2014.
Visualizing relations between data instances can tell us a lot about our data. Let’s see how this works in Orange. We have a data set on machine learning and data mining conferences and journals, with the number of shared authors for each publication venue reported. We can estimate similarity between two conferences using the author profile of a conference: two conference would be similar if they attract the same authors. The data set is already 9 years old, but obviously, it’s about the principle. 🙂 We’ve got two data files: one is a distance file with distance scores already calculated by Jaccard index and the other is a standard conferences.tab file.
We load .tab file with the File widget (data set already comes with Orange) and .dst file with the Distance File widget (select ‘Browse documentation data sets’ and choose conferences.dst).
Now we would like to create a graph from the distance file. Connect Distance File to Network from Distances. In the widget, we’ve selected a high distance threshold, because we would like to get more connections between nodes. We’ve also checked ‘Include also closest neighbors’ to see each node connected with at least one other node.
We can visualize our graph in Network Explorer. What we get is a quite uninformative network of conferences with labelled nodes. Now for the fun part. Connect the File widget with Network Explorer and set the link type to ‘Node Data’. This will match the two domains and display additional labelling options in Network Explorer.
We’ve colored the nodes by type and set the size of the nodes to the number of authors per conference/paper. Finally, we’ve set the node label to ‘name’. Seems like International Conference on AI and Law and AI and Law journal are connected through the number of shared authors. Same goes for AI in Medicine in Europe conference and AI and Medicine journal. Connections indeed make sense.
There are many other things you can do with the Networks add-on in Orange. You can color nodes by predictions, highlight misclassifications or output only nodes with certain network parameters. But for today, let this be it.
Google Summer of Code is progressing nicely and some major improvements are already live! Our students have been working hard and today we’re thanking Alexey for his work on Text Mining add-on. Two major tasks before the midterms were to introduce Twitter widget and rehaul Preprocess Text. Twitter widget was designed to be a part of our summer school program and it worked beautifully. We’ve introduced youngsters to the world of data mining through social networks and one of the most exciting things was to see whether we can predict the author from the tweet content.
Twitter widget offers many functionalities. Since we wanted to get tweets from specific authors, we entered their Twitter handles as queries and set ‘Search by Author’. We only included Author, Content and Date in the query parameters, as we want to predict the author only on the basis of text.
Provide API keys.
Insert queries separated by newline.
Search by content, author or both.
Set date (1 week limit from tweepy module).
Select language you want your tweets to be in.
If ‘Max tweets’ is checked, you can set the maximum number of tweets you want to query. Otherwise the widget will provide all tweets matching the query.
If ‘Accumulate results’ is checked, new queries will be appended to the old ones.
Select what kind of data you want to retrieve.
Press ‘Search’ to start your query.
We got 208 tweets on the output. Not bad. Now we need to preprocess them first, before we do any predictions. We transformed all the words into lowercase and split (tokenized) them by words. We didn’t use any normalization (below turned on just as an example) and applied a simple stopword removal.
Information on the input and output.
Transformation applies basic modifications of text.
Tokenization splits the corpus into tokens according to the selected method (regexp is set to extract only words by default).
Normalization lemmatizes words (do, did, done –> do).
Filtering extracts only desired tokens (without stopwords, including only specified words, or by frequency).
Then we passed the tokens through a Bag of Words and observed the results in a Word Cloud.
Then we simply connected Bag of Words to Test & Score and used several classifiers to see which one works best. We used Classification Tree and Nearest Neighbors since they are easy to explain even to teenagers. Especially Classification Tree offers a nice visualization in Classification Tree Viewer that makes the idea of the algorithm easy to understand. Moreover we could observe the most distinctive words in the tree.
Do these make any sense? You be the judge. 🙂
We checked classification results in Test&Score, counted misclassifications in Confusion Matrix and finally observed them in Corpus Viewer. k-NN seems to perform moderately well, while Classification Tree fails miserably. Still, this was trained on barely 200 tweets. Perhaps accumulating results over time might give us much better results. You can now certainly try it on your own! Update your Orange3-Text add-on or install it via ‘pip install Orange3-Text’!
Above is the final workflow. Preprocessing on the left. Testing and scoring on the right bottom. Construction of classification tree right and above.