Pythagorean Trees and Forests

Classification Trees are great, but what happens when they outgrow even your 27” screen? Can we make the tree fit snugly onto the screen and still tell the whole story? Well, yes we can.

The Pythagorean Tree widget shows you the same information as Classification Tree, but far more concisely. Pythagorean Trees represent nodes with squares whose size is proportional to the number of covered training instances. When the data is split into two subsets, the two new squares form a right triangle on top of the parent square; hence the name Pythagorean Tree. Every square is colored by the prevalent class, with opacity indicating the relative proportion of the majority class in the subset. Details are shown in hover balloons.
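The geometry follows directly from the Pythagorean theorem: if the child squares sit on the legs of a right triangle built on the parent square's side, their areas must sum to the parent's area, so each child's area can encode its share of the training instances. A minimal sketch of that idea (not Orange's actual implementation):

```python
import math

def child_sides(parent_side, n_left, n_right):
    """Side lengths of the two child squares in a Pythagoras tree.

    The children rest on the legs of a right triangle erected on the
    parent square, so left**2 + right**2 == parent_side**2, and each
    child's area is proportional to its share of the instances.
    """
    n = n_left + n_right
    left = parent_side * math.sqrt(n_left / n)
    right = parent_side * math.sqrt(n_right / n)
    return left, right

# Example: a root square of side 10 whose 100 instances split 64/36.
left, right = child_sides(10, 64, 36)
print(left, right)  # 8.0 6.0 -- a classic 6-8-10 right triangle
```

The square-root scaling is what keeps areas, rather than side lengths, proportional to subset sizes.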

ClassificationTree
Classification Tree with titanic.tab data set.

 

PythagoreanTree
Pythagorean Tree with titanic.tab data set.

 

When you hover over a square in Pythagorean Tree, the whole lineage of parent and child squares/nodes is highlighted. Clicking a square/node outputs the selected subset, just like in Classification Tree.

PythagoreanTree2
Hovering over a square in the tree highlights its lineage (parent and child nodes) and displays information on the subset the square represents. The widget outputs the selected subset.

 

Another amazing addition to Orange’s visualization set is Pythagorean Forest, a visualization of the Random Forest algorithm. For each tree, Random Forest draws a bootstrap sample: N instances taken from a data set of N instances, but with replacement. A tree is then grown on each sample, which alleviates the Classification Tree’s tendency to overfit the data. Pythagorean Forest is a concise visualization of Random Forest, with each Pythagorean Tree plotted side by side.
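Sampling "N from N, with replacement" sounds odd until you see it: each bootstrap sample is the same size as the original data, but some instances appear several times and others not at all. A quick stdlib sketch of that step (the data here is made up):

```python
import random

def bootstrap_sample(data, rng=None):
    """Draw len(data) instances *with replacement*, as Random Forest
    does before growing each of its trees."""
    rng = rng or random.Random(42)  # fixed seed, just for reproducibility
    return [rng.choice(data) for _ in range(len(data))]

instances = list(range(10))
sample = bootstrap_sample(instances)
print(len(sample))  # 10 -- same size as the original data set
# Typically a few instances repeat and a few are left out entirely,
# which is what makes each tree in the forest slightly different.
```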

PythagoreanForest
Different trees are grown side by side. Parameters for the algorithm are set in the Random Forest widget; the whole forest is then sent to Pythagorean Forest for visualization.

 

This makes Pythagorean Forest a great tool to explain how Random Forest works or to further explore each tree in Pythagorean Tree widget.

schema-pythagora

Pythagorean trees are a new addition to Orange. Their implementation was inspired by a recent paper, Generalized Pythagoras Trees for Visualizing Hierarchies by Fabian Beck, Michael Burch, Tanja Munz, Lorenzo Di Silvestro and Daniel Weiskopf, presented at the 5th International Conference on Information Visualization Theory and Applications in 2014.

Network Analysis with Orange

Visualizing relations between data instances can tell us a lot about our data. Let’s see how this works in Orange. We have a data set on machine learning and data mining conferences and journals, with the number of shared authors reported for each pair of publication venues. We can estimate the similarity between two conferences from their author profiles: two conferences are similar if they attract the same authors. The data set is already 9 years old, but obviously, it’s about the principle. 🙂 We’ve got two data files: one is a distance file with distances already computed with the Jaccard index, and the other is a standard conferences.tab file.
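A Jaccard-based distance over author profiles is easy to sketch: the more authors two venues share relative to all their authors combined, the smaller the distance. The author sets below are invented for illustration; conferences.dst ships with the real values precomputed.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical author sets, 1 for disjoint ones."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Hypothetical author profiles of two publication venues.
venue_1 = {"smith", "jones", "garcia", "chen"}
venue_2 = {"smith", "garcia", "chen", "patel"}
print(jaccard_distance(venue_1, venue_2))  # 3 shared of 5 total authors -> 0.4
```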

conferences
Conferences.tab data file with the type of the publication venue (conference or journal) and average number of authors and published papers.

 

We load the .tab file with the File widget (the data set already comes with Orange) and the .dst file with the Distance File widget (select ‘Browse documentation data sets’ and choose conferences.dst).

Distance File Widget
You can find conferences.dst in ‘Browse documentation data sets’.

 

Now we would like to create a graph from the distance file. Connect Distance File to Network from Distances. In the widget, we’ve selected a high distance threshold, because we would like to get more connections between nodes. We’ve also checked ‘Include also closest neighbors’ to see each node connected with at least one other node.
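The idea behind these two settings can be sketched in a few lines: keep an edge wherever the distance falls under the threshold, and with the closest-neighbor option, make sure every node keeps at least one edge. This is only the concept, not the widget's exact algorithm, and the distance matrix is a toy one:

```python
def edges_from_distances(dist, threshold, include_closest=True):
    """Edges between all node pairs closer than `threshold`; with
    `include_closest`, every node also keeps an edge to its nearest
    neighbour, so no node is left isolated."""
    n = len(dist)
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if dist[i][j] <= threshold}
    if include_closest:
        for i in range(n):
            j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
            edges.add((min(i, j), max(i, j)))
    return edges

# Toy symmetric distance matrix for four venues.
d = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.3],
     [0.8, 0.9, 0.3, 0.0]]
print(sorted(edges_from_distances(d, threshold=0.5)))  # [(0, 1), (2, 3)]
```

Raising the threshold admits more pairs, which is exactly why we chose a high one in the widget: more connections between nodes.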

network-from-distances
We’ve set a high distance threshold, since we wanted to display connections between most of our nodes.

 

We can visualize our graph in Network Explorer. What we get is a rather uninformative network of conferences with labelled nodes. Now for the fun part: connect the File widget to Network Explorer and set the link type to ‘Node Data’. This matches the two domains and displays additional labelling options in Network Explorer.

link-to-node-data
Remove the ‘Node Subset’ link and connect Data to Node Data. This will display other attributes in Network Explorer by which you can label and color your network nodes.

 

network-explorer-conferences
Nodes are colored by event type (conference or journal) and adjusted in size by the average number of authors per event (bigger nodes represent larger events).

 

We’ve colored the nodes by type and set node size to the number of authors per venue. Finally, we’ve set the node label to ‘name’. It seems the International Conference on AI and Law and the AI and Law journal are connected through shared authors. The same goes for the AI in Medicine in Europe conference and the AI and Medicine journal. The connections indeed make sense.

conference1
The entire workflow.

 

There are many other things you can do with the Networks add-on in Orange. You can color nodes by predictions, highlight misclassifications or output only nodes with certain network parameters. But for today, let this be it.

Rehaul of Text Mining Add-On

Google Summer of Code is progressing nicely and some major improvements are already live! Our students have been working hard, and today we’re thanking Alexey for his work on the Text Mining add-on. The two major tasks before the midterms were to introduce the Twitter widget and rehaul Preprocess Text. The Twitter widget was designed to be part of our summer school program, and it worked beautifully. We introduced youngsters to the world of data mining through social networks, and one of the most exciting things was seeing whether we could predict the author from the tweet content.

The Twitter widget offers many functionalities. Since we wanted to get tweets from specific authors, we entered their Twitter handles as queries and set ‘Search by Author’. We included only Author, Content and Date in the query parameters, as we want to predict the author on the basis of text alone.

Twitter1-stamped

  1. Provide API keys.
  2. Insert queries separated by newline.
  3. Search by content, author or both.
  4. Set date (1 week limit from tweepy module).
  5. Select language you want your tweets to be in.
  6. If ‘Max tweets’ is checked, you can set the maximum number of tweets you want to query. Otherwise the widget will provide all tweets matching the query.
  7. If ‘Accumulate results’ is checked, new queries will be appended to the old ones.
  8. Select what kind of data you want to retrieve.
  9. Tweet count.
  10. Press ‘Search’ to start your query.

We got 208 tweets on the output. Not bad. Now we need to preprocess them before we do any predictions. We transformed all words to lowercase and split (tokenized) the text into words. We didn’t use any normalization (it is turned on in the screenshot below just as an example) and applied simple stopword removal.

PreprocessText1-stamped

  1. Information on the input and output.
  2. Transformation applies basic modifications of text.
  3. Tokenization splits the corpus into tokens according to the selected method (regexp is set to extract only words by default).
  4. Normalization lemmatizes words (do, did, done –> do).
  5. Filtering extracts only desired tokens (without stopwords, including only specified words, or by frequency).
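The three steps we actually used (lowercasing, word tokenization, stopword filtering) can be sketched in plain Python; the stopword list here is a tiny illustrative stand-in for the real one:

```python
import re

STOPWORDS = {"the", "a", "an", "to", "is", "and", "of", "in"}  # tiny stand-in list

def preprocess(text):
    """Lowercase, tokenize into words, and drop stopwords --
    the same three steps we enabled in Preprocess Text."""
    tokens = re.findall(r"\w+", text.lower())  # regexp tokenizer: words only
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```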

Then we passed the tokens through a Bag of Words and observed the results in a Word Cloud.
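A bag of words simply counts how often each token appears in each document and lines the counts up over a shared vocabulary. A minimal sketch of that document-term matrix idea, with made-up documents:

```python
from collections import Counter

def bag_of_words(documents):
    """Term frequencies per document, laid out as rows over a shared,
    sorted vocabulary (the document-term matrix)."""
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix

docs = [["data", "mining", "data"], ["text", "mining"]]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['data', 'mining', 'text']
print(matrix)  # [[2, 1, 0], [0, 1, 1]]
```

These rows are what classifiers downstream actually see; the original word order is deliberately thrown away.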

wordcloud-twitter

Then we simply connected Bag of Words to Test & Score and used several classifiers to see which one works best. We used Classification Tree and Nearest Neighbors, since they are easy to explain even to teenagers. Classification Tree in particular offers a nice visualization in Classification Tree Viewer that makes the idea of the algorithm easy to understand. Moreover, we could observe the most distinctive words in the tree.

classtree1

Do these make any sense? You be the judge. 🙂

We checked classification results in Test & Score, counted misclassifications in Confusion Matrix and finally observed them in Corpus Viewer. k-NN seems to perform moderately well, while Classification Tree fails miserably. Still, this was trained on barely 200 tweets; accumulating results over time might give much better results. You can certainly try it on your own! Update your Orange3-Text add-on or install it via ‘pip install Orange3-Text’!
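Counting misclassifications, as Confusion Matrix does, boils down to tallying (true, predicted) label pairs; the off-diagonal pairs are the mistakes. A toy version with invented labels:

```python
from collections import Counter

def confusion_counts(true, predicted):
    """Tally (true, predicted) label pairs; pairs where the two labels
    differ are the misclassifications you inspect in Confusion Matrix."""
    return Counter(zip(true, predicted))

true      = ["alice", "alice", "bob", "bob", "bob"]
predicted = ["alice", "bob",   "bob", "bob", "alice"]
pairs = confusion_counts(true, predicted)
accuracy = sum(n for (t, p), n in pairs.items() if t == p) / len(true)
print(accuracy)  # 3 of 5 correct -> 0.6
```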

schema-twitter-preprocess

Above is the final workflow: preprocessing on the left, testing and scoring at the bottom right, and construction of the classification tree at the top right.