Image Analytics: Clustering

Data does not always come in a nice tabular form. It can also be a collection of text, audio recordings, video materials or even images. However, computers can only work with numbers, so for any data mining, we need to transform such unstructured data into a vector representation.

For retrieving numbers from unstructured data, Orange can use deep network embedders. We have just started to include various embedders in Orange, and for now, they are available for text and images.


Here, we give an example of image embedding and show how easy it is to use in Orange. Technically, Orange sends the image to a server, which pushes it through a pre-trained deep neural network, such as Google’s Inception v3. Deep networks are usually trained with some special purpose in mind. Inception v3, for instance, can classify images into any of 1000 image classes. We can disregard the classification, take instead the penultimate layer of the network with its 2048 nodes (numbers), and use that as the image’s vector-based representation.
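To make the "penultimate layer as embedding" idea concrete, here is a minimal sketch in plain Python. It is not Inception v3 and not the code Orange's server runs: `ToyNetwork`, its layer sizes (12 inputs, 8 penultimate nodes, 3 classes) and its random weights are all made up for illustration. The only point is that `embed` stops one layer before `classify`.

```python
import math
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Random weights and biases; a real network would load trained ones."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


class ToyNetwork:
    """Hypothetical stand-in for a pre-trained image classifier."""

    def __init__(self, n_inputs=12, n_penultimate=8, n_classes=3, seed=0):
        rng = random.Random(seed)
        self.hidden = random_layer(rng, n_penultimate, n_inputs)
        self.output = random_layer(rng, n_classes, n_penultimate)

    def embed(self, pixels):
        """Penultimate-layer activations: this is the embedding."""
        return relu(dense(pixels, *self.hidden))

    def classify(self, pixels):
        """Final layer on top of the embedding: class probabilities."""
        return softmax(dense(self.embed(pixels), *self.output))


network = ToyNetwork()
image = [i / 11 for i in range(12)]      # a fake 12-"pixel" image
embedding = network.embed(image)         # what we keep
probabilities = network.classify(image)  # what we disregard
print(len(embedding), "embedding values,", len(probabilities), "class probabilities")
```

In the real setting the embedding has 2048 values instead of 8, but the principle is the same: we read off the layer just before the class probabilities.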

Let’s see this in an example.

Here we have 19 images of domestic animals. First, download the images and unzip them. Then use the Import Images widget from Orange’s Image Analytics add-on and open the directory containing the images.

We can visualize images in the Image Viewer widget. Here is our workflow so far, with images shown in Image Viewer:

But what do we see in a data table? Only some useless description of images (file name, the location of the file, its size, and the image width and height).

This cannot help us with machine learning. As I said before, we need numbers. To acquire a numerical representation of these images, we will send the images to the Image Embedding widget.

Great! Now we have the numbers we wanted. There are 2048 of them (columns n0 to n2047). From now on, we can apply all the standard machine learning techniques, say, clustering.

Let us measure the distance between these images and see which are the most similar. We use the Distances widget to measure the distance. Normally, cosine distance works best for images, but you can experiment on your own. Then we pass the distance matrix to Hierarchical Clustering to visualize similar pairs in a dendrogram.
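For the curious, the two steps (cosine distance, then hierarchical clustering) can be sketched in a few lines of plain Python. This is not what the Distances and Hierarchical Clustering widgets run internally; the average-linkage rule, the function names and the tiny 3-number "embeddings" below are assumptions chosen to keep the example short.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Made-up 3-number "embeddings"; real ones have 2048 numbers per image.
names = ["cow_1", "cow_2", "hen_1", "hen_2"]
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
groups = agglomerate(vectors, n_clusters=2)
grouped_names = [sorted(names[i] for i in group) for group in groups]
print(grouped_names)
```

A dendrogram is simply a drawing of the order in which these merges happen and at what distance.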

This looks very promising! All the right animals are grouped together. But I can’t see the results so well in the dendrogram. I want to see the images – with Image Viewer!

So cool! All the cow family is grouped together! Now we can click on different branches of the dendrogram and observe which animals belong to which group.

But I know what you are going to say. You are going to say I am cheating. That I intentionally selected similar images to trick you.

I will prove you wrong. I will take a new cow, say, the most famous cow in Europe: the Milka cow.

This image is quite different from the other images – it doesn’t have a white background, it’s a real (yet photoshopped) photo and the cow is facing us. Will the Image Embedding find the right numerical representation for this cow?

Indeed it has. Milka is nicely put together with all the other cows.
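If you want a quick sanity check of the same kind outside the dendrogram, one simple option is to ask which known image lies closest to the new one. The sketch below swaps in a nearest-neighbour lookup for the clustering used in the workflow; the labels, the 3-number vectors and the `nearest_label` helper are all invented for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_label(new_vector, labelled_vectors):
    """Return the label of the stored embedding closest to new_vector."""
    label, _ = min(labelled_vectors,
                   key=lambda item: cosine_distance(new_vector, item[1]))
    return label


# Hypothetical embeddings of images we already have.
known = [
    ("cow", [0.9, 0.1, 0.0]),
    ("hen", [0.1, 0.9, 0.2]),
    ("horse", [0.2, 0.1, 0.9]),
]
# Hypothetical embedding of a new, rather different-looking cow photo.
new_cow = [0.7, 0.3, 0.2]
closest = nearest_label(new_cow, known)
print(closest)
```

If the embedding captures what is in the picture rather than the background or the pose, the new cow should land next to the other cows.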

Image analytics is such an exciting field in machine learning, and now Orange is a part of it too! You need to install the Image Analytics add-on and you are all set for your research!

29 thoughts on “Image Analytics: Clustering”

  1. Regarding the cross-validation option in the Test and Score widget, is it nested? In other words, am I allowed to do feature selection (i.e. use Chi2 in preprocessing to reduce the number of features) on the training set and then use these features on the test set to measure my model’s performance, or would this lead to a biased overestimate? Again, many thanks for the brilliant platform you developed and provided to the community!

  2. Could image embedding be considered a transfer learning method?
    How does it really work when the local images are uploaded to the server? What exactly is happening? Is it doing some kind of fine-tuning on the last fully connected layer using the local images? Thanks again for the great platform.

    1. Yes, image embedding is a transfer learning method. We are using the penultimate layer of the trained network to extract image vectors. So not the last layer with class probabilities, but the one before that, since we are interested in embeddings, a kind of image descriptor that can then be used for either clustering or classification.

        1. Dear Nickolas, it seems a bit strange that you would use both Predictions and Test&Score on the same data. I think it would make the most sense to train the model with T&S on train data, test the parameters with T&S’s Test Data input, measure the AUC on the validation data, then apply the model in Predictions on test (holdout) data. You could have a 50/25/25 split. If you don’t need to validate the hyperparameters, then T&S with 70/30 is enough. Performance is best measured with T&S.
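The 50/25/25 split mentioned above could be sketched as follows. This is plain Python rather than anything Orange does internally; `three_way_split`, the seed and the list of 100 image ids are made up for the example.

```python
import random


def three_way_split(items, fractions=(0.5, 0.25, 0.25), seed=42):
    """Shuffle once, then cut into train / validation / test parts.

    Only the first two fractions are used for the cuts; the test part
    receives whatever remains, so no item is lost to rounding.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, validation, test


image_ids = list(range(100))  # hypothetical ids of 100 embedded images
train, validation, test = three_way_split(image_ids)
print(len(train), len(validation), len(test))
```

The validation part is for choosing hyperparameters; the test part is touched only once, at the very end.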

        1. I mean, it is correct, but unnecessarily complicated. I would suggest you just stay with Test&Score, unless you have unlabelled data you wish to predict. I think Import Images – Image Embedding – Test&Score with cross validation should be enough, using just one LogReg.

        2. Thanks Ajda. How should I cite the Orange3 platform, since I’m planning to publish some results? Is there a specific paper describing the method that you use for Image Embedding?

          1. Dear Nickolas, the paper for Image Embedding is in the process of publication. Surprisingly, it is not easy to get this published, even though we think it’s quite a cool approach. For now, you can cite core Orange as stated here:

  3. Hi, I just installed the “Image Analytics” add-on but the “Image Embedding” widget is missing!
    Orange runs from Anaconda and my OS is Windows 7 Pro.
    Thanks for your help.

  4. I was able to download the Image Analytics add-on. However, there is no Image Embedding widget. I am using Orange 3.8 on Windows 10. Any help would be appreciated.

  5. Hi, I’m trying to install Image Analytics (latest version 0.1.7) for my Orange 3.4.5 on my Windows machine. I got the following error:

    An error occurred while running a subprocess
    Command failed: python -m pip install Orange3-ImageAnalytics exited with non zero status

    After I clicked ‘Show Details’:

    Collecting Orange3-ImageAnalytics
    Using cached Orange3-ImageAnalytics-0.1.7.tar.gz
    Complete output from command python

    Command “python setup.py egg_info” failed with error code 1 in C:\Users\something\AppData\Local\Temp\pip-build-59_vln2o\Orange3-ImageAnalytics

    Any suggestion to get around it?

      1. I access Orange via Anaconda. I also tried conda install orange3-imageanalytics at Anaconda Command Prompt. Same error. You mean… you can install imageanalytics okay?

          1. Seems like Image Analytics has not yet been ported to Anaconda. A simple pip install orange3-imageanalytics should do.

    1. Hi

      I am not sure if you were successful in installing the image analytics add-on.
      I was also getting an error on Anaconda3/Orange3.8.0 on my Windows10 PC.
      Within the detailed error log I had a “…Visual C++ 14.0 is required…” message.
      If you are running into the same error, then this article may solve it for you.
      It did for me! (I had Microsoft Visual C++ 14.0 already installed but had to install the Build Tools.)
      Good luck if you are still trying to solve the error.


    2. I’ve got the same problem. I am using the Python version. I solved the problem with the following steps.

      1. In the Start menu, find “Orange Command Prompt”
      2. Right click, select “Run as Administrator”
      3. In the prompt, run “python -m pip install Orange3-ImageAnalytics”
      4. Open Orange again and you should find “Image Analytics” with 3 widgets inside: Import Images, Image Viewer and Image Embedding. Enjoy!
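After step 3, a quick way to confirm that the add-on landed in the same Python that Orange uses is to ask Python whether it can find the package. This is only a sketch: the import name `orangecontrib.imageanalytics` is an assumption about how the add-on is packaged, and `is_installed` is a helper written for this example.

```python
import importlib.util


def is_installed(module_name):
    """True if Python can locate the module, without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised for dotted names whose parent package is missing.
        return False


# Assumed import name of the Image Analytics add-on.
ADDON_MODULE = "orangecontrib.imageanalytics"

if is_installed(ADDON_MODULE):
    print("Image Analytics add-on found.")
else:
    print("Add-on not found in this Python environment.")
```

Run it from the same “Orange Command Prompt”; if it reports that the add-on is missing, pip most likely installed into a different Python environment.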

  6. Orange is by far the best image analytic solution ever….. It’s awesome…..Thanks to the community of developers for developing this great tool….
