In 2012, Alex Krizhevsky et al. demonstrated a deep convolutional neural network that was trained end-to-end to classify images in the ImageNet competition. With a top-5 error of 15.3%, this was the first time deep learning clearly outperformed traditional image classification techniques on such a large classification task.
The ImageNet competition dataset contains 1.2 million training images in 1000 different classes. End-to-end learning means that the model takes the raw images as input and learns all features necessary for the classification internally during the training phase. Traditional image classification techniques instead relied on laborious manual engineering to extract such features from the images.
Since then, we have seen deep neural networks master an astonishing diversity of problems. Researchers have trained larger and larger models with impressive modelling capacity, and the trend does not seem to flatten out any time soon. Today the reigning champion in the ImageNet competition is a deep convolutional network presented by Mingxing Tan and Quoc V. Le in 2019, with a top-5 error of only 2.9%. This model contains 66 million parameters and xx layers.
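For readers unfamiliar with the metric: a prediction counts as correct under top-5 error if the true class is among the five classes the model scores highest. The sketch below shows one way to compute it with NumPy. The function name `top_k_error` and the toy scores are made up for illustration and are not taken from any of the cited papers.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # An example is a hit if its true label appears anywhere in its top-k set.
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy data: 4 examples, 10 classes (hypothetical scores, not ImageNet results).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = np.array([3, 1, 4, 1])
print("top-5 error:", top_k_error(scores, labels, k=5))
print("top-1 error:", top_k_error(scores, labels, k=1))
```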
The ability of a machine learning model to generalize to unseen data follows a few principles. The bias-variance trade-off states that a model with high bias (one that makes strong and erroneous assumptions about the data) tends to underfit and perform unsatisfactorily, while a model with high variance may overfit (perform well on the training data while not generalizing well to the test data). The high capacity of deep neural networks implies high variance, yet they seem to have just the right amount of bias to generalize, despite having millions of parameters (often far more than the number of training examples) and hundreds of layers. Exactly how to explain this is still an open question.
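The classical picture of underfitting and overfitting can be reproduced with a toy regression problem. The sketch below fits polynomials of increasing degree to a few noisy samples of a sine curve; the data, the noise level and the chosen degrees are all assumptions made for illustration. The training error can only shrink as the degree grows, while the error on held-out points typically gets worse again once the model starts fitting the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth function for the toy problem.
    return np.sin(2 * np.pi * x)

# 15 noisy training points and a dense, noise-free test grid.
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)
y_test = true_fn(x_test)

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1: strong (wrong) assumption, high bias. Degree 9: flexible, high variance.
for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Deep networks are interesting precisely because they often do not follow this picture, even when they have enough capacity to memorize the training set.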
Some of the benefits of training large, deep networks have been partly explained. A body of exciting theoretical work has already been published, exploring what happens when neural networks become over-parameterized, or even grow towards infinite size. I look forward to more work in this direction.
While deep neural networks perform outstandingly well on a large number of tasks, they are sometimes described as opaque and difficult to understand. This becomes especially critical in times of growing awareness of data privacy (the GDPR is often interpreted as requiring that AI systems be able to provide explanations for their decisions), and of increasing interest in leveraging advanced AI-based analytics in health care. These concerns should not be taken lightly, which is why they are the focus of much recent research, including work being done by AI researchers at RISE.
To understand the learned parameters in a deep convolutional network such as those mentioned above, we often turn to inspecting the convolutional filters. As data is propagated through a trained neural network, it passes through a number of layers that transform it into signals that are useful for the end task. In the early layers (which perform the initial processing of the data), the filters can be directly visualized and inspected, providing insight into the low-level features that the system reacts to. In subsequent layers, the task is more involved, and people have turned to optimization techniques to find the inputs that the internal units react to. In 2012, Quoc V. Le et al. trained a deep neural autoencoder on randomly selected frames from YouTube videos. An autoencoder is an unsupervised learning model trained to compute vector representations of objects, suitable for use in later processing stages. In this work, the authors showed, using constrained numerical optimization, that there were units deep in the model that detected human faces and bodies, and other units that detected cat faces. Interestingly, the findings match the firing of neurons in our own human brains, which follows a hierarchical structure: simple, concrete patterns are detected close to the eyes, and more abstract concepts are detected deeper in the visual cortex.
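The idea of using optimization to find the input that a unit reacts to most strongly can be sketched in a few lines. The example below is a minimal illustration, not the procedure from the paper by Le et al.: the "network" is a small two-layer model with random weights standing in for a trained one, and all names (`W1`, `w2`, `maximize_activation`) are hypothetical. It runs gradient ascent on the input while keeping the input on the unit sphere, which is one simple form of constrained optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained network: 8 inputs, 16 hidden units, 1 output unit.
W1 = rng.normal(size=(16, 8))
w2 = rng.normal(size=16)

def unit_activation(x):
    """Activation of the output unit for input x."""
    return float(w2 @ np.tanh(W1 @ x))

def unit_gradient(x):
    """Gradient of the unit's activation with respect to the input x."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def maximize_activation(x0, steps=200, lr=0.05):
    """Projected gradient ascent on the input, constrained to unit norm."""
    x = x0 / np.linalg.norm(x0)
    best_x, best_a = x, unit_activation(x)
    for _ in range(steps):
        x = x + lr * unit_gradient(x)   # step towards higher activation
        x = x / np.linalg.norm(x)       # project back onto the unit sphere
        a = unit_activation(x)
        if a > best_a:                  # keep the best input seen so far
            best_x, best_a = x, a
    return best_x, best_a

x0 = rng.normal(size=8)
x_opt, a_opt = maximize_activation(x0)
print("activation at start:", unit_activation(x0 / np.linalg.norm(x0)))
print("activation after optimization:", a_opt)
```

For an image model, `x_opt` would be reshaped and displayed as a picture, showing the pattern the unit has learned to detect.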
Deep neural networks trained for some other tasks employ attention mechanisms, which allow the system to learn the dependencies between specific parts of the data. The attention mechanism is another example of an inspectable component of neural networks, showing where the model puts its focus during computation.
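What makes attention inspectable is that it produces an explicit set of weights, one per input position, that sum to one. The sketch below illustrates this with NumPy. Note that it uses dot-product scoring for brevity, which is a simplification and not the additive scoring function used by Bahdanau et al.; the function name and the toy vectors are made up for illustration.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention for a single query; returns (context, weights)."""
    # One relevance score per input position.
    scores = keys @ query / np.sqrt(query.shape[0])
    # Softmax (shifted by the maximum for numerical stability).
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # The context vector is the weighted average of the values.
    context = weights @ values
    return context, weights

# Toy example: three input positions with 2-dimensional keys and values.
query = np.array([1.0, 0.0])
keys = np.array([[0.0, 1.0], [5.0, 0.0], [-1.0, 0.0]])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

context, weights = attention(query, keys, values)
# The weights can be read directly: the largest one marks where the model "looks".
print("attention weights:", weights)
print("most attended position:", int(np.argmax(weights)))
```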
Within AI research at RISE, we work on strategies for demonstrating the dense distributed representations learned by a deep learning model, and on developing machine learning models that can demonstrate how they come to the conclusions that they do. Expect more on explainable AI in a future blog post.
Hubel and Wiesel were awarded the Nobel Prize in Physiology or Medicine in 1981 for their work on explaining the neural structure of biological brains. After this was announced, Hubel said: “There has been a myth that the brain cannot understand itself. It is compared to a man trying to lift himself by his own bootstraps. We feel that is nonsense. The brain can be studied just as the kidney can.”
While we similarly have had difficulties understanding artificial neural networks, the landscape is changing, and new insights surface now and then (within a fast-moving field such as machine learning, this means rather frequently), shedding more light on their inner workings.
Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. International Conference on Machine Learning. (2018).
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Neural Information Processing Systems. (2018).
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-con… (2012).
Mingxing Tan and Quoc Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International Conference on Machine Learning. https://arxiv.org/abs/1905.11946 (2019).
Quoc Le et al. Building High-level Features Using Large Scale Unsupervised Learning. International Conference on Machine Learning. https://www.icml.cc/2012/papers/73.pdf (2012).
David H. Hubel and Torsten Wiesel. Receptive Fields, Binocular Interaction and Functional Architecture in the Cat’s Visual Cortex. Journal of Physiology 160, 106–154. (1962).
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Translate and Align. International Conference on Learning Representations. https://arxiv.org/abs/1409.0473 (2015).