Convolutional neural networks are great. Sometimes.
CNNs have revolutionized computer vision. But they are far from perfect. Here we see why it’s vital to understand the limits of CNNs.
Convolutional Neural Networks (CNNs) have revolutionized how we approach computer vision. They let us tackle problems in an end-to-end fashion, rather than dismantling the problem into constituent parts. This also makes them ideal as the basis for many deep learning systems. However, they are far from perfect and are still not completely understood. So, if you are going to rely on them, it’s vital to understand the limits of CNNs.
What is a CNN?
Convolutional neural networks were developed in the 1980s as a solution for computer vision problems. A CNN is a network with multiple layers of artificial neurons connected together. They are perfect for computer vision because each layer can learn to identify a different set of features in the input image. These features can get progressively more complex, allowing CNNs to perform remarkably well at image recognition.
Modern CNNs often include both convolution layers and pooling layers. Convolution layers look for specific shapes or features within the image. Pooling layers then simplify the resulting convolution into a smaller image. This is shown in the following set of figures.
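To make the two layer types concrete, here is a minimal NumPy sketch of one convolution followed by one pooling step. The helper names (conv2d, max_pool), the toy 6x6 image, and the hand-written edge kernel are all illustrative assumptions. In a real CNN the kernel values are learned during training rather than written by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image with no padding and a stride of 1.

    Strictly this is cross-correlation, which is what deep learning
    libraries usually call convolution.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical kernel that responds to dark-to-bright vertical edges.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

features = conv2d(image, edge_kernel)  # 4x4 map, each row is [0, 3, 3, 0]
pooled = max_pool(features)            # 2x2 map, every value is 3
```

The convolution output is large only where the window straddles the edge, and pooling keeps that strong response while shrinking the map from 4x4 to 2x2.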
How are CNNs trained?
CNNs are usually trained with supervised learning. At each layer in the CNN, the outputs from the previous layer are multiplied by weights and act as the inputs to the new layer. The values of these weights determine how that layer will respond to the input image.
You train the CNN by passing in thousands or even millions of images that have been annotated with the correct details. For instance, in the MNIST dataset in the above example, each handwritten numeral is correctly labeled with the corresponding digit. Each time you pass in one of the training images, you compare the resulting prediction at the output with the correct label. So, for the MNIST dataset, you’d have 10 outputs for the digits 0-9. Initially, the outputs are essentially random. But the system gradually learns to get the right result by adjusting all the weights to reduce the error. The gradients that guide these adjustments are computed by a process known as backpropagation. Eventually, you will have a model that can correctly identify most handwritten numerals, even if they are quite ambiguous.
Obviously, you want an image recognition system that is able to identify much more than just handwritten numerals. So, you need to use richer training datasets. One of the best known is ImageNet, which has over 14 million labeled images you can use for training purposes.
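The training loop described above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not a real CNN: it uses a single layer of weights with 10 outputs, and synthetic data stands in for MNIST. With only one layer, backpropagation reduces to a single gradient expression. All names, sizes, and the learning rate are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled dataset: 10 classes, 64 "pixels" per image.
num_classes, num_pixels, per_class = 10, 64, 20
prototypes = rng.normal(size=(num_classes, num_pixels))
labels = np.repeat(np.arange(num_classes), per_class)
images = prototypes[labels] + 0.3 * rng.normal(size=(labels.size, num_pixels))
one_hot = np.eye(num_classes)[labels]

def softmax(logits):
    """Turn the 10 raw outputs into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Small random weights, so the initial predictions are close to a random guess.
weights = 0.01 * rng.normal(size=(num_pixels, num_classes))
learning_rate = 0.05  # assumed value, chosen for this toy problem
losses = []

for step in range(200):
    probs = softmax(images @ weights)                     # forward pass
    loss = -np.mean(np.log(probs[np.arange(labels.size), labels]))
    losses.append(loss)
    # Gradient of the cross-entropy loss with respect to the weights.
    grad = images.T @ (probs - one_hot) / labels.size
    weights -= learning_rate * grad                       # gradient descent step

accuracy = np.mean(softmax(images @ weights).argmax(axis=1) == labels)
```

The loss starts near ln(10), which is what a random guess over 10 digits gives, and falls as the weights are adjusted. A real CNN repeats the same idea across many layers, with backpropagation carrying the gradient from the output back to every earlier layer.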
The shortcomings of CNNs
One of the biggest problems for CNNs is that they are essentially just learning to spot a pattern in the image. For instance, they learn that a particular group of pixels shows a human face. The issue is that they will have been trained on a particular dataset. Chances are, this dataset shows human faces that have been photographed to look good. The lighting will be perfect, and the faces will be either full-face or turned slightly. However, the dataset probably didn’t include any photographs of people wearing face masks, or of people hiding their faces with their hands. In short, you will have trained a model to identify flattering portraits and selfies.
This issue was particularly highlighted a few years ago when it was revealed that image recognition systems often can’t identify African American faces accurately. More shockingly, in one infamous case, Google’s image recognition system tagged two Black friends as “gorillas”. This has been widely attributed to the system not having been trained with enough images of Black people.
CNNs aren’t humans
The issue above reflects the fact that CNNs aren’t humans, let alone superhumans. They lack our ability to extrapolate from what our eyes see to a true 3D representation of the world. As humans, we learn how the world around us works. For instance, if we see a picture of a dog sitting behind a fence, we automatically recognize it as a dog, even though the fence hides part of it.
However, for a CNN, this is much harder. Another skill we have is the ability to mentally rotate objects in 3D space. For instance, we know that the image below is a human, even at this odd angle. This is because we are able to understand the coordinate frame the picture exists in.
The usual way to train CNNs to cope with this is to train them on even more images. Essentially, you need a library showing objects in unusual poses and obscured by other objects. Alternatively, you can use image augmentation, where you make small changes to the training images, such as rotating them, applying shears, etc. Even then, a neural network can often be fooled by just adding a little carefully chosen noise to an image!
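As a rough illustration of image augmentation, the NumPy sketch below produces a few altered copies of a toy image and a noisy version of it. The helper names (shear_rows, augment, add_noise) are hypothetical and the transforms are deliberately crude. A production pipeline would typically use an image library with small-angle rotations and interpolation rather than the 90-degree rotation shown here. The noise is also plain random noise, not the carefully constructed kind used to fool a network.

```python
import numpy as np

rng = np.random.default_rng(0)

def shear_rows(image, max_shift):
    """Crude horizontal shear: shift each row sideways, more for lower rows.

    np.roll wraps pixels around the edge, which is acceptable for this toy
    image but not for real photographs.
    """
    height = image.shape[0]
    sheared = np.zeros_like(image)
    for row in range(height):
        shift = round(max_shift * row / max(height - 1, 1))
        sheared[row] = np.roll(image[row], shift)
    return sheared

def augment(image):
    """Return the original image plus three altered copies."""
    return [image, np.rot90(image), np.fliplr(image), shear_rows(image, 2)]

def add_noise(image, scale=0.05):
    """Add Gaussian noise and keep pixel values in the range [0, 1]."""
    noisy = image + rng.normal(scale=scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Toy 8x8 image containing a bright vertical bar.
image = np.zeros((8, 8))
image[2:6, 3:5] = 1.0

variants = augment(image)   # four 8x8 training images from one original
noisy = add_noise(image)    # same bar, slightly perturbed pixel values
```

Each variant keeps the same label as the original, so the training set grows without any extra annotation work.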
Moving beyond deep neural networks
The problems I highlighted above don’t just apply to image recognition. They illustrate a fundamental limitation of CNNs in general. The root of the issue is that CNNs learn by recognizing a rough pattern, then progressively recognizing more and more details until they can make a prediction. They then assume that the same pattern recognition will work again and again on new images. However, even humans can be duped by false pattern recognition.
Optical illusions are a perfect example. In the image below, do you see a bird or a rabbit? The answer is, you can see both. But you have to change your frame of reference.
So, what can we do to improve CNNs? This is a rich field of current research. One promising approach for visual recognition is to learn from computer graphics.
In the next blog post, we will see how Sonasoft has solved this problem and embedded SAIBRE, the engine behind our AI bot factory. To learn how we can help you adopt an AI-first approach in your business, please reach out to us below.