RNNs in Computer Vision — Image captioning

Feb 18, 2020

Generally, people specialize in either RNNs or CNNs.

My point is the following: learning both opens up better use cases.

Last week, I tried the final project of the Introduction to Deep Learning course from HSE (Higher School of Economics). At the end of the article, I will talk a bit about whether I recommend the course or not.

💬 I will also run the project on a Conor McGregor UFC image.

In this project, we learn how to use the output of a Convolutional Neural Network (CNN) for tasks other than classification or regression on images.

This time, we learn how to feed this output into another neural network: a Recurrent Neural Network (RNN). An RNN is a type of neural network that can work with sequences such as text, sound, video, financial data, …

📩 Before we start, I invite you to join the mailing list by leaving your email! This is the most efficient way to understand autonomous tech in depth and join the industry faster than anyone else.


Why image captioning?

A picture is worth a thousand words, but sometimes we actually want the words.

Let’s just stop for a moment and try to understand the possibilities of image captioning.

If the output is a set of words, it means we can actually do something with these words.

We can use them for general context understanding, or for more detailed scenarios.

Let’s say that you have to identify the specific type of clothing someone is wearing in order to recommend matching items. This could change fashion retail forever.

Image captioning can help describe exactly what someone is wearing and understand their style.

In this example, the level of detail is not strong enough yet. It can get better.

Going to the extreme, we could even narrate a football match in real time and replace the commentary. I am not talking about a robot voice here; we could imitate whoever we want.

Just look at that AI that can imitate Joe Rogan: https://fakejoerogan.com

How?

We use two neural networks: a CNN and an RNN.

Here I assume you’re a bit familiar with both.

Before getting into technical details, let’s view the dataset and the output we want to generate.

Dataset

Image — Label


The dataset is a collection of images and captions. Here, it’s the COCO dataset.

For each image, a set of sentences (captions) is used as a label to describe the scene.

It means our model’s final output will be a sentence like one of these.
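To make this concrete, the training data can be pictured as a mapping from each image to a list of captions. The file name and the sentences below are made up for illustration, not actual COCO annotations.

captions = {
    "COCO_train2014_000000001.jpg": [
        "a man is holding a slice of pizza",
        "a person eating pizza at a table",
    ],
}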

Preprocessing

The words are first converted into tokens, and each token is then mapped to a vector in a process called word embedding.
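Here is a rough sketch of that preprocessing on a single caption, assuming #START#, #END#, #PAD# and #UNK# special tokens (the exact token names and ids are illustrative):

caption = "a man is holding a slice of pizza"
tokens = ['#START#'] + caption.split() + ['#END#']

# vocab maps every known word to an integer id (illustrative ids)
vocab = {'#PAD#': 0, '#START#': 1, '#END#': 2, '#UNK#': 3,
         'a': 4, 'man': 5, 'is': 6, 'holding': 7, 'slice': 8, 'of': 9, 'pizza': 10}

token_ids = [vocab.get(t, vocab['#UNK#']) for t in tokens]
# -> [1, 4, 5, 6, 7, 4, 8, 9, 10, 2]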

The process to convert an image into words/tokens is as follows:

  • Take an image as an input and embed it
  • Condition the Recurrent Neural Network on that embedding
  • Predict the next token given a START input token
  • Use the predicted token as the input at the next time step
  • Iterate until you predict an END token

TL;DR — We have images and sentences for each one. Sentences are converted into vectors. We also use a vocabulary of every word we have in the dataset.
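To make the generation loop above concrete, here is a minimal greedy-decoding sketch. It is framework-agnostic: predict_next_token is a hypothetical callable standing in for one RNN step (it returns the id of the most likely next token given the image embedding and the tokens so far), and vocab maps tokens such as #START# and #END# to integer ids.

def generate_caption(img_embedding, predict_next_token, vocab, max_len=20):
    # Start from the #START# token and keep feeding predictions back in
    tokens = [vocab['#START#']]
    for _ in range(max_len):
        next_id = predict_next_token(img_embedding, tokens)
        tokens.append(next_id)
        if next_id == vocab['#END#']:
            break
    # Map ids back to words and drop the special tokens
    id_to_word = {idx: word for word, idx in vocab.items()}
    return ' '.join(id_to_word[t] for t in tokens[1:] if t != vocab['#END#'])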

Encoder

The encoder is a Convolutional Neural Network named Inception v3.

This is a popular architecture for image classification.

Inception v3

The code used to compute that CNN with Keras is below.


import keras
from keras import backend as K

def get_cnn_encoder():
    # Inference only: freeze batch-norm and dropout behavior
    K.set_learning_phase(False)
    # InceptionV3 without its fully-connected classification head
    model = keras.applications.InceptionV3(include_top=False)
    preprocess_for_model = keras.applications.inception_v3.preprocess_input
    # Average-pool the last convolutional feature map into one vector per image
    model = keras.models.Model(model.inputs,
                               keras.layers.GlobalAveragePooling2D()(model.output))

    return model, preprocess_for_model

As you can see, the fully-connected head is removed with the parameter include_top=False inside the function call. It means that we directly use the convolutional features without pushing them through a task-specific layer (classification, regression, ...).

Here, I assume you are already familiar with CNNs and this kind of code.

We simply create an Inception v3 model that we return; we don’t have to create the layers ourselves.
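As a quick sanity check, here is how the returned encoder could be used to embed a single image. The file name mcgregor.jpg is a placeholder; with global average pooling, InceptionV3 produces a 2048-dimensional vector per image.

import numpy as np
from keras.preprocessing import image

encoder, preprocess_for_model = get_cnn_encoder()

# Load and preprocess one image at InceptionV3's expected 299x299 resolution
img = image.load_img("mcgregor.jpg", target_size=(299, 299))
x = preprocess_for_model(np.expand_dims(image.img_to_array(img), axis=0))

img_embedding = encoder.predict(x)  # shape: (1, 2048)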

Decoder

The decoder uses a Recurrent Neural Network with LSTM cells to generate the captions.

The CNN output is adapted and fed to a Recurrent Neural Network that learns to generate the words.

How?

First, you might notice the vertical layers.

This is what a Recurrent Neural Network produces.

Every vertical layer is trying to predict the next word given the image.

  • The first layer takes the embedded image and predicts the START token; then “man” is predicted, so the RNN writes “a man”; other words are then generated, such as “pizza”.
  • We then learn how to say that a man holds a slice of pizza.

The image features are used, and we try to correlate them with our captions.

In order to get long-term memory, the RNN is built from LSTM (Long Short-Term Memory) cells that can keep track of a word’s state.

For example, “a man holding ___ beer” could be completed as “a man holding his beer”: the notion of masculinity is preserved across time steps.

Finally, the horizontal layers are regular neural network layers, as in any deep learning model.

We could even stack more of these.
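With the TensorFlow 1.x API used in this project, stacking LSTM layers is a one-liner. This is just a sketch: the project itself uses a single LSTMCell, and the 300-unit size is illustrative.

import tensorflow as tf

# Two stacked LSTM layers instead of one (illustrative size)
stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.LSTMCell(300) for _ in range(2)])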

Let’s dive into the code to actually visualize it.

The decoder part first uses word embeddings. Let’s analyze the function.

We first define a Decoder class and two placeholders.

In TensorFlow 1.x, a placeholder is used to feed data into the model during training. We will have one placeholder for the image embeddings and one for the sentences.

class decoder:
    # CNN image embeddings: (batch_size, IMG_EMBED_SIZE)
    img_embeds = tf.placeholder('float32', [None, IMG_EMBED_SIZE])
    # Tokenized captions: (batch_size, max_caption_length)
    sentences = tf.placeholder('int32', [None, None])
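The code fragments in this section also rely on a few constants defined earlier in the notebook. For reference, a plausible set of values looks like this; IMG_EMBED_SIZE is fixed by the InceptionV3 encoder, while the other numbers are illustrative and may differ from the course notebook.

IMG_EMBED_SIZE = 2048        # output size of InceptionV3 + global average pooling
IMG_EMBED_BOTTLENECK = 120   # illustrative
WORD_EMBED_SIZE = 100        # illustrative
LSTM_UNITS = 300             # illustrative
LOGIT_BOTTLENECK = 120       # illustrative
pad_idx = vocab['#PAD#']     # id of the padding token (assumes a '#PAD#' entry in the vocabulary)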

Then, we define functions:

  • img_embed_to_bottleneck will reduce the number of parameters.
  • img_embed_bottleneck_to_h0 will convert the previously obtained image embedding into the initial state of the LSTM cell.
  • word_embed will create a word-embedding layer sized to the vocabulary (all existing words).

img_embed_to_bottleneck = L.Dense(IMG_EMBED_BOTTLENECK, input_shape=(None, IMG_EMBED_SIZE), activation='elu')
img_embed_bottleneck_to_h0 = L.Dense(LSTM_UNITS, input_shape=(None, IMG_EMBED_BOTTLENECK), activation='elu')
word_embed = L.Embedding(len(vocab), WORD_EMBED_SIZE)

The next part creates an LSTM cell of a few hundred units.

Finally, the network must predict words. We call these predictions logits and we thus need to convert the LSTM output into logits:

  • token_logits_bottleneck converts the LSTM output into a logits bottleneck, which reduces the model complexity.
  • token_logits converts the bottleneck features into logits over the vocabulary using a Dense() layer.

  lstm = tf.nn.rnn_cell.LSTMCell(LSTM_UNITS)
  token_logits_bottleneck = L.Dense(LOGIT_BOTTLENECK, input_shape=(None, LSTM_UNITS), activation="elu")
  token_logits = L.Dense(len(vocab), input_shape=(None, LOGIT_BOTTLENECK))

We can then condition our LSTM cell on the image embeddings placeholder.

  • We embed all the tokens but the last one.
  • Then, we create a dynamic RNN and compute token logits for all the hidden states. We will compare these with the ground truth.
  • We create a loss mask that takes the value 1 for real tokens and 0 for padding.

Finally, we compute a cross-entropy loss, the loss generally used for classification. It compares the flat_ground_truth to the flat_token_logits (the predictions).

  # Condition the LSTM's initial state on the image embedding
  c0 = h0 = img_embed_bottleneck_to_h0(img_embed_to_bottleneck(img_embeds))
  # Embed all tokens but the last, since there is nothing to predict after it
  word_embeds = word_embed(sentences[:, :-1])
  hidden_states, _ = tf.nn.dynamic_rnn(lstm, word_embeds,
                                       initial_state=tf.nn.rnn_cell.LSTMStateTuple(c0, h0))

  # Compute token logits for every hidden state
  flat_hidden_states = tf.reshape(hidden_states, [-1, LSTM_UNITS])
  flat_token_logits = token_logits(token_logits_bottleneck(flat_hidden_states))
  # Ground truth: all tokens but the first (we always predict the next token)
  flat_ground_truth = tf.reshape(sentences[:, 1:], [-1])

  # Mask out padding tokens and average the cross-entropy over real tokens only
  flat_loss_mask = tf.not_equal(flat_ground_truth, pad_idx)
  xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=flat_ground_truth, logits=flat_token_logits)
  loss = tf.reduce_mean(tf.boolean_mask(xent, flat_loss_mask))
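For completeness, here is a hedged sketch of how a training loop could look with these placeholders, in TF 1.x style. It assumes the fragments above live inside the decoder class as in the course notebook (so the loss is reachable as decoder.loss), and generate_batch() is a hypothetical helper that yields image embeddings and padded, tokenized captions.

train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(decoder.loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):  # number of training batches is illustrative
        batch_img_embeds, batch_sentences = generate_batch()
        _, batch_loss = sess.run(
            [train_step, decoder.loss],
            feed_dict={decoder.img_embeds: batch_img_embeds,
                       decoder.sentences: batch_sentences})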

Results

Let’s visualize some results on real data.



It’s not all perfect, but there is a solid context understanding.

On the top right image, the woman is confused with a man.

To dive deeper, we would want to train the model on more specific data.

Now, what about the UFC image?

I’m a bit disappointed: the model doesn’t understand what the UFC is, who Conor is, or what a left hook looks like! We definitely can’t use this model on any image; we need to train it on UFC examples to get better sentences. However, I’m convinced that we can achieve it.


To learn more about the full project:

https://github.com/Jeremy26/image-captioning/blob/master/image_captioning.ipynb

📩 To learn more about self-driving car technology, I invite you to join the mailing list for exclusive content, career tips, discounts, and more!

Mindmap

Interested in Autonomous Systems? Download the Self-Driving Car Engineer Mindmap

The Self-Driving Car Engineer Mindmap is a video + PDF mindmap showing you the main areas of self-driving cars, and giving you a path to build a career as a self-driving car engineer.