RNNs in Computer Vision — Image captioning

Feb 18, 2020

Generally, people specialize in either RNNs or CNNs.

My point is the following: learning both opens up better use cases.

Last week, I tried the final project of the Introduction to Deep Learning course from HSE (Higher School of Economics). At the end of the article, I will talk a bit about whether I recommend the course or not.

💬 I will also run the project on the Conor McGregor UFC image.

In this project, we learn how to use the output of a Convolutional Neural Network (CNN) for tasks other than image classification or regression.

This time, we learn how to feed this output into another neural network: a Recurrent Neural Network (RNN). An RNN is a type of neural network that can work with sequences such as text, sound, video, financial data, …


Why image captioning?

A picture is worth a thousand words, but sometimes we actually want the words.

Let’s just stop for a moment and try to understand the possibilities of image captioning.

If the output is a sequence of words, it means that we intend to do something with those words.

We can use them for general context understanding, or for more detailed scenarios.

Let’s say that you have to identify a specific type of clothing on someone in order to recommend style-matching clothes. This could change fashion retail forever.

Image captioning can help here by identifying the specific clothes someone is wearing and understanding their style.

In this example, the detail is not strong enough. It can get better.

Going to the extreme, we could even describe a football match in real time and replace the commentary. I am not talking about a robot voice here; we could imitate whoever we want.

Just look at that AI that can imitate Joe Rogan: https://fakejoerogan.com


We use two neural networks: a CNN and an RNN.

Here I assume you’re a bit familiar with both.

Before getting into technical details, let’s view the dataset and the output we want to generate.


Image — Label

The dataset is a collection of images and captions. Here, it’s the COCO dataset.

For each image, a set of sentences (captions) is used as a label to describe the scene.

It means our final output will be a sentence like these.


The words are converted into tokens (integer indices into a vocabulary), and an embedding layer then maps each token to a vector, called a word embedding.
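
As a rough illustration of how a caption becomes a sequence of token indices, here is a minimal sketch in plain Python. The special token names (`#PAD#`, `#UNK#`, `#START#`, `#END#`) and the helpers `build_vocab` and `caption_to_indices` are assumptions made for this example, not necessarily what the course code uses.

```python
# Hypothetical special tokens; the names are assumptions for this sketch.
PAD, UNK, START, END = "#PAD#", "#UNK#", "#START#", "#END#"


def build_vocab(captions):
    """Map every word seen in the captions (plus special tokens) to an integer index."""
    words = sorted({word for caption in captions for word in caption.lower().split()})
    return {token: index for index, token in enumerate([PAD, UNK, START, END] + words)}


def caption_to_indices(caption, vocab):
    """Wrap a caption in START/END and replace each word by its vocabulary index."""
    tokens = [START] + caption.lower().split() + [END]
    # Words that are not in the vocabulary fall back to the UNK index.
    return [vocab.get(token, vocab[UNK]) for token in tokens]


captions = ["a man holds a slice of pizza", "a man holding his beer"]
vocab = build_vocab(captions)
indices = caption_to_indices("a man holds his pizza", vocab)
print(indices)
```

These integer indices are what an embedding layer later turns into vectors.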

The process to convert an image into words/tokens is as follows:

  • Take an image as an input and embed it
  • Condition the Recurrent Neural Network on that embedding
  • Predict the next token given a START input token
  • Use predicted token as an input at next time step
  • Iterate until you predict an END token
TL;DR — We have images and sentences for each one. Sentences are converted into vectors. We also use a vocabulary of every word we have in the dataset.
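
The generation loop listed above can be sketched in a few lines of Python. Here `next_token_scores` is a hypothetical stand-in for the trained decoder (it simply reads a hard-coded table), and `generate_caption`, `max_len` and the token names are likewise assumptions for illustration.

```python
START, END = "#START#", "#END#"

# Toy stand-in for the trained decoder: for each previous token, a score per
# candidate next token. A real model would compute these from the image
# embedding and the LSTM state instead of a lookup table.
TOY_SCORES = {
    START: {"a": 0.9, "man": 0.1},
    "a": {"man": 0.8, "pizza": 0.2},
    "man": {"eating": 0.7, END: 0.3},
    "eating": {"pizza": 0.9, END: 0.1},
    "pizza": {END: 1.0},
}


def next_token_scores(previous_token):
    """Hypothetical decoder step: scores for the next token."""
    return TOY_SCORES[previous_token]


def generate_caption(max_len=20):
    """Greedy decoding: always pick the highest-scoring next token."""
    caption = []
    token = START
    for _ in range(max_len):
        scores = next_token_scores(token)
        token = max(scores, key=scores.get)
        if token == END:
            break
        caption.append(token)
    return caption


print(" ".join(generate_caption()))
```

The `max_len` cap is a safety net in case the model never predicts the END token.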


The encoder is a Convolutional Neural Network named Inception v3.

This is a popular architecture for image classification.

Inception v3

The code used to build that CNN with Keras is below.

import keras

def get_cnn_encoder():
    # Load Inception v3 without its final classification layers
    model = keras.applications.InceptionV3(include_top=False)
    preprocess_for_model = keras.applications.inception_v3.preprocess_input
    # Average each feature map over its spatial dimensions: one vector per image
    model = keras.models.Model(model.inputs,
                               keras.layers.GlobalAveragePooling2D()(model.output))
    return model, preprocess_for_model

As you can see, the fully connected classification head is dropped by passing include_top=False in the function call. It means that we use the convolutional features directly, without mapping them to a specific purpose (classification, regression, ...).

Here, I assume you are already familiar with CNNs and this kind of code.

We simply create an Inception v3 model that we return; we don’t have to create the layers ourselves.
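
To make the role of GlobalAveragePooling2D more concrete, here is a minimal pure-Python sketch of what that layer computes for one image: it averages each channel over all spatial positions, so a height × width × channels feature map becomes a single vector with one value per channel. The function name `global_average_pool` and the tiny 2 × 2 × 3 feature map are made up for the example.

```python
def global_average_pool(feature_map):
    """Average a [height][width][channels] nested list over height and width."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                pooled[c] += value
    return [total / (height * width) for total in pooled]


# Toy 2 x 2 feature map with 3 channels (values chosen arbitrarily).
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
image_embedding = global_average_pool(feature_map)
print(image_embedding)  # one value per channel
```

With Inception v3, the final convolutional feature map has 2048 channels, so each image ends up as a 2048-dimensional embedding.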


The decoder uses a Recurrent Neural Network with LSTM cells to generate the captions.

The CNN output is adapted and fed to a Recurrent Neural Network that learns to generate the words.


First, you might notice the vertical layers.

This is what a Recurrent Neural Network looks like when it is unrolled over time.

Every vertical layer tries to predict the next word given the image and the previous words.

  • The first layer takes the embedded image and the “start” token and predicts “a”; then “man” is predicted, so the RNN writes “a man”; other words, such as “pizza”, are then generated.
  • We then learn how to say that a man holds a slice of pizza.

The image features are used, and we try to correlate them with our captions.

To get a long-term memory, the RNN is built from LSTM (Long Short-Term Memory) cells, which can carry information about earlier words across time steps.

For example, “a man holding ___ beer” could be completed as “a man holding his beer”, because the notion of masculinity is preserved from earlier in the sentence.
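
To give a feel for how an LSTM cell keeps information around, here is a minimal sketch of one step of a single-unit LSTM in plain Python. It follows the standard LSTM equations rather than the TensorFlow implementation used in the project; the function names and the hand-picked weights are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a single-unit LSTM (scalar input, hidden state and cell state)."""

    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    forget_gate = sigmoid(pre_activation("forget"))  # how much old memory to keep
    input_gate = sigmoid(pre_activation("input"))    # how much new information to write
    output_gate = sigmoid(pre_activation("output"))  # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c


# Hand-picked weights (w_x, w_h, bias): forget gate almost fully open,
# input gate almost fully closed, so the cell state is carried along.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with something stored in the cell state
for x in [0.5, -1.0, 2.0, 0.0, -0.3]:
    h, c = lstm_step(x, h, c, weights)
print(c)  # stays close to the initial 1.0
```

In a trained model the gates are learned, so the network decides for itself what to remember (for instance that the subject is a man) and for how long.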

Finally, the horizontal layers are ordinary neural network layers, as in any deep network.

We could even stack more of these.

Let’s dive into the code to actually visualize it.

The decoder part first uses word embeddings. Let’s analyze the function.

We first define a decoder class and two placeholders.

In TensorFlow (1.x), a placeholder is used to feed data into a model during training. We will have one placeholder for the image embeddings and one for the sentences.

class decoder:
    # [batch_size, IMG_EMBED_SIZE] image embeddings produced by the CNN encoder
    img_embeds = tf.placeholder('float32', [None, IMG_EMBED_SIZE])
    # [batch_size, time_steps] integer token indices for the captions
    sentences = tf.placeholder('int32', [None, None])

Then, we define functions:

  • img_embed_to_bottleneck projects the image embedding to a smaller bottleneck, which reduces the number of parameters.
  • img_embed_bottleneck_to_h0 converts the bottlenecked image embedding into the initial LSTM state.
  • word_embed creates a word embedding layer with one entry per word in the vocabulary (all existing words).

    # L is assumed here to be an alias for keras.layers
    img_embed_to_bottleneck = L.Dense(IMG_EMBED_BOTTLENECK, input_shape=(None, IMG_EMBED_SIZE), activation='elu')
    img_embed_bottleneck_to_h0 = L.Dense(LSTM_UNITS, input_shape=(None, IMG_EMBED_BOTTLENECK), activation='elu')
    word_embed = L.Embedding(len(vocab), WORD_EMBED_SIZE)

The next part creates an LSTM cell of a few hundred units.

Finally, the network must predict words. For each position, it outputs one unnormalized score per vocabulary word; these scores are called logits, so we need to convert the LSTM output into logits:

  • token_logits_bottleneck converts the LSTM output into a smaller logits bottleneck, which reduces the model complexity.
  • token_logits converts the bottleneck features into logits using a Dense() layer.

    lstm = tf.nn.rnn_cell.LSTMCell(LSTM_UNITS)
    token_logits_bottleneck = L.Dense(LOGIT_BOTTLENECK, input_shape=(None, LSTM_UNITS), activation="elu")
    # One logit per word in the vocabulary
    token_logits = L.Dense(len(vocab), input_shape=(None, LOGIT_BOTTLENECK))

We can then condition our LSTM cell on the image embeddings placeholder.

  • We embed all the tokens but the last, since the last token has no next token to predict.
  • Then, we create a dynamic RNN and calculate token logits for all the hidden states. These logits are compared with the ground truth.
  • We create a loss mask that takes the value 1 for real tokens and 0 for padding.
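
The shift between inputs and targets, and the padding mask, are easier to see on a toy batch. In this plain-Python sketch, the token indices and the `PAD_IDX` value are invented for illustration; only the slicing pattern reflects what the decoder does.

```python
# Hypothetical token indices: 0 = PAD, 2 = START, 3 = END; other values are words.
PAD_IDX = 0
sentences = [
    [2, 4, 9, 8, 11, 3],  # a full-length caption
    [2, 4, 5, 3, 0, 0],   # a shorter caption, padded to the same length
]

# Inputs: every token but the last. Targets: every token but the first.
inputs = [sentence[:-1] for sentence in sentences]
targets = [sentence[1:] for sentence in sentences]

# Mask: True where the target is a real token, False where it is padding.
loss_mask = [[token != PAD_IDX for token in row] for row in targets]

for row_inputs, row_targets, row_mask in zip(inputs, targets, loss_mask):
    print(row_inputs, row_targets, row_mask)
```

At each position, the network sees the input token and is trained to predict the target token in the same column.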

Finally, we compute a cross-entropy loss, the loss generally used for classification. It compares flat_ground_truth with flat_token_logits (the predictions).

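    # Initialize both the LSTM cell state (c0) and hidden state (h0) from the image embedding
    c0 = h0 = img_embed_bottleneck_to_h0(img_embed_to_bottleneck(img_embeds))
    # Embed every token except the last one: these are the inputs at each time step
    word_embeds = word_embed(sentences[:, :-1])
    # Run the LSTM over the whole sequence: [batch_size, time_steps, LSTM_UNITS]
    hidden_states, _ = tf.nn.dynamic_rnn(lstm, word_embeds,
                                         initial_state=tf.nn.rnn_cell.LSTMStateTuple(c0, h0))

    # Flatten batch and time so that each row is one prediction step
    flat_hidden_states = tf.reshape(hidden_states, [-1, LSTM_UNITS])
    flat_token_logits = token_logits(token_logits_bottleneck(flat_hidden_states))
    # Targets are the same sentences shifted by one token
    flat_ground_truth = tf.reshape(sentences[:, 1:], [-1])

    # True for real tokens, False for padding
    flat_loss_mask = tf.not_equal(flat_ground_truth, pad_idx)
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=flat_ground_truth, logits=flat_token_logits)
    # Average the loss over non-padding positions only
    loss = tf.reduce_mean(tf.boolean_mask(xent, flat_loss_mask))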
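
To see what the masked loss computes, here is a plain-Python sketch of a softmax cross-entropy averaged over non-padding positions. It mirrors the idea rather than TensorFlow's implementation; the helper names `softmax_cross_entropy` and `masked_mean_loss` and the toy numbers are assumptions.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    highest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[label]


def masked_mean_loss(flat_logits, flat_labels, pad_idx):
    """Average the per-token loss over non-padding positions only."""
    losses = [
        softmax_cross_entropy(logits, label)
        for logits, label in zip(flat_logits, flat_labels)
        if label != pad_idx
    ]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 3 tokens, index 0 is padding.
flat_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
flat_labels = [1, 2, 0]  # the last position is padding and is ignored
loss = masked_mean_loss(flat_logits, flat_labels, pad_idx=0)
print(round(loss, 4))
```

Without the mask, the model would also be rewarded for predicting padding, which says nothing about caption quality.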


Let’s visualize some results on real data.

It’s not all perfect, but there is a solid context understanding.

In the top-right image, the woman is mistaken for a man.

To capture finer details, we might want to train the CNN on more specific categories.

Now, what about the UFC image?

I’m a bit disappointed: the model doesn’t understand what UFC is, who Conor is, or what a left hook looks like! We definitely can’t use this model on just any image; we need to train it on UFC examples to get better sentences. However, I’m convinced that we can achieve it.

To learn more about the full project:

