Active Learning Fundamentals

Jeremy Cohen


A few days ago, I came across a startup’s website in which the main argument was Active Learning. When I read about it, I instantly felt “They know what they’re doing.”.

When people talk about Artificial Intelligence today, they mean Machine LearningAnd when they mean Machine Learning, what they really mean is supervised learning.

Active learning is another subfield of Machine Learning we call semi-supervised.

I spent a few hours trying to understand the topic, just so I can explain it to you. Today will be an introduction. After this article, you will have a better idea of what Active learning means, and you’ll know if it’s a good thing to have on a website or a resume.

Before we start, receive my daily emails on AI and Self-Driving Cars to help you break into the cutting-edge industry.

This will be used to send you daily emails on self-driving cars. Occasionally, we'll also send exclusive discounts on our online courses. You can unsubscribe at any time.

What is Active Learning?

Today Machine Learning is confused with supervised learning.

One image we see everywhere when getting interested in Machine Learning is this one.

You might also encounter Reinforcement Learning, which is not detailed here.

So, where is Active Learning?

If you’re like most of the Machine Learning Engineers, there is a high chance that you do the following association:

  • Supervised Learning —  Classification or Regression
  • Unsupervised Learning —  Clustering
  • Semi-Supervised Learning—I’ll have a look maybe next year

It’s not your fault, Active Learning tutorials are not as popular as the rest. In fact, when I searched on Youtube, it was a disaster. It wasn’t better on Medium.

Active Learning is a rare topic. Does it deserve more attention?

How is Active Learning used?

The most frequent use case of Active Learning is one we see every day.

Did you ever wonder what happened here?
Are you a human labeller?
What happens if you (voluntarily) choose wrong?

When selecting the traffic lights (or other), you are a link in the Active Learning Process.

Active Learning is used a lot when you need to label data.
Data labeling is expensive, long, and boring.

I once labeled a custom object detection algorithm, it took me hours of repeating the same task.

What is the labeling process for object detection?
Mostly, you fill a txt file with bounding box coordinates and class of each object. You generally install a GitHub project to simply draw on an image and automatically generate a file. Still, it’s very long and tedious.

Now, can Active Learning help here? YES.
Active Learning is used to reduce the labeling time.
How? By only labeling the hardest elements. 

The term “Active” implies that the model is constantly learning
How? By using the newly encountered elements for training

Active Learning Architecture

Let’s say you have a model that is doing okay at recognizing cars. You’d like your model to improve with time, and with new events. In particular, every time you cross a new car, you’d like to add this car to your dataset.

If your car detector is already trained and running, it should recognize at least 90% of the vehicles with high confidence. The other 10% should be either false positive, false negative, or low-confidence detections.
What active learning does is adding every high confidence detection to the dataset, and asking a human to label every low-confidence detections in the remaining 10%.

  • We start with an unlabelled dataset (the world) and a trained model.
  • We run the model on each new unlabelled example, it generates a prediction or label.
  • If the confidence is high, we accept the label and add the element to the dataset. We eventually retrain the model using the new data.
  • If the confidence is low, we ask a human to hand-label the data
    We rerun the model and go back to 3.

👉 Using this technique, we have a model that is constantly improving.

Every time you run your model, it will learn at the same time by adding every new element to the dataset.
When it’s not sure (it should be 10% of the time), you will do it by hand.

When we see traffic lights from Google Recaptcha, we actually see difficult images, where the model is unsure. 
We are the humans that label the images and improve the model. 
If we label it wrong, there should be a consensus between every human involved; a bit like a bagging algorithm (majority vote). Think about it, the model is already struggling with this data point, and we label it wrong for fun.

Is is that simple?

It can be. But as you might have guessed, there is some complexity in it.

Especially, how do we select the elements to label and the elements to validate?

This is the question that creates a whole field of research.
There are a lot of different techniques, I will simply list a few.

  • Uncertainty sampling
  • Query By Committee
  • Expected Error Reduction
  • Expected Model Change
  • Expected Variance Reduction

The one I mentioned in my image is uncertainty sampling.
We select low confidence results and hand-label them.
The other techniques involve going deeper into the model.

One thing to be careful with is the confidence given by the model.

If everything relies on this, we must be sure that the model doesn’t incorrectly label with 99% accuracy.

Active Learning is a growing field.
Let’s keep an eye on it! In 2020, the algorithms will learn using this technique. Just remember you learned it first here!

Go Further, receive my daily emails on AI and Self-Driving Cars to help you break into the cutting-edge industry.

People on this mailing list have access to exclusive discounts on my online courses that will be shared via emails.

Jeremy Cohen

Artificial Intelligence & Self-Driving Car Engineer, Head Dean of France School of AI, and Machine Learning Lecturer.
I started to help aspiring AI & Self-Driving Car Engineers to land their dream job. Working in the industry of the future requires skills and passion. You can build your skills here, where you'll create relevant projects that are used every day in autonomous robots & AI engines.