<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Read from the most advanced autonomous tech blog]]></title><description><![CDATA[Step 1 — Join 10,000+ Cutting-Edge Engineers who receive my Private daily Emails on Self-Driving Cars, Computer Vision, and Deep Learning.]]></description><link>https://www.thinkautonomous.ai/blog/</link><image><url>https://www.thinkautonomous.ai/blog/favicon.png</url><title>Read from the most advanced autonomous tech blog</title><link>https://www.thinkautonomous.ai/blog/</link></image><generator>Ghost 5.85</generator><lastBuildDate>Sun, 08 Mar 2026 11:10:42 GMT</lastBuildDate><atom:link href="https://www.thinkautonomous.ai/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Perciv AI: The Power of RADAR Deep Learning with Andras Palffy]]></title><description><![CDATA[Perciv AI is building Deep Learning for RADAR algorithms. We could call this 4D/3D Deep Learning. I have recently visited their HQ, and in this post, I'm revealing what I learned...]]></description><link>https://www.thinkautonomous.ai/blog/perciv-ai/</link><guid isPermaLink="false">699439e379f2601e412fb625</guid><category><![CDATA[field interviews]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 17 Feb 2026 11:43:43 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2026/02/perciv-ai-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2026/02/perciv-ai-1.jpg" alt="Perciv AI: The Power of RADAR Deep Learning with Andras Palffy"><p><strong>Ever done a &quot;house swap&quot;?</strong> Recently, one of my mentors in Canada told me he was swapping homes with someone in the Netherlands. Sounds unreal... Yet it isn&#x2019;t. Platforms like Home Exchange apparently have 100,000+ members doing exactly this.</p><p><strong>House swapping is one of those things that could never have worked a decade ago</strong>. Not because the idea was bad (I think it is, but that&apos;s different), but because trust, norms, and infrastructure weren&#x2019;t there.</p><p>And RADAR Deep Learning follows the same pattern.</p><p><strong>RADAR has existed for over 100 years.</strong> Most RADAR algorithmic is still traditional signal processing. As a result, RADAR engineers have long been a small, almost outcast group of &quot;freaks&quot; (sorry) working on systems few people truly understood.</p><p><strong>Why? Because for decades, RADARs were treated as a secondary sensor</strong>. Too noisy. Too low-resolution. 
Useful only as an auxiliary input in sensor fusion, under the assumption that <em>even noisy measurements are better than nothing</em>.</p><p>That assumption is now breaking.</p><p><strong>RADARs are moving into a primary sensor role</strong>:</p><ul><li>high-resolution RADARs exist</li><li>imaging 4D RADARs are spreading (<a href="https://www.thinkautonomous.ai/blog/imaging-radar/" rel="noreferrer">see my article here</a>)</li><li>And more importantly, DEEP LEARNING is now so capable that processing even noisy point clouds can be done!</li></ul><p><strong>This is why in this episode, I am boarding a train to Rotterdam, </strong>where I am meeting with Andras Palffy from <a href="https://www.perciv.ai" rel="noreferrer">Perciv</a>, a startup focused on RADAR Deep Learning.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F935;&#x200D;&#x2642;&#xFE0F;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Andras Who?</strong></b><br>The name is Palffy. Andras Palffy. This machine perception and AI specialist co-founded <b><strong style="white-space: pre-wrap;">Perciv</strong></b>, a Rotterdam based startup focused on AI for RADARs. He wrote multiple 3D Deep Learning papers, and got his Ph.D at the TU Delft (Netherlands).</div></div><p>He&apos;s today running Perciv, and I&apos;m going to show you an amazing video of his work...</p>
<!--kg-card-begin: html-->
<div class="yt-lite">
  <a class="yt-thumb" data-src="SKMIrKBd7sY" target="_blank" rel="noopener noreferrer" href="https://www.youtube.com/watch?v=SKMIrKBd7sY">
  <img src="https://i.ytimg.com/vi/SKMIrKBd7sY/hqdefault.jpg" alt="Perciv AI: The Power of RADAR Deep Learning with Andras Palffy" loading="lazy">
  <span class="yt-play" aria-hidden="true"></span>
  </a>
</div>
<!--kg-card-end: html-->
<p>WOW!!! So cool, isn&apos;t it? Now, in this post, I will cover 2 ideas:</p><ol><li>The <strong>process</strong> of Deep Learning for RADARs (how it works)</li><li>The <strong>applications</strong> you can build when leveraging 4D Deep Learning</li></ol><p>Let&apos;s begin with the process:</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F3AB;</div><div class="kg-callout-text">Grab your Ticket for the Perciv AI Discovery Tour: <a href="https://www.thinkautonomous.ai/perciv-ai">https://www.thinkautonomous.ai/perciv-ai</a></div></div><h2 id="how-to-make-deep-learning-for-radar-work">How to make Deep Learning for RADAR work</h2><p>Let me begin by showing you a demo of Perciv AI&apos;s algorithm:</p>
<!--kg-card-begin: html-->
<iframe src="https://www.linkedin.com/embed/feed/update/urn:li:ugcPost:7374749794465968129?collapsed=1" height="770" width="504" frameborder="0" allowfullscreen title="Embedded post"></iframe>
<!--kg-card-end: html-->
<p><strong>Can you feel the power? </strong>This video shows object detection, but what&apos;s very interesting is how <em>noisy</em> the input is! The points are &quot;dancing&quot;, unlike most <a href="https://www.thinkautonomous.ai/blog/point-clouds/" rel="noreferrer">LiDAR point clouds</a>, which are much more robust and accurate.</p><p>Yet, RADARs provide direct velocity estimation, via the <a href="https://www.thinkautonomous.ai/blog/how-radars-work/" rel="noreferrer">Doppler Effect</a>, making them very interesting sensors to use.</p><p>So how does it work? It&apos;s really 3 steps:</p><ol><li>A RADAR outputs a&#xA0;<u>raw&#xA0;signal</u>.</li><li>This signal is often converted to a 2D or&#xA0;3D&#xA0;<u>point cloud</u>&#xA0;to be processed.</li><li>3D&#xA0;Deep Learning&#xA0;algorithms&#xA0;are working on the point clouds with <a href="https://www.thinkautonomous.ai/blog/voxel-vs-points/" rel="noreferrer">points or voxel approaches</a>, just like for LiDARs.</li></ol><p>Now the interesting element:</p><p><strong>Most traditional RADAR algorithms skip step 2</strong>, because they process the RADAR signal directly (you can see how <a href="https://www.thinkautonomous.ai/blog/how-radars-work/" rel="noreferrer">in this article</a>). In the case of Deep Learning, we have the option to either convert to a point cloud OR process the raw signal directly. This means that step 2 (signal &#x2192; point cloud conversion) can be skipped, which avoids losing data during conversion.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/02/Screenshot-2026-02-17-at-11.40.56--1-.jpg" class="kg-image" alt="Perciv AI: The Power of RADAR Deep Learning with Andras Palffy" loading="lazy" width="2000" height="466" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/02/Screenshot-2026-02-17-at-11.40.56--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/02/Screenshot-2026-02-17-at-11.40.56--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/02/Screenshot-2026-02-17-at-11.40.56--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/02/Screenshot-2026-02-17-at-11.40.56--1-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Process from RADAR signal to output</span></figcaption></figure><p><strong>We now get the general idea:</strong> Thanks to Deep Learning, we can make noisy RADAR data useful. The next question is, what exactly can we do?</p><h2 id="applications-of-deep-learning-for-radars-by-perciv">Applications of Deep Learning for RADARs (By Perciv)</h2><p>Here is a 30 second clip I recorded at Perciv going in-depth of the <strong>sensors</strong>, <strong>algorithms</strong>, and <strong>end</strong>-<strong>user</strong> interface.</p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="https://www.thinkautonomous.ai/blog/content/media/2026/02/11-Panels-Music_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="https://www.thinkautonomous.ai/blog/content/media/2026/02/11-Panels-Music.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;https://www.thinkautonomous.ai/blog/content/media/2026/02/11-Panels-Music_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:26</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">What&apos;s possible using Deep RADARs</span></p></figcaption>
        </figure><p><strong>Let&apos;s begin with the sensors</strong>. Did you count how many there were? I see 1 camera, 2 LiDARs, and one RADAR that has 2 views: <u>a point cloud view</u>, and a <u>range-doppler view</u>. If you zoom in, you&apos;ll see that the RADAR point clouds are absolutely chaotic. There is no way you&apos;d make sense of it. </p><p><strong>And yet, when you see the blue part, in the middle of the video, you see what the Deep RADAR algorithms are capable of</strong>. The algorithmic panel is ALL based on the RADAR input only. And notice how awesome they are, we have:</p><ul><li>LiDAR + RADAR Accumulator</li><li>RADAR Heatmap</li><li>Freespace Detection</li><li>3D/4D Object Detection and Perception</li></ul><p>Seriously...</p><blockquote class="kg-blockquote-alt">A freespace detector... on a RADAR!</blockquote><p>This is really impressive, isn&apos;t it? And it&apos;s not ALL, because later on, Perciv AI showed me a side-by-side comparison of SLAM with RADAR and LiDARs. Can you guess which one was superior? </p><p>Here&apos;s the answer:</p><p>While the RADAR Odometry uses the velocity information and can accurately spot moving points, LiDAR doesn&apos;t, and as a result, overshoots!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/02/embeddable_377f0e2d-e9f9-418f-8fa1-d82c2d5fa822.jpg" class="kg-image" alt="Perciv AI: The Power of RADAR Deep Learning with Andras Palffy" loading="lazy" width="1800" height="919" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/02/embeddable_377f0e2d-e9f9-418f-8fa1-d82c2d5fa822.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/02/embeddable_377f0e2d-e9f9-418f-8fa1-d82c2d5fa822.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/02/embeddable_377f0e2d-e9f9-418f-8fa1-d82c2d5fa822.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/02/embeddable_377f0e2d-e9f9-418f-8fa1-d82c2d5fa822.jpg 1800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">RADAR vs LiDAR Odometry &#x2014;&#xA0;RADAR direct speed provides a superior accuracy</span></figcaption></figure><p>This is a very good example of how Deep Learning for RADAR can be used for advanced applications.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F3AB;</div><div class="kg-callout-text">Interested in how it works? Grab your Ticket for the Perciv AI Discovery Tour: <a href="https://www.thinkautonomous.ai/perciv-ai">https://www.thinkautonomous.ai/perciv-ai</a></div></div><h2 id="summary">Summary</h2><ul><li><strong>Perciv AI builds Deep Learning for RADAR algorithms and they are awesome</strong>. I&apos;ve been following Perciv since 2023, even interviewed them when they were only 3, and their dedication to this field is unmatched.</li><li><strong>In RADAR processing, you can either process raw signal, or convert it to a point cloud</strong> the same way you&apos;d do with LiDARs. A heavier pre-processing step is usually done to reduce noise.</li><li><strong>The RADAR processing pipeline therefore becomes:</strong> signal &#x2192; point cloud &#x2192; 3D Deep Learning algorithms &#x2192; output</li><li><strong>There are many algorithms you can run on RADARs</strong>, from object detection to SLAM. 
In some cases, RADAR&apos;s velocity information can even provide BETTER results than LiDARs.</li></ul><h2 id="infiltrate-perciv-ai-with-me">Infiltrate Perciv AI with me?</h2><p>The last time I visited Perciv AI, I got a complete tour of their facility, team, 4D Deep RADAR algorithms, and even self-driving car. I got to live as an intern on his first day of a self-driving car startup. </p><p><strong>I&apos;m thinking...Wanna see what it&apos;s like? </strong>I mean, what I&apos;ll record there will obviously be top secret, guarded and accessible ONLY to the Edgeneer&apos;s Land citizens (my community membership)....BUT the show?</p><p><strong>This is a show they just did at IAAA Munich to everybod</strong>y. And I see no reason why everybody shouldn&apos;t discover it. This is why I&apos;m creating a special 2-day Virtual Tour,&#xA0;in which you&apos;ll be able to come with me in Rotterdam, be a fly on the wall, and get to live your first day as a self-driving car intern...You will see things like:</p><ul><li>&#x2705; Their self-driving car&#xA0;&#x2014; if you never saw a self-driving car before, this will be the closest you&apos;ll ever get, we&apos;ll see the sensors, wires, everything </li><li>&#x2705; Their 4D Deep RADAR demo&#xA0;&#x2014; where they will demo their algorithms on me! </li><li>&#x2705; Their RADAR tour &#x2014;&#xA0;where they&apos;ll show you what is a RADAR, and give you a tour of the different types in the market</li><li>&#x2705; The RADAR vs LiDAR SLAM video &#x2014; explaining the differences in Odometry estimation and how to do a clean one using RADARs</li></ul><p>As I said, this is the public stuff you normally CAN&apos;T see unless you physically move to where they are. For 99% of people reading this, this is a unique chance to see it. Interested?</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F3AB;</div><div class="kg-callout-text">Grab your Ticket for the Perciv AI Discovery Tour: <a href="https://www.thinkautonomous.ai/perciv-ai">https://www.thinkautonomous.ai/perciv-ai</a></div></div>]]></content:encoded></item><item><title><![CDATA[How the Solid-State LiDAR works (and why everyone bets on it)]]></title><description><![CDATA[The LiDAR industry is changing. The 100k$ mechanical LiDAR is gone; and we currently see incredible a solid-state LiDAR mass-produced for 1,000$ or less. How do these new-gen LiDARs work?]]></description><link>https://www.thinkautonomous.ai/blog/solid-state-lidar/</link><guid isPermaLink="false">697a327cd1ce7c5171ff3592</guid><category><![CDATA[lidar]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Wed, 28 Jan 2026 16:59:40 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2026/01/solid-state-lidar.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/solid-state-lidar.jpg" alt="How the Solid-State LiDAR works (and why everyone bets on it)"><p><strong>In 1607, the Jamestown colony was in a critical situation</strong>. English settlers founded it and declared it their first permanent colony in North America. They arrived with total confidence: they knew how to build a town. So they built wooden houses, palisades, shallow foundations, just the English way. But there was a problem: Jamestown was built on a swamp.</p><p><strong>Within weeks, houses collapsed, mosquitos propagated malaria</strong>, <strong>and the water they were drinking caused fever and poisoning</strong>. 
Within months, half of the settlers died. Yet, the remaining settlers didn&apos;t figure out a better plan, and too much was already decided. It&apos;s only after enduring famine, diseases, and war with locals that they found the right approach, the one that turned Jamestown into the first American colony.</p><p><strong>Solid-state LiDARs are that final method</strong>. In the LiDAR industry, many have experimented with all sorts of sensors, until mutually agreeing on an &quot;ideal&quot; solution: the solid-state LiDAR. Not only could it reduce cost, but it could also significantly improve performance.</p><p><strong>In this article, I am going to explain what a solid-state LiDAR is</strong>, how it works, and more importantly, why it&apos;s a better choice than most other sensors. To truly understand solid-state, we&apos;ll need to also understand mechanical LiDARs, and all their moving parts.</p><p>This will be our first point...</p><h2 id="the-components-of-a-lidar-sensor">The Components of a LiDAR sensor</h2><p>If you want to understand mechanical and solid-state LiDARs, you&apos;ll first need to see the internal components of a LiDAR. Then, we&apos;ll figure out how to classify a solid-state LiDAR based on how these parts move.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/04466d8e-5dbc-4bc5-9f00-6e1804415cae--1-.jpg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1589" height="1258" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/04466d8e-5dbc-4bc5-9f00-6e1804415cae--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/04466d8e-5dbc-4bc5-9f00-6e1804415cae--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2026/01/04466d8e-5dbc-4bc5-9f00-6e1804415cae--1-.jpg 1589w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The different components that can exist in a LiDAR</span></figcaption></figure><p>I am NOT going to describe these one by one, because I would like to instead show you how they all work together. </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This article shows a classification by scanning system. I have <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noreferrer">a complete article breaking down all the different types of LiDARs here</a>. </div></div><p>Keep these in mind, and let&apos;s take a look at...</p><h2 id="from-mechanical-to-solid-state-lidar">From Mechanical to Solid-State LiDAR</h2><h3 id="the-mechanical-360%C2%B0-lidar">The Mechanical 360&#xB0; LiDAR</h3><p><strong>Back in 2017, I took my first LiDAR class.</strong> It featured a Velodyne 64, which is a mechanical LiDAR (Light Detection And Ranging) that became the most famous LiDAR in the autonomous vehicle industry. 
At this time, it was costing over 100,000$, and promised to transform several use cases (indoor, outdoor robotics, SLAM, ...).</p><p>The principle of this LiDAR is simple; multiple lasers are stacked vertically on mechanical rotating components that spin really fast.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/640-4-ezgif.com-optimize.gif" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="550" height="309"><figcaption><span style="white-space: pre-wrap;">Fantastic animation from Hesai LiDARs (</span><a href="https://www.thinkautonomous.ai/blog/loxo/" rel="noreferrer"><span style="white-space: pre-wrap;">source, recommended</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>From here, you start identifying the advantages</strong> (accuracy, 360&#xB0;), but also the drawbacks: it&apos;s terribly <u>costly</u> (100k or so in 2017), and better 3D requires more channels - <u>hence more lasers</u> (bigger sensors).</p><p>This is how we started introducing the second types...</p><h3 id="the-mechanical-mirror-lidars">The Mechanical Mirror LiDARs</h3><p><strong>In this evolution, we no longer rotate the entire sensor, nor use multiple laser pulses, but instead, use mirrors and polygons. </strong>Here is an animation explaining how the next 2 work, that I found in <a href="https://www.youtube.com/watch?v=3EehCU3csJQ" rel="noopener noreferrer">this fantastic video again from Hesai</a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at15.08.46-ezgif.com-optimize.gif" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="800" height="450" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/ScreenRecording2026-01-28at15.08.46-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at15.08.46-ezgif.com-optimize.gif 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Left: A single laser is sent to a mirror which sends it to a polygon. Right: Several lasers are sent to a 1D mirror.</span></figcaption></figure><ul><li><strong>1D Rotating Mirror</strong>: <strong>The first alternative could be a single mirror that deflects the laser.</strong> Think about it, this is genius! We can use a mirror that spins horizontally to recreate that 3D shape. Of course, we&apos;d need multiple lasers stacked, but we fix the problem of having a rotating platform, which can break.</li><li><strong>Polygon-Mirror: Another alternative is to use ONE laser, and deflect it via the use of mirrors and polygons</strong>. In this case, the mirror swings vertically, and the polygon spins horizontally. This creates a 3D representation, which is narrower, can&apos;t spin 360&#xB0;, but produces a functional point cloud.</li></ul><p>These two are great, but still require you to use polygons and mirrors. In a way, it&apos;s still mechanical. So let&apos;s now talk about the true definition of solid-state...</p><h3 id="solid-state-lidars-no-moving-parts">Solid-State LiDARs = &quot;No Moving Parts&quot;</h3><p>The first time I learned about it was around 2021 when a company asked me to help them choose between multiple LiDARs. 
At the time, solid-state technology was emerging, and many were saying it was the future of self-driving cars. The definition was repeated by everyone everywhere;</p><blockquote class="kg-blockquote-alt"><strong>&quot;No Moving Parts&quot;</strong></blockquote><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/aa115a41-3c08-4c82-bf89-dc052687b95a--1-.jpeg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1484" height="754" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/aa115a41-3c08-4c82-bf89-dc052687b95a--1-.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/aa115a41-3c08-4c82-bf89-dc052687b95a--1-.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/2026/01/aa115a41-3c08-4c82-bf89-dc052687b95a--1-.jpeg 1484w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The purest definition of a solid-state LiDAR is that it has no moving part</span></figcaption></figure><p>Huh. What&apos;s so problematic with moving parts? Is that so terrible? Well, yes, because when used all day for weeks and weeks, these parts will simply... break!</p><p>If we compare solid-state to mechanical LiDARs, we can also see that in 100% of the cases, solid-state is a directional sensor. This means you cannot use it on the roof of your car; <u>you have to orient it very strategically, and you must use several of these sensors if you want a 360&#xB0; view</u>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/34516df1-b33d-4b2d-a6c2-829374a54e46--1-.jpeg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1588" height="692" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/34516df1-b33d-4b2d-a6c2-829374a54e46--1-.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/34516df1-b33d-4b2d-a6c2-829374a54e46--1-.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/2026/01/34516df1-b33d-4b2d-a6c2-829374a54e46--1-.jpeg 1588w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">By definition, solid-state LiDARs are directional and can&apos;t rotate to achieve 360&#xB0;</span></figcaption></figure><p>Now, let&apos;s try to understand the differences, and how we can get a 3D point cloud without moving lasers.</p><p>For this, I&apos;ll use the matrix below, which shows the different types of LiDARs based on the components moving. 
(realize you already covered the first 3 dark rows).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/a388303b-6fca-43c0-8976-e12fa2448d83--1-.jpg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="2000" height="1101" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/a388303b-6fca-43c0-8976-e12fa2448d83--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/a388303b-6fca-43c0-8976-e12fa2448d83--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/01/a388303b-6fca-43c0-8976-e12fa2448d83--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/a388303b-6fca-43c0-8976-e12fa2448d83--1-.jpg 2229w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The second part of the matrix: Solid-State is defined by what moves, and how.</span></figcaption></figure><p>Let&apos;s see these, one by one:</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Our 3D Deep Learning Course opens on February 5, 2026</strong></b>. Join the Waitlist now and instantly receive x3 Deep Learning Goodies; an article on voxels vs point based approaches; a 3D Segmentation Map; and a 3D Deep Learning Engineer Survey. <a href="https://www.thinkautonomous.ai/deep-point-clouds-waitlist" rel="noreferrer"><b><strong style="white-space: pre-wrap;">Get Access here</strong></b></a>.</div></div><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/deep-point-clouds-waitlist"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Deep Point Clouds Waitlist</div><div class="kg-bookmark-description">Deep Point Clouds Waitlist</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://statics.myclickfunnels.com/workspace/jkOnBQ/image/17505373/file/d75ff10b6ec59cdd182050243b59b7e8.png" alt="How the Solid-State LiDAR works (and why everyone bets on it)"></div></div><div class="kg-bookmark-thumbnail"><img src="https://statics.myclickfunnels.com/workspace/jkOnBQ/image/17945016/file/d70c4de9c26e10f34ad8513bd4e1166c.jpeg" alt="How the Solid-State LiDAR works (and why everyone bets on it)"></div></a></figure><h4 id="mems-micro-electromechanical-system"><strong>MEMS (Micro-electromechanical system)</strong></h4><p><strong>In a MEMS LiDAR, you&apos;re projecting one laser to a MEMS mirror that oscillates both horizontally and vertically.</strong> It mimics the LiDAR + mirror rotation, but it&apos;s now an oscillation at the micro level.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at15.28.44-ezgif.com-optimize.gif" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="800" height="450" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/ScreenRecording2026-01-28at15.28.44-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at15.28.44-ezgif.com-optimize.gif 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">You can learn more about this on </span><a 
href="https://www.youtube.com/watch?v=g7gHm-38t_s" target="_blank" rel="noopener noreferrer"><span style="white-space: pre-wrap;">the Fraunhofer IPMS video where this animation is from</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><p><strong>MEMS mirrors still move, so MEMS LiDARs are not &quot;true&quot; solid-state</strong>. Yet, they are excellent alternatives to the mirrors, more resistant to vibrations, and shocks. When looking in more details, LiDAR makes either use a 2D MEMS mirror, or two 1D MEMS Mirror, oscillating horizontally and vertically.</p><h4 id="opa-optical-phased-array"><strong>OPA (Optical Phased Array)</strong></h4><p><strong>What is a LiDAR?</strong> It&apos;s a device that sends a <u>light wave</u>. Correct? Well, a light wave is a... wave. Yes? And a wave is something we understand. It has an amplitude, a phase, a frequency, and a wavelength! In an OPA LiDAR, we use a <u>phase shifter</u> to electronically steer the light wave. This sounds crazy, but it works. This is really modern, new generation, and a &quot;true&quot; solid-state system, since no part is moving.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at14.30.52-ezgif.com-optimize.gif" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="640" height="283" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/ScreenRecording2026-01-28at14.30.52-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/ScreenRecording2026-01-28at14.30.52-ezgif.com-optimize.gif 640w"><figcaption><span style="white-space: pre-wrap;">OPA LiDAR (</span><a href="https://www.youtube.com/watch?v=xEqV879qDNE" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h4 id="flash-lidars"><strong>Flash</strong> <strong>LiDARs</strong></h4><p><strong>In a Flash LiDAR, a diffuser projects a wide, diffused laser illumination which comes back to an array detector,</strong> creating a full 3D image in a single exposure. <u>This is a non-scanning technology; everything is illuminated at once</u>.</p><p>Was that clear? 
Well, imagine being in the dark, and trying to illuminate the room.</p><ul><li>You can either agitate a red laser all over the place (scanning devices - MEMS, OPA, ...)</li><li>Or you can use a torch, which instantly illuminates the room.</li></ul><p><strong>This is what a Flash LiDAR does</strong>,<strong> it&apos;s a laser torch.</strong></p><h4 id="solid-state-summary">Solid-State Summary</h4><p>Cool, a quick summary of the last 3?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/f5a11054-46b0-4260-853a-c10349daf147--1-.jpeg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1790" height="654" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/f5a11054-46b0-4260-853a-c10349daf147--1-.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/f5a11054-46b0-4260-853a-c10349daf147--1-.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/01/f5a11054-46b0-4260-853a-c10349daf147--1-.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/f5a11054-46b0-4260-853a-c10349daf147--1-.jpeg 1790w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The different types of solid-state LiDARs</span></figcaption></figure><p>We now have a good understanding of Solid-State. The question I want to continue with is...</p><h2 id="how-is-solid-state-better-than-mechanical-lidar-technology">How is Solid-State better than Mechanical LiDAR technology?</h2><p>There are several aspects that you can already guess, but I&apos;d like to take these one by one anyway.</p><h3 id="better-durability-no-moving-parts"><strong>Better durability (no moving parts)</strong></h3><p><strong>Mechanical LiDARs <u>have moving parts</u></strong>, which wear out over time and increase the risk of failure in automotive environments (vibration, heat, dust). This risk is real for MEMS (which we saw is partly mechanical), but completely reduced for OPAs and Flash LiDARs. <u>The #1 advantage of using a solid-state LiDAR is this.</u></p><h3 id="compact-lightweight-design">Compact &amp; lightweight Design</h3><p><strong>A mechanical LiDAR HAS to be on the roof of a vehicle. </strong>This is not only ugly, but also impractical. On the other hand, a solid-state LiDAR can be nicely integrated in the front of a vehicle. This makes Mechanical LiDAR not such a good option. When you look at the ADAS (Advanced Driver Assistance System) industry, most companies like BMW, Mercedes-Benz, etc... include MEMS LiDARs in the front. Its small size makes it ideal for integration into space-constrained platforms like drones and autonomous vehicles.</p><p>Let&apos;s continue:</p><h3 id="mass-production-capability">Mass Production Capability</h3><p><strong>Manufactured using semiconductor processes</strong>, solid-state LiDARs can be mass produced with lower costs. MEMS are currently the cheapest, but OPAs promise to reach incredible costs (100$ or less). 
The math makes sense, we got lower size and lower cost, which is always the direction we want to go towards in hardware.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/IDTechEx_Lidar_chart.jpg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="500" height="317"><figcaption><span style="white-space: pre-wrap;">The cost of LiDAR based on their types </span><a href="https://www.idtechex.com/en/research-report/lidar-2024-2034/995" rel="noreferrer"><span style="white-space: pre-wrap;">(source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h3 id="point-cloud-resolution-high-performance">Point Cloud Resolution &amp; High Performance</h3><p><strong>A mechanical LiDAR solution based on spinning mechanics often provides sparser point clouds</strong>, especially vertically, with gaps in coverage compared to dense sensors like cameras. This can lead to blind spots for low or small obstacles. On the other hand, a solid-state LiDAR can capture hundreds of thousands of points per second, and has a higher angular resolution, which is very good for tasks like 3D mapping or obstacle detection.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/c8bb9ea792d69ebb06c349da85d46b15.jpg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1024" height="342" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/c8bb9ea792d69ebb06c349da85d46b15.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/c8bb9ea792d69ebb06c349da85d46b15.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2026/01/c8bb9ea792d69ebb06c349da85d46b15.jpg 1024w" sizes="(min-width: 720px) 720px"></figure><ul><li>With this, a solid-state LiDAR has lower power consumption (good when using drones for example), could resist environmental conditions better, scan faster, and have a flexible field of view.</li><li>Other than the field of view, the modulation itself is very much manageable; most FMCW (frequency modulated continuous wave) LiDARs are for example based on Solid-State, and NOT mechanical.</li></ul><p><strong>In industries like self-driving cars,</strong> smart cities, industrial automation, robotics, using something with high resolution, high accuracy, good enough distance/range, and potentially a wide field of view makes total sense.</p><h2 id="range-resolution-performance">Range, Resolution, Performance?</h2><p>The following is to take with a pinch of salt, because it varies very often and some companies have crazy claims. Yet, I also looked at studies like <a href="https://www.idtechex.com/en/research-report/lidar-2024-2034/995" rel="noopener noreferrer">this one from IDtechEx</a>, <a href="https://www.mdpi.com/2072-666X/11/5/456" rel="noopener noreferrer">this one on MEMS mirrors</a><strong> </strong>, and <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/lpor.202100511" rel="noopener noreferrer">that one on OPAs</a>. 
Here is an overview:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/73c6419a-b20f-42fc-bced-dbd772e89eb3--1-.jpeg" class="kg-image" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy" width="1884" height="964" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/73c6419a-b20f-42fc-bced-dbd772e89eb3--1-.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/73c6419a-b20f-42fc-bced-dbd772e89eb3--1-.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/01/73c6419a-b20f-42fc-bced-dbd772e89eb3--1-.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/73c6419a-b20f-42fc-bced-dbd772e89eb3--1-.jpeg 1884w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Comparing the different types of sensors based on range, field of view, cost, and resolution. It&apos;s highly incomplete, but gives you an idea.</span></figcaption></figure><p>Can you see why MEMS, even though it is not really solid-state, is the BEST compromise? It&apos;s the only one that can currently be mass-produced at a low price, while keeping good range and high resolution.</p><p><strong>You can therefore see how MEMS and Mechanical LiDARs are still the ones being used the most in the industry. </strong>True solid-state is a crazy dream, with incredible claims (an OPA LiDAR could reach a cost of 100$). For now, we aren&apos;t there yet.</p><h2 id="example-1-innoviz-technologies">Example 1: Innoviz Technologies</h2><p>At CES 2026, I explored solid-state LiDARs with Seyond &amp; Innoviz. On the one hand, Seyond, which you already saw, is doing Flash LiDARs, which is &quot;true&quot; solid-state. On the other, Innoviz is very likely doing MEMS, which is... hybrid (still following?).</p><p>I would like to start with Innoviz Technologies&apos; latest demo:</p>
<!--kg-card-begin: html-->
<div class="yt-lite">
    <a class="yt-thumb" data-src="JF8rhmANxJM" target="_blank" rel="noopener noreferrer" href="https://www.youtube.com/watch?v=JF8rhmANxJM">
    <img src="https://i.ytimg.com/vi/JF8rhmANxJM/hqdefault.jpg" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy">
    <span class="yt-play" aria-hidden="true"></span>
    </a>
</div>
<!--kg-card-end: html-->
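<p>By the way, if you want to picture how a MEMS LiDAR &quot;paints&quot; the scene, here is a tiny toy sketch I wrote (my own illustration, not Innoviz&apos;s actual drive scheme; the frequencies and amplitudes are made up) of how a two-axis oscillating mirror traces a scan pattern:</p>
<pre><code class="language-python">import numpy as np

# Toy model of a 2D MEMS mirror: each axis oscillates sinusoidally.
# Driving the two axes at different frequencies traces a Lissajous-like
# scan pattern. All numbers below are illustrative, not from a datasheet.
t = np.linspace(0.0, 0.1, 50_000)                 # 100 ms of scanning
theta_x = 10.0 * np.sin(2 * np.pi * 1500 * t)     # fast axis tilt, degrees
theta_y = 5.0 * np.sin(2 * np.pi * 110 * t)       # slow axis tilt, degrees

# A mirror deflects the reflected beam by twice its own tilt angle.
beam_az = 2 * theta_x
beam_el = 2 * theta_y

print("Horizontal field of view:", beam_az.max() - beam_az.min(), "degrees")
print("Vertical field of view:", beam_el.max() - beam_el.min(), "degrees")
# Plotting beam_az against beam_el would show the dense scan pattern that
# replaces the spinning assembly of a mechanical LiDAR.
</code></pre>
<p>The same idea applies to the two-1D-mirror variant, except each mirror handles a single axis.</p>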
<p>Did you see how awesome the Innoviz demo looks? Notice how the benefits here are related to cost reduction, size shrinking, and heat/power reduction. Now, let&apos;s see a demo of a Flash LiDAR:</p><h2 id="example-2-seyond-flash-lidars">Example 2: Seyond Flash LiDARs</h2><p>Here is the second example, where <a href="https://www.seyond.com/" rel="noreferrer">Seyond</a> gives you an amazing overview of a Flash LiDAR (Hummingbird). This video is originally from my membership The Edgeneer&apos;s Land - make sure to <strong>be in my daily emails to learn more</strong>.</p>
<!--kg-card-begin: html-->
<div class="yt-lite">
    <a class="yt-thumb" data-src="-71Cb5V3nfI" target="_blank" rel="noopener noreferrer" href="https://www.youtube.com/watch?v=-71Cb5V3nfI">
    <img src="https://i.ytimg.com/vi/-71Cb5V3nfI/hqdefault.jpg" alt="How the Solid-State LiDAR works (and why everyone bets on it)" loading="lazy">
    <span class="yt-play" aria-hidden="true"></span>
    </a>
</div>
<!--kg-card-end: html-->
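<p>Since a Flash LiDAR is essentially a time-of-flight camera, here is a small sketch of my own (simulated numbers and an assumed pinhole model, not Seyond&apos;s pipeline) showing how per-pixel travel times become a full 3D point cloud in a single exposure:</p>
<pre><code class="language-python">import numpy as np

# Simulated Flash LiDAR frame: every pixel of the detector array measures
# the round-trip time of the same single light pulse. The resolution and
# intrinsics below are assumptions for illustration only.
H, W = 120, 160                      # detector array size (pixels)
fx = fy = 200.0                      # assumed focal lengths (pixels)
cx, cy = W / 2.0, H / 2.0            # assumed principal point

c = 299_792_458.0                                      # speed of light (m/s)
rng = np.random.default_rng(0)
round_trip_s = rng.uniform(1e-8, 4e-7, size=(H, W))   # 10 ns to 400 ns

# Time of flight: the pulse travels to the surface and back, so divide by 2.
depth = c * round_trip_s / 2.0                         # roughly 1.5 m to 60 m

# Back-project every pixel with a pinhole model: one 3D point per pixel.
u, v = np.meshgrid(np.arange(W), np.arange(H))
X = (u - cx) * depth / fx
Y = (v - cy) * depth / fy
points = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

print(points.shape)   # (19200, 3): a whole point cloud from ONE exposure
</code></pre>
<p>No mirror, no motor: the &quot;scanning&quot; is done by the detector array itself, which is exactly why this design counts as true solid-state.</p>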
<p>Alright, let&apos;s do a summary...</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><p>Here is a bullet point summary of the article:</p><ul><li><strong>The robotics &amp; LiDAR industry tends to use 2 types of LiDARs</strong>: Mechanical and Solid-state. While the former has moving parts, the later doesn&apos;t.</li><li><strong>Solid-State LiDARs come in 3 categories: </strong>MEMS (with moving mirrors), OPA (true solid-state with no moving parts), and Flash LiDAR (projects laser arrays for instantaneous scene capture). They are all directional, lower power, higher resolution, but shorter range and lower reliability than those with mechanical movement.</li><li><strong>LiDAR technology is about sending a laser</strong> to the world and measuring the time a wave takes to hit a surface and come back. Yet, this can be done via several processes.</li><li><strong>The semiconductor manufacturing process allows solid-state LiDAR to be mass-produced at lower cost,</strong> making it more accessible for automotive and industrial applications.</li><li><strong>Solid-state LiDAR technology is advancing rapidly and is becoming the default choice</strong> for applications requiring high performance, compactness, and reliability, including self-driving cars and smart cities.</li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Our 3D Deep Learning Course opens on February 5, 2026</strong></b>. Join the Waitlist now and instantly receive x3 Deep Learning Goodies; an article on voxels vs point based approaches; a 3D Segmentation Map; and a 3D Deep Learning Engineer Survey. <a href="https://www.thinkautonomous.ai/deep-point-clouds-waitlist" rel="noreferrer"><b><strong style="white-space: pre-wrap;">Get Access here</strong></b></a>.</div></div><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/deep-point-clouds-waitlist"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Deep Point Clouds Waitlist</div><div class="kg-bookmark-description">Deep Point Clouds Waitlist</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://statics.myclickfunnels.com/workspace/jkOnBQ/image/17505373/file/d75ff10b6ec59cdd182050243b59b7e8.png" alt="How the Solid-State LiDAR works (and why everyone bets on it)"></div></div><div class="kg-bookmark-thumbnail"><img src="https://statics.myclickfunnels.com/workspace/jkOnBQ/image/17945016/file/d70c4de9c26e10f34ad8513bd4e1166c.jpeg" alt="How the Solid-State LiDAR works (and why everyone bets on it)"></div></a></figure>]]></content:encoded></item><item><title><![CDATA[LOXO: How to certify End-To-End algorithms in production with Jonathan Péclat]]></title><description><![CDATA[How do you make end-to-end deep learning algorithms certified in production? When you have no way to grade each block individually? 
Jonathan Péclat from Loxo explains that to us.]]></description><link>https://www.thinkautonomous.ai/blog/loxo/</link><guid isPermaLink="false">62e120112ee42fb76dbfe4e2</guid><category><![CDATA[field interviews]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 20 Jan 2026 10:43:40 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2026/01/loxo.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/loxo.jpg" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat"><p><strong>On June 4, 1996, the Ariane 5 rocket was ready to be launched after years of work</strong>, public funding, and political pressure. The stress was at its maximum, but after just 40 seconds, the rocket exploded, causing a loss of over 370M$.</p><p><strong>This event is one of the best known in software engineering</strong>, in particular because of the cause of the crash:&#xA0;<u>a float-to-int conversion</u>.<strong> </strong>Indeed, the engineers reused the code from Ariane 4 to launch Ariane 5, but forgot that a&#xA0;<em>float64</em>&#xA0;storing the horizontal velocity would be converted to a signed&#xA0;<em>int16</em>. 40 seconds into launch, the conversion failed and <strong><em>crashed</em></strong> the rocket.</p><p><strong>I think this story can be a perfect introduction to the domain of autonomous vehicle safety,</strong> which we&apos;ll cover today with our guest Jonathan P&#xE9;clat from Loxo.</p><p>A quick intro:</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><a href="https://www.linkedin.com/in/jonathan-p%C3%A9clat-40bb678a/" target="_blank" rel="noopener noreferrer"><b><strong style="white-space: pre-wrap;">Jonathan P&#xE9;clat</strong></b></a> is the Head of Software Architecture at <a href="https://www.loxo.ch/en/" target="_blank" rel="noopener noreferrer">LOXO</a>. He provided me with fantastic insights on their redundancy approach to make vehicles compliant while using cutting-edge algorithms like End-To-End Deep Planners.</div></div><p><a href="https://www.loxo.ch/en/" rel="noreferrer">Loxo</a> is a Swiss-based company started in 2022, when they built a first prototype of an autonomous shuttle. Since then, it evolved into this vehicle that now operates in Germany &amp; Switzerland.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/54ceae5c-58bd-495c-b66b-bd0675ee59a9.gif" class="kg-image" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat" loading="lazy" width="479" height="307"><figcaption><span style="white-space: pre-wrap;">Loxo&apos;s autonomous driver in the streets</span></figcaption></figure><p>These robots navigate real streets, interact with real traffic, and do so using an architecture powered by End-to-End Deep Learning.</p><p>I find this incredible, because End-To-End Learning is purely AI based. It&apos;s data-based: you don&apos;t explicitly program the vehicle to stop at a red light, but show it via examples from the dataset.</p>
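<p>To make that difference concrete, here is a deliberately simplified sketch of my own (illustrative Python, not LOXO&apos;s code): in a modular stack the rule is written by an engineer, while in an End-To-End stack the &quot;rule&quot; only lives inside the weights of a trained network:</p>
<pre><code class="language-python">def modular_planner(detections):
    # Modular stack: an engineer wrote the rule explicitly, so a certifier
    # can inspect and test this block in isolation.
    if "red_light" in detections:
        return "brake"
    return "keep_speed"


def end_to_end_planner(sensor_frame, policy_net):
    # End-To-End stack: no explicit rule anywhere. The behavior "stop at a
    # red light" only exists implicitly in the trained weights, which is
    # exactly what makes it hard to grade block by block.
    return policy_net(sensor_frame)


def fake_policy_net(frame):
    # Hypothetical stand-in for a trained network, so the sketch runs.
    return "brake"


print(modular_planner({"red_light"}))                        # brake
print(end_to_end_planner("camera_frame", fake_policy_net))   # brake
</code></pre>
<p>Keep that contrast in mind when looking at the diagram below:</p>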
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/Screenshot-2026-01-20-at-11.23.55--1-.jpg" class="kg-image" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat" loading="lazy" width="1438" height="488" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/Screenshot-2026-01-20-at-11.23.55--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/Screenshot-2026-01-20-at-11.23.55--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2026/01/Screenshot-2026-01-20-at-11.23.55--1-.jpg 1438w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Modular vs End-To-End</span></figcaption></figure><p>While a modular approach is pretty straightforward, and certification is about evaluating each individual block (is the lane detection safe? is the obstacle detection safe?)...</p><p>... <strong>End-To-End approaches are much more complex to evaluate</strong>, because they only output the final driving decision. I have <a href="https://www.thinkautonomous.ai/blog/autonomous-vehicle-architecture/" rel="noreferrer">an entire article covering the differences here</a>.</p><p>So I asked Jonathan:</p><h3 id="how-do-you-make-end-to-end-learning-safe"><strong>&quot;How do you make End-To-End Learning safe?&quot;</strong></h3><p>Here is what he explained:</p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet01b_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet01b.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" playsinline preload="metadata" style="background: transparent url(&apos;https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet01b_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">1:46</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;"> LOXO uses End-To-End Learning in Production</span></p></figcaption>
        </figure><p>As Jonathan pointed out: </p><blockquote class="kg-blockquote-alt">&#x201C;You cannot really prove that AI is safe, not today. So we run our AI system in parallel with another component that verifies the trajectory. If the AI violates any predefined rule, we switch to a deterministic safe path.&#x201D;</blockquote><p>This point is crucial, because several self-driving car companies use exactly the same approach. LOXO does not rely on a single neural network but on <strong>four independent channels</strong> (two AI channels, and two deterministic channels) running in parallel, each serving a different role in verifying, supervising, or backing up the End-to-End planner.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/embeddable_b2a2de66-b368-4269-9018-38f1058df12d.png" class="kg-image" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat" loading="lazy" width="770" height="303" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/embeddable_b2a2de66-b368-4269-9018-38f1058df12d.png 600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/embeddable_b2a2de66-b368-4269-9018-38f1058df12d.png 770w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The LOXO Architecture isn&apos;t just ONE neural network, but 4 separate channels</span></figcaption></figure><p><strong>LOXO&#x2019;s architecture is a clear illustration of the principle of redundancy:</strong> multiple algorithms and points of view, and, instead of a single point of failure, a structure that catches, compensates for, and, if needed, overrides failures.<strong> </strong><u>This is how an End-to-End system becomes certifiable and safer.</u></p><p>The key point to understand is that companies relying on End-To-End do not use just that one approach; they run multiple algorithms in parallel that verify and contradict each other. <a href="https://www.thinkautonomous.ai/sdc-app" rel="noreferrer"><strong>I have a complete breakdown of how Mobileye does it with their own End-To-End approach here, if you&apos;re interested</strong></a>.</p><p>Still, a question remains: </p><h4 id="what-exactly-do-you-make-redundant"><strong>What exactly do you make redundant?</strong> </h4><p>The sensors? The algorithms? What is even redundancy? This is my next question for Jonathan, who then explains the safety fundamentals of ASIL scoring and decomposition, using among other things a grading from A (least critical) to D (most critical):</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet04-Asil_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet04-Asil.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" playsinline preload="metadata" style="background: transparent url(&apos;https://www.thinkautonomous.ai/blog/content/media/2026/01/TAmember_LoxoInterview_Snippet04-Asil_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">2:32</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p>The entire principle relies on the concept of Functional Safety with ASIL Decomposition. This is a job of its own, often built around ISO standards, but if you&apos;d like to explore it, I have a complete article covering how it works here:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/blog/functional-safety/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Functional Safety Engineer: The Job that &#x2018;certifies&#x2019; self-driving cars</div><div class="kg-bookmark-description">What is functional safety in self-driving cars? What does a functional safety engineer do? In this post, we&#x2019;ll try to understand how to certify a self-driving car code, and make it safe to drive in the streets</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.thinkautonomous.ai/blog/content/images/size/w256h256/2023/01/favicon.png" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat"><span class="kg-bookmark-author">Read from the most advanced autonomous tech blog</span><span class="kg-bookmark-publisher">Jeremy Cohen</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/functional-safety.webp" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat"></div></a></figure><p><strong>Realize that this doesn&apos;t stop here. </strong>In my interview with Jonathan, he explains the step-by-step framework LOXO implements, along with the internal documents they use to grade a function, evaluate its risk, and decide whether or not to make it redundant.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2026/01/loxo-process.001.jpeg" class="kg-image" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2026/01/loxo-process.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2026/01/loxo-process.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2026/01/loxo-process.001.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2026/01/loxo-process.001.jpeg 1920w" sizes="(min-width: 720px) 720px"></figure><p>It&apos;s a complete masterclass we have inside <a href="https://www.thinkautonomous.ai/the-edgeneers-land" rel="noreferrer">The Edgeneer&apos;s Land</a>, our community membership experience.</p><p>But for now, let&apos;s do a brief summary:</p><h2 id="summary">Summary</h2><ul><li><strong>When a self-driving car company uses End-To-End Learning</strong>, a single machine-learning model directly maps raw sensor data to driving actions or trajectories, without any manually written rules.</li><li><strong>While this can simplify system design and improve performance</strong>, it also makes the system harder to interpret, verify, and certify, especially in safety-critical and regulated environments.</li><li><strong>Companies like LOXO often use redundant channels</strong> built the opposite way from the End-To-End channel: point cloud processing, clustering, extraction, and very deterministic approaches that try to validate what the AI says.</li><li><strong>Functional Safety Systems like ASIL Decomposition</strong> still apply to End-To-End, and there are many processes used to certify self-driving car
algorithms.</li></ul><p><strong>Next Steps?</strong> <br>If you want to go deeper into how safety is formally addressed in the autonomous driving industry (how risks are identified, graded, reduced, and documented), I detail the full process in this <a href="https://www.thinkautonomous.ai/blog/functional-safety/" rel="noopener noreferrer">blog post</a> about functional safety.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/blog/functional-safety/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Functional Safety Engineer: The Job that &#x2018;certifies&#x2019; self-driving cars</div><div class="kg-bookmark-description">What is functional safety in self-driving cars? What does a functional safety engineer do? In this post, we&#x2019;ll try to understand how to certify a self-driving car code, and make it safe to drive in the streets</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.thinkautonomous.ai/blog/content/images/size/w256h256/2023/01/favicon.png" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat"><span class="kg-bookmark-author">Read from the most advanced autonomous tech blog</span><span class="kg-bookmark-publisher">Jeremy Cohen</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/functional-safety.webp" alt="LOXO: How to certify End-To-End algorithms in production with Jonathan P&#xE9;clat"></div></a></figure><p>If you&apos;d like to get the complete masterclass from Jonathan, I recommend checking out <a href="https://www.thinkautonomous.ai/the-edgeneers-land" rel="noreferrer">The Edgeneer&apos;s Land</a>.</p>]]></content:encoded></item><item><title><![CDATA[LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry]]></title><description><![CDATA[Since the beginning of the self-driving car era, many people wanted to compare LiDAR vs RADAR. It didn't make sense: these sensors were complementary back then. Today, in the age of 4D, the LiDAR vs RADAR comparison makes real sense, let's see...]]></description><link>https://www.thinkautonomous.ai/blog/fmcw-lidars-vs-imaging-radars/</link><guid isPermaLink="false">62a25f550f1a5e26a580b87a</guid><category><![CDATA[lidar]]></category><category><![CDATA[robotics]]></category><category><![CDATA[sensor fusion]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Wed, 29 Oct 2025 11:02:00 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2023/09/lidar-vs-radar--1-.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2023/09/lidar-vs-radar--1-.webp" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry"><p><strong>Back in 2020, a company contacted me because they needed my opinion on a robotic sensor stack they were working on.</strong> They had 2 days to finalize the decision on a sensor suite that would equip their autonomous delivery pods. Like many, they were considering using a combination of all sensors: cameras, LiDARs, RADARs, and even ultrasonic sensors. But they also had concerns, and were wondering whether something better was available.</p><p><strong>But in 2020, the combination of a LiDAR, a camera, and a RADAR was what made the most sense. </strong>&quot;These sensors are complementary&quot; I would reply. 
&quot;The LiDAR is the most accurate sensor to detect a distance, the camera is best for scene understanding, and the RADAR can see through objects and directly estimate velocities&quot;.</p><p><strong>Is this still true?</strong> Don&apos;t we have camera-only systems today? Don&apos;t we have LiDAR-only systems that bypass RADARs? And don&apos;t we have RADARs that are getting as good as, if not better than, LiDARs? I think the idea of &quot;complementarity&quot; is changing. Today, sensors are getting more capable. FMCW LiDARs can detect speed, and Imaging RADARs can create accurate point cloud representations.</p><p>So, what is true and what isn&apos;t?</p><p><strong>Let&apos;s take a look via this article in 3 points:</strong></p><ul><li>The Traditional LiDAR vs RADAR comparison</li><li>The new LiDAR and RADAR sensors in self-driving cars</li><li>LiDARs vs RADARs: The Modern Comparison</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text">Warning Graphic Content: <b><strong style="white-space: pre-wrap;">Ever gutted out a LiDAR?!</strong></b> What does it look like inside? I recorded a video explaining how an emitter and a receiver work &amp; how it all fits together internally. <br>Watch it <a href="https://edgeneers.thinkautonomous.ai/posts/content-library-updates-slamtechs-rp-lidar-ungutting" rel="noopener noreferrer">here in my private app.</a></div></div><h2 id="traditional-lidar-and-radar-technology-comparison">Traditional<strong> LiDAR and RADAR technology comparison</strong></h2><p>I believe the following no longer makes sense, but I am going to show it to you anyway, because this is what you&apos;ll see in 99% of other posts about the topic. Here is the idea in 3 points:</p><h3 id="1lidars-are-great-for-distance-estimation">1 - LiDARs are great for distance estimation</h3><p><strong>LiDAR</strong> <strong>(Light Detection and Ranging) is a technology that leverages laser light to measure distances and create detailed 3D maps of objects and environments.</strong> When you look at distance estimators today, the LiDAR is often used as the <u>&quot;ground truth&quot;</u>. LiDAR systems operate by emitting laser pulses (waves) and calculating the time it takes for the light to come back. This idea is called &quot;Time of Flight&quot; - and although there are multiple <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noopener noreferrer">types of LiDARs</a>, this is the overall idea.</p><p>Here is an example:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/02/tof-lidar.webp" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="800" height="358" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2023/02/tof-lidar.webp 600w, https://www.thinkautonomous.ai/blog/content/images/2023/02/tof-lidar.webp 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How a Time Of Flight LiDAR works</span></figcaption></figure><p>Now, what does it produce? The answer is a <a href="https://www.thinkautonomous.ai/blog/point-clouds/" rel="noopener noreferrer">point cloud</a> of the environment. But not all point clouds look the same.</p><h4 id="2d-vs-3d-lidars">2D vs 3D LiDARs</h4><p>Because I&apos;m going to talk about 4D LiDARs, I have to explain the idea of a 2D and a 3D LiDAR first. 
The idea is well explained in my post &quot;<a href="https://www.thinkautonomous.ai/blog/2d-lidar/" rel="noopener noreferrer"><strong>2D LiDARs: Too Weak for Self-Driving Cars?</strong></a>&quot;, in which I explain that LiDARs use vertical &quot;channels&quot; or layers, and that the more channels you have, the better your 3D resolution.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/Screenshot-2024-11-04-at-17.47.34--1-.jpg" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="1120" height="792" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/Screenshot-2024-11-04-at-17.47.34--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/Screenshot-2024-11-04-at-17.47.34--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/10/Screenshot-2024-11-04-at-17.47.34--1-.jpg 1120w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">LiDAR Resolution depends on the number of channels - 1 layer means your LiDAR only sees in 2D. (</span><a href="https://www.thinkautonomous.ai/blog/2d-lidar/" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h4 id="what-more-channels-bring">What more channels bring</h4><p><strong>A LiDAR sends out laser pulses</strong> to measure <u>distances</u> and create detailed 3D maps. But the drawback is that if you want to measure a velocity, you need to compute the difference between 2 consecutive timestamps. How has the point cloud moved in the last second? At low speed, this is good enough, but at high speed, waiting for 2 frames can mean several meters travelled before you even start braking.</p><p>This is why we also like to combine it with a RADAR. Let&apos;s see it:</p><h3 id="2radars-are-great-velocity-estimators">2 - RADARs are great velocity estimators</h3><p><strong>RADAR stands for Radio Detection And Ranging</strong>. It works by emitting electromagnetic waves that reflect when they meet an obstacle. Unlike cameras or LiDARs, RADAR relies on radio waves that work under any weather condition and can even see underneath obstacles. It uses the &quot;Doppler Effect&quot; to measure the velocity of obstacles<em>.</em></p><p><strong>RADAR technology is very mature </strong>(&gt;100 years old), and is used in various industries, including aviation, where it is crucial for air traffic control, cars, missile detection, and even weather forecasting. 
<u>However, most RADARs work in 2D.</u> Haaaaa - yes, this is what we got: <strong>X and Y, but no Z</strong>, exactly like a one-channel LiDAR.</p><p>Should I show you a sample point cloud from a RADAR?</p><h4 id="output-from-a-radar-system">Output from a RADAR system</h4><p>Let me show you the real output from a RADAR sensor:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/02/ezgif.com-gif-to-webp.webp" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="345" height="265"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://www.youtube.com/watch?v=N_8ONE9WqXw" rel="noopener noreferrer"><u><span class="underline" style="white-space: pre-wrap;">source</span></u></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>I mean, can you tell where there is a vehicle? </strong>Whether we should stop or not? It&apos;s complete garbag&#x2014; wait, if people use it, it&apos;s gotta be useful, right? And yes, it is, because while we only have a noisy 2D point cloud, each of these points also provides 1D velocity information. RADARs tell us whether the points are going away from us, or towards us, and how fast.</p><p>Using Point Cloud Processing, Deep Learning (often trained on LiDAR data), or even <a href="https://thinkautonomous.ai/blog/introduction-to-radar-camera-fusion" rel="noopener noreferrer"><u>RADAR/Camera Fusion</u></a>, we can even get a result like this:</p>
<!--kg-card-begin: html-->
<figure class="kg-card kg-image-card kg-card-hascaption">
<video class="lazy" style="max-width:100%" controls poster="https://www.thinkautonomous.ai/blog/content/images/2023/04/radarcamera.webp" preload="none" muted loop playsinline>
<source src="https://www.thinkautonomous.ai/blog/content/media/2023/04/radarcamera.mp4" type="video/mp4">
</video>
<figcaption>A RADAR fused with a camera (<a href="https://www.youtube.com/watch?v=Xk5xbxHTt00" rel="noopener noreferrer"><u>source</u></a>)</figcaption>
</figure>
<!--kg-card-end: html-->
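<p>What makes this kind of result possible is that every RADAR return carries a radial (Doppler) velocity on top of its position. Here is a minimal sketch of how that extra channel can be used to separate movers from static clutter; the numbers, the threshold, and the ego-motion handling are made up for illustration and are not the pipeline behind the clip above:</p>
<pre><code class="language-python">import numpy as np

# Each RADAR return: x [m], y [m], radial velocity [m/s] (negative = approaching).
# Values below are invented for the example.
points = np.array([
    [12.0,  0.5,  -7.9],   # guardrail: "approaches" only because we drive towards it
    [25.0, -1.2, -15.0],   # oncoming car, closing fast
    [18.0,  3.0,   4.2],   # car ahead, pulling away
])
ego_speed = 8.0            # our own speed [m/s], assumed known from odometry

xy = points[:, :2]
radial = points[:, 2]
ray = xy / np.linalg.norm(xy, axis=1, keepdims=True)   # unit vector from the sensor to each point

# Assuming the ego vehicle moves along +x: a static point ahead of us appears
# to approach at ego_speed projected onto its ray.
expected_static = -ego_speed * ray[:, 0]
compensated = radial - expected_static

is_moving = np.abs(compensated) &gt; 1.0   # anything faster than ~1 m/s gets its own color
print(is_moving)                        # [False  True  True]
</code></pre>
<p>Because each return already carries that speed, no frame differencing is needed. With a classic LiDAR you would compare two consecutive frames, and at 10 Hz a car driving at 30 m/s has already moved 3 meters between them.</p>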
<p>Notice how the yellow dot changes to a green color as soon as the car moves, and how each static object stays orange, while moving objects get their own color. This is because the RADAR is really good at measuring velocities.</p><h3 id="3lidars-and-radars-are-complementary-and-still-need-eachother">3 - LiDARs and RADARs are complementary and still need each other</h3><p><strong>As a little summary, I&apos;d say that LiDARs are good</strong>, but most of the time need cameras for context, and at high speed, need RADARs. RADARs are great, but could NOT work as a standalone system. So let&apos;s do a quick overview:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/02/camera-lidar-radar--1-.png" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="2000" height="1169" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2023/02/camera-lidar-radar--1-.png 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2023/02/camera-lidar-radar--1-.png 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2023/02/camera-lidar-radar--1-.png 1600w, https://www.thinkautonomous.ai/blog/content/images/size/w2400/2023/02/camera-lidar-radar--1-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Camera vs LiDAR vs RADAR comparison</span></figcaption></figure><p>If you want to be green everywhere, you need to combine all 3. Use the camera for scene understanding, use the RADAR for weather conditions and velocity measurement, and the LiDAR for distance estimation.</p><p><strong>This brings a problem. </strong>A random startup must invest in 3 sensors, co-calibrate 3 sensors, train their team on all these sensor types, and the more sensors we use, the more confusion we risk introducing. You may wonder... can&apos;t we use just one? Or two?</p><p>Let&#x2019;s see how:</p><h2 id="fmcw-lidars-imaging-radars-the-future-of-perception">FMCW LiDARs &amp; Imaging RADARs: The Future of Perception</h2><p><strong>Back in January 2023, I was at CES in Las Vegas for the first time.</strong> It was a big show, really incredible, and while walking there, I met a startup named &apos;Aeva&apos;. Aeva is a LiDAR startup specialized in 4D technology. &quot;What&apos;s 4D?&quot; I asked. It turns out 4D meant that their LiDARs could do direct velocity estimation.</p><p><strong>The next day, I walked to a different area and stumbled across a Korean startup called bitsensing</strong>. &quot;Bitsensing is creating a 4D Imaging RADAR&quot; said the presenter. I was in shock. It was a normal RADAR, but providing an incredible resolution, with Z-elevation, an accurate 3D view, no noise, and still the Doppler velocity measurement.</p><p>It sounded like these startups were working on fixing the weaknesses of classical technologies.</p><p>Let me introduce them to you.</p><h3 id="1fmcw-lidar-frequency-modulated-continuous-wave-lidar-4d-lidar"><strong>1 - FMCW LiDAR (Frequency Modulated Continuous Wave LiDAR): 4D LiDAR</strong></h3><blockquote><em>An FMCW LiDAR (or 4D LiDAR, or Doppler LiDAR) is a LiDAR that can return the depth information, but also <u>directly measure the speed of an object</u>. What happens behind the scenes is that they borrow the RADAR Doppler technology and adapt it to a light sensor.</em></blockquote><p>Here&apos;s what the startup <strong>Aurora</strong> is doing on LiDARs... 
notice how moving objects are colored while others aren&apos;t:</p>
<!--kg-card-begin: html-->
<figure class="kg-card kg-image-card kg-card-hascaption">
<video class="lazy" style="max-width:100%" controls poster="https://www.thinkautonomous.ai/blog/content/images/2023/04/FMCWlidar.webp" preload="none" muted loop playsinline>
<source src="https://www.thinkautonomous.ai/blog/content/media/2023/04/FMCWlidar.mp4" type="video/mp4">
</video>
<figcaption><a href="https://www.aeva.com">Aeva&apos;s</a> FMCW LiDAR that can estimate velocities and predict trajectories (blue: approaching | red: receding)</figcaption>
</figure>
<!--kg-card-end: html-->
<p><strong>The FMCW LiDAR uses the Doppler Effect, similar to RADAR technology, to get this 4D view</strong>. The main idea can be seen in this image, where we play with the frequency of the returned wave to measure the velocity.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://io.dropinblog.com/uploaded/blogs/34241363/files/radar_11.png" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="1200" height="626"><figcaption><span style="white-space: pre-wrap;">If a wave is reflected at a higher frequency, the object is approaching. If lower, it&apos;s going away from us. (</span><a href="https://www.thinkautonomous.ai/blog/fmcw-lidar/"><span style="white-space: pre-wrap;">see it on the FMCW LiDAR post</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>The Doppler Effect is exactly about measuring this frequency shift.</strong> And this has now been adopted in FMCW LiDAR technology, but with light waves instead of radio waves. I highly recommend checking out my complete post called &quot;<a href="https://www.thinkautonomous.ai/blog/fmcw-lidar/" rel="noopener noreferrer">Understanding the magnificent FMCW LiDAR</a>&quot;.</p><h3 id="2imaging-radar-4d-radar">2 - Imaging RADAR: 4D RADAR</h3><p><strong>In 2024, Mobileye, which had been working on its own FMCW LiDAR for years, announced it would be shutting down its entire FMCW LiDAR division to focus on proprietary</strong> <strong>4D Imaging RADAR</strong>. What happened? Why the shift? Well, let&apos;s first try to understand what Imaging RADARs are. I like to call these...</p><blockquote class="kg-blockquote-alt"><strong>RADAR on steroids!</strong></blockquote><p>To better understand how it works, I&apos;d like to show you the bitsensing demo they showed me at CES.</p><h4 id="bitsensing-imaging-radar-demo"><strong>bitsensing Imaging RADAR Demo</strong></h4><p>The Imaging RADAR has an incredible resolution. It provides a very accurate point cloud, sees through adverse weather conditions, detects obstacles, AND measures velocity directly! Under the hood, it uses a set of MIMO antennas to get a much better resolution, range, and precision. We could even detect occupants inside a vehicle, and tell a child apart from an adult.</p><p>See the demo:</p>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/807852889?h=dfaf463bd4&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen title="3363091915"></iframe>
<!--kg-card-end: html-->
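<p>The reason both sensors can pull this off is that they share the same Doppler arithmetic, just at very different wavelengths. Here is a back-of-the-envelope sketch comparing a typical 77 GHz automotive RADAR with a 1550 nm FMCW LiDAR; the numbers are generic, not tied to any of the sensors shown here:</p>
<pre><code class="language-python"># Two-way Doppler shift of the echo: f_d = 2 * v / wavelength
C = 299_792_458.0                       # speed of light [m/s]

def doppler_shift_hz(radial_velocity_mps, wavelength_m):
    """Frequency shift of the returned wave for a target at the given radial speed."""
    return 2.0 * radial_velocity_mps / wavelength_m

radar_wavelength = C / 77e9             # 77 GHz RADAR, wavelength of about 3.9 mm
lidar_wavelength = 1550e-9              # 1550 nm FMCW LiDAR

v = 20.0                                # a car closing in at 20 m/s (72 km/h)
print(doppler_shift_hz(v, radar_wavelength))   # about 10.3 kHz
print(doppler_shift_hz(v, lidar_wavelength))   # about 25.8 MHz
</code></pre>
<p>Same equation, same velocity; the optical wavelength simply makes the measured shift roughly 2,500 times larger.</p>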
<p><strong>Can you notice how similar it looks to the FMCW LiDAR? We have in both cases:</strong></p><ul><li>A 3D Point Cloud</li><li>That can directly measure velocity</li></ul><h4 id="other-examples-from-self-driving-cars">Other Examples from Self-Driving Cars</h4><p>Frankly, many actors from the autonomous driving industry are switching to Imaging RADARs. Mobileye has a great demo, so does Waymo. Let&apos;s see these 2 examples.</p><p>Here&apos;s the Waymo Imaging RADAR Demo:</p>
<!--kg-card-begin: html-->
<figure class="kg-card kg-image-card kg-card-hascaption">
<video class="lazy" style="max-width:100%" controls poster="https://www.thinkautonomous.ai/blog/content/images/2023/04/ImagingRadar.webp" preload="none" muted loop playsinline>
<source src="https://www.thinkautonomous.ai/blog/content/media/2023/04/ImagingRadar.mp4" type="video/mp4">
</video>
<figcaption>View of the Waymo&apos;s Imaging RADAR (<a href="https://blog.waymo.com/2021/11/a-fog-blog.html?__s=xxxxxxx" rel="noopener noreferrer"><u>source</u></a>)</figcaption>
</figure>
<!--kg-card-end: html-->
<p>And now Mobileye:</p>
<!--kg-card-begin: html-->
<div class="yt-lite">
    <a class="yt-thumb" data-src="b3WSAYguMaY" target="_blank" rel="noopener noreferrer" href="https://www.youtube.com/watch?v=b3WSAYguMaY">
    <img src="https://i.ytimg.com/vi/b3WSAYguMaY/hqdefault.jpg" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy">
    <span class="yt-play" aria-hidden="true"></span>
    </a>
</div>
<!--kg-card-end: html-->
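<p>A quick word on where all that resolution comes from: imaging RADARs rely on MIMO antenna arrays, where every transmit/receive pair behaves like one element of a much larger virtual array. The arithmetic below is only an illustration with made-up antenna counts, not the configuration of the Waymo or Mobileye sensors:</p>
<pre><code class="language-python">import math

def virtual_channels(n_tx, n_rx):
    """A MIMO RADAR with n_tx transmitters and n_rx receivers acts like n_tx * n_rx virtual antennas."""
    return n_tx * n_rx

def azimuth_resolution_deg(n_virtual):
    """Rule-of-thumb beamwidth of a uniform, half-wavelength-spaced array: about 2/N radians."""
    return math.degrees(2.0 / n_virtual)

legacy  = virtual_channels(2, 4)      # classic automotive RADAR: 8 virtual channels
imaging = virtual_channels(12, 16)    # imaging RADAR: 192 virtual channels

print(round(azimuth_resolution_deg(legacy), 1))    # ~14.3 degrees: blobs, not shapes
print(round(azimuth_resolution_deg(imaging), 1))   # ~0.6 degrees: enough to outline a car
</code></pre>
<p>More virtual channels means a finer angular grid, which is how a sensor that used to output a handful of noisy points now produces the dense clouds you just saw.</p>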
<p>See? We are in the middle of a <u>transition</u>... but why are people using Imaging RADARs over FMCW LiDARs? And are they really moving away from LiDARs? Let&apos;s find out in the final point...</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text">Learning from theory is one thing, but opening a LiDAR teaches you more than any diagram ever could: from how the emitter &amp; receiver system works, to how raw points become 3D data.<b><strong style="white-space: pre-wrap;"> Watch how I literally opened a LiDAR </strong></b><a href="https://edgeneers.thinkautonomous.ai/posts/content-library-updates-slamtechs-rp-lidar-ungutting" rel="noopener noreferrer"><b><strong style="white-space: pre-wrap;">here</strong></b></a><b><strong style="white-space: pre-wrap;">.</strong></b></div></div><h2 id="lidars-vs-radars-the-modern-comparison">LiDARs vs RADARs: The Modern Comparison</h2><p>There are 2 ideas I&apos;d like to talk about here:</p><ol><li>The Future of RADARs IS Imaging based</li><li>The Future of LiDARs may NOT be FMCW based</li></ol><h3 id="1the-future-of-radars-is-imaging-based">1 - <strong>The Future of RADARs IS Imaging based</strong></h3><p><strong>We have clearly seen how a good RADAR system can bring incredible benefits</strong>. We can now do tasks like object detection using purely an imaging RADAR. Recently, we&apos;ve seen Deep Learning models, like the ones from <a href="https://www.perciv.ai" rel="noopener noreferrer"><strong>Perciv AI</strong></a>, work on RADAR data (radar signals, radar point clouds, radar waves, ...) directly.</p><p><strong>Back in the day, any comparison between a LiDAR and a RADAR didn&#x2019;t really make sense</strong> because the sensors were highly complementary. <u>But today, these sensors can be in competition</u>, and if there is one, Imaging RADARs are winning it! 
Here is the new comparison table:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/02/4e5fe9dc-673e-437e-b1dd-7b524857a8e4--1-.png" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="2000" height="1162" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2023/02/4e5fe9dc-673e-437e-b1dd-7b524857a8e4--1-.png 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2023/02/4e5fe9dc-673e-437e-b1dd-7b524857a8e4--1-.png 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2023/02/4e5fe9dc-673e-437e-b1dd-7b524857a8e4--1-.png 1600w, https://www.thinkautonomous.ai/blog/content/images/size/w2400/2023/02/4e5fe9dc-673e-437e-b1dd-7b524857a8e4--1-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Camera vs FMCW LiDAR vs Imaging RADAR &#x2014; blue: Improved, red: Worse</span></figcaption></figure><p><strong>We are BLUE almost everywhere, but the cost of Imaging RADAR stays lower than FMCW LiDARs.</strong> In addition to this, the Imaging RADAR can nicely fit under a bumper, since RADAR employs radio waves that go through objects.</p><p><strong>When looking at remote sensing technology, RADAR has always been a great choice</strong>: synthetic aperture radar systems in the military field, environmental monitoring using the radio frequency spectrum, and now the adoption in autonomous vehicles.</p><p><strong>In the self-driving space, RADARs were never good enough to be a standalone</strong>. Has someone ever told you you weren&apos;t good enough? Well, there is a lesson here, because you can see a massive adoption trend for Imaging RADARs - and I believe the future of RADARs is imaging.</p><h3 id="2the-future-of-lidars-is-not-fmcw-based">2 - <strong>The Future of LiDARs IS NOT FMCW based</strong></h3><p><strong>Now, here is the incredible discovery:</strong> <u>Nobody is abandoning LiDARs for FMCW LiDARs</u>. Self-driving car companies have NOT adopted FMCW LiDAR technology en masse (for now), and I predict they&apos;ll just stick to solid-state.</p><p><strong>Back in 2023, I went to Innoviz Technologies headquarters in Israel</strong>. Innoviz is a LiDAR manufacturer providing LiDAR devices to companies like BMW. I asked them: &quot;Why are you NOT building FMCW LiDARs?&quot; Their answer was that their LiDARs were good enough, and that there was no real need for FMCW. It really surprised me, but I guess they know what they&apos;re talking about. 
They could solve the drawbacks of LiDARs by building better LiDARs, for example here:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/lidars-evolution.jpg" class="kg-image" alt="LiDAR vs RADAR: How 4D Imaging RADARs and FMCW LiDARs disrupt the Autonomous Tech Industry" loading="lazy" width="1590" height="550" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/lidars-evolution.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/lidars-evolution.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/10/lidars-evolution.jpg 1590w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://innoviz.tech" rel="noreferrer"><span style="white-space: pre-wrap;">Innoviz Technologies</span></a><span style="white-space: pre-wrap;"> provides incredible resolution in their LiDAR sensors</span></figcaption></figure><p><strong>In many fields, LiDAR sensors are at the core.</strong> We have airborne lidar systems building elevation maps, we have HD Maps built entirely from LiDARs, and even drones equipped with LiDARs today... this technology is here to stay. Plus, today, EVERYONE uses LiDARs! No wait, that was wrong: Tesla doesn&apos;t, and a few others have bet on vision-only... but the majority of startups do, except that they use <u>BETTER LiDARs</u>. Not necessarily 4D, but LiDARs that provide better resolution, focusing more on solid-state technology.</p><p>This is the key message I have for you, and now that we&apos;ve seen it, let&apos;s go through a summary, and see some next steps.</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>LiDAR uses laser light to measure distances and create detailed 3D maps of objects and environments</strong>. Their key strength is distance estimation; their key weaknesses are velocity estimation and weather conditions.</li><li><strong>RADAR emits radio waves and measures their reflections </strong>to detect objects and calculate their speed, even in bad weather. Their key strength is velocity estimation; their key weaknesses are noise, lack of context, and 3D estimation (most are only 2D).</li><li><strong>Traditional setups combine LiDAR, RADAR, and cameras</strong> because each sensor complements the others&apos; strengths and weaknesses. It&apos;s nearly unthinkable to use one as a standalone.</li><li><strong>Recently, technologies like 4D FMCW LiDAR and Imaging RADAR have emerged</strong>, offering both high resolution and velocity measurement. FMCW LiDARs use the Doppler effect, and Imaging RADARs use more antennas.</li><li><strong>While I believe the future of RADAR is imaging</strong>, the future of LiDARs may be solid-state based, and not necessarily FMCW/4D based.</li></ul><h3 id="next-steps">Next Steps</h3><ul><li>Learn about the FMCW LiDAR <a href="https://www.thinkautonomous.ai/blog/fmcw-lidar" rel="noopener noreferrer">here</a>.</li><li>Learn about the Imaging RADAR <a href="https://www.thinkautonomous.ai/blog/imaging-radar/" rel="noreferrer">here</a>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text">If you want to learn more about LiDARs and cutting-edge technology, I&apos;m sending emails every day about these technologies, and they&apos;re read by over 10,000 Engineers. 
You should join the daily emails <a href="https://www.thinkautonomous.ai/private-emails" rel="noopener noreferrer">here</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[3 Insights from Autoware's Transition to End-To-End Learning with Samet Kütük]]></title><description><![CDATA[Autoware is transitioning to End-To-End Learning. When? And how exactly will this happen? This is what we'll find out this month, in this exclusive interview with Samet Kütük.]]></description><link>https://www.thinkautonomous.ai/blog/autoware-end-to-end/</link><guid isPermaLink="false">68f67f87bad329532556f144</guid><category><![CDATA[field interviews]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Thu, 23 Oct 2025 08:47:17 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-end-to-end.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-end-to-end.jpeg" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k"><p><strong>Did you ever wonder... why are self-driving cars taking so long to come?</strong> I had that question too when starting, and my first answer came from Sebastian Thrun, godfather of self-driving cars, who talked about reaching 90% of use cases easily, but then finding huge difficulty in going from <strong>90%</strong> to <strong>100%</strong>. Recently, Andrej Karpathy, former Lead of Tesla Autopilot, described something similar as &quot;the march of 9s&quot;:</p><blockquote>&quot;When you get a demo and something works 90% of the time, that&apos;s just the first 9 and then you need the second 9 and third 9, fourth 9, fifth 9...&quot;</blockquote><p><strong>This is what&apos;s taking so long, but instead of focusing on this, tons of companies lose time focusing on the first 0-90%. </strong>Back in 2017 or so, we were all trying to get to 90%, and for this, we were re-developing all the software, algorithms, and so on... At some point, probably 30 startups were all spending millions developing the exact same algorithms.</p><p><strong>This is where Autoware comes into play</strong>. Started by <a href="https://tier4.jp/en/" rel="noreferrer">Tier IV</a>, Autoware is an open-source self-driving car software that allows you to achieve the first <strong>9</strong> in just a few weeks. Rather than re-developing yet another version of the same code, you <em>jumpstart</em> from the existing state of the art, and finetune it for your needs.</p><p><strong>This month, our membership </strong><a href="https://www.thinkautonomous.ai/the-edgeneers-land" rel="noreferrer"><strong>The Edgeneer&apos;s Land</strong></a><strong> is welcoming Samet K&#xFC;t&#xFC;k from </strong><a href="https://www.autoware.org" rel="noreferrer"><strong>The Autoware Foundation</strong></a><strong>. </strong></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2712;&#xFE0F;</div><div class="kg-callout-text">Samet is currently the Community Advocate and Head of Marketing at the Autoware Foundation. Before that, Samet co-founded a company in Istanbul called <a href="https://www.leodrive.ai/" rel="noreferrer">Leo Drive</a>, where he worked for a decade on implementing Autoware in various vehicle platforms, including retrofitting a Volkswagen Golf for autonomous operation. 
<br><br><b><strong style="white-space: pre-wrap;">Now based in Zurich, he is fully dedicated to the Autoware Foundation</strong></b>, focusing on marketing, member recruitment, and participating in technical workgroups, particularly in software-defined vehicles and cloud-native development.</div></div><p>And let me start with a small snippet about how he defines Autoware:</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet1d_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet1d.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet1d_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">1:20</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p>Together, we recorded a new Fragment of <a href="https://www.thinkautonomous.ai/the-edgeneers-land" rel="noreferrer"><strong>The Edgeneer&apos;s Land</strong></a>, my community membership experience, in which he takes us through the building and management of Autoware.<strong> </strong>How does it work? How do you build a self-driving car with a fully remote team? This is everything Samet teaches in our new fragment...</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F50F;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interested in getting access? </strong></b>Apply here <u>before October 25, 2025</u> (use the direct link if you&apos;re already a pre-approved client).</div></div><p>In this post, I&apos;d like to give you a small sample of that interview, highlighting a very interesting moment where Samet talked about End-To-End learning...</p><hr><h2 id="3-insights-from-autowares-end-to-end-learning-transition">3 insights from Autoware&apos;s End-To-End Learning Transition</h2><p><strong>Since its creation, Autoware has been implementing a &quot;robotic&quot; architecture,</strong> meaning the traditional &quot;4 pillars&quot;: Perception &#x2192; Localization &#x2192; Planning &#x2192; Control.</p><p><strong>But recently, Autoware announced a new plan to evolve to an End-To-End architecture,</strong> a single neural network that takes in the sensor data and directly outputs the steering angle and acceleration values. I have a complete article explaining the differences with detailed examples <a href="https://www.thinkautonomous.ai/blog/autonomous-vehicle-architecture/" rel="noreferrer">here</a>.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/blog/autonomous-vehicle-architecture/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">4 Pillars vs End To End: How to pick an autonomous vehicle architecture</div><div class="kg-bookmark-description">How to design an autonomous vehicle architecture? Should you implement an End-To-End solution, or a more traditional one? Let&#x2019;s see&#x2026;</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.thinkautonomous.ai/blog/content/images/size/w256h256/2023/01/favicon.png" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k"><span class="kg-bookmark-author">Read from the most advanced autonomous tech blog</span><span class="kg-bookmark-publisher">Jeremy Cohen</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/09/autonomous-vehicle-architecture--1-.webp" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k"></div></a></figure><p>So here is the sample I&apos;d like to share:</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet2v3_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet2v3.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;https://www.thinkautonomous.ai/blog/content/media/2025/10/TAmember_Autoware_snippet2v3_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">1:16</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p><strong>As you can see, there is a lot to uncover from just one minute. Let me share 3 highlights from that:</strong></p><ol><li>Level 5 is NOT easy to reach, and may not even be possible, which is why Autoware focuses on Level 4+, in which humans aren&apos;t asked to take over the car, but could still be asked to drive in regions or conditions that aren&apos;t appropriate for autonomy. </li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-level-4-.jpg" class="kg-image" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k" loading="lazy" width="1182" height="738" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/autoware-level-4-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/autoware-level-4-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-level-4-.jpg 1182w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Autoware doesn&apos;t claim to reach Level 5, but a very good Level 4+</span></figcaption></figure><ol start="2"><li>The Transition will NOT be achieved immediately, but rather be the result of several steps:<ol><li><strong>Current &#x2014;</strong>&#xA0;Starting from a traditional Robotic stack</li><li><strong>Step 1 &#x2014;</strong>&#xA0;Learned Planning</li><li><strong>Step 2 &#x2014;&#xA0;</strong>Deep Perception &amp; Learned Planning</li><li><strong>Step 3 &#x2014;&#xA0;</strong>Monolithic End-to-End (single network)</li><li><strong>Step 4 </strong>&#x2014; Hybrid End-To-End using a &quot;guardian&quot; for redundancy</li></ol></li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-e2e.gif" class="kg-image" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k" loading="lazy" width="1080" height="608" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/autoware-e2e.gif 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/autoware-e2e.gif 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-e2e.gif 1080w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Autoware&apos;s 4 Step Transition to End-To-End Learning</span></figcaption></figure><p>This can feel similar to how Tesla did their own transition to End-To-End (which I cover in the article below):</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/blog/tesla-end-to-end-deep-learning/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Breakdown: How Tesla will transition from Modular to End-To-End Deep Learning</div><div class="kg-bookmark-description">It&#x2019;s no secret, Tesla is going to use End-To-End Deep Learning. But how? What will it look like? Will the Occupancy Network and HydraNet stay? 
Here&#x2019;s a full breakdown&#x2026;</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.thinkautonomous.ai/blog/content/images/size/w256h256/2023/01/favicon.png" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k"><span class="kg-bookmark-author">Read from the most advanced autonomous tech blog</span><span class="kg-bookmark-publisher">Jeremy Cohen</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/09/tesla-end-to-end.png" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k"></div></a></figure><p>The difference between Modular and Monolithic End-To-End is explained in their <a href="http://github.com/tier4/new_planning_framework/wiki" rel="noreferrer">GitHub repository</a> about the new planning framework:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://media1-production-mightynetworks.imgix.net/asset/f284a2ad-0c52-4342-8e80-1b60665c524d/70581e7ae5e4e7d8.png?ixlib=rails-4.3.1&amp;fm=jpg&amp;q=75&amp;auto=format&amp;w=4096&amp;h=4096&amp;fit=max&amp;impolicy=ResizeCrop&amp;aspect=fit" class="kg-image" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k" loading="lazy" width="1286" height="496"><figcaption><span style="white-space: pre-wrap;">Modular versus Monolithic End-To-End. Originally, everybody tried monolithic, then reverted to modular, and is now trying monolithic again with safety guardians.</span></figcaption></figure><p>Alright, let&apos;s continue with a third and final idea:</p><ol start="3"><li><strong>The algorithms for End-To-End have already been built.</strong></li></ol><p>We are not talking about a distant future: according to Autoware, it&apos;s possible to achieve End-To-End with today&apos;s algorithms, including (but not limited to) <a href="https://autowarefoundation.github.io/autoware_universe/main/perception/autoware_lidar_centerpoint/" rel="noreferrer"><strong>CenterPoint</strong></a> as the 3D Deep Learning algorithm for LiDAR Detection, <a href="https://github.com/autowarefoundation/autoware.privately-owned-vehicles/tree/main/AutoSeg" rel="noreferrer"><strong>AutoSeg</strong></a> as the Foundation Model in Perception, <strong>AutoSteer</strong> and <a href="https://github.com/ZhengYinan-AIR/Diffusion-Planner" rel="noreferrer"><strong>Diffusion Planner</strong></a> for the Learned Planning approaches.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-e2e-.jpg" class="kg-image" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k" loading="lazy" width="2000" height="947" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/autoware-e2e-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/autoware-e2e-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/10/autoware-e2e-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/10/autoware-e2e-.jpg 2176w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Autoware Modular End-To-End Architecture will feature these 4 core algorithms</span></figcaption></figure><p>See? 
They are already there, and even though they may evolve and get replaced 2, 3, or 5 years from now, the <strong><em>logic</em></strong> of Modular End-To-End (Step 2) has been implemented.</p><p>For example, the <strong>AutoSeg</strong> algorithm is a &quot;<a href="https://www.thinkautonomous.ai/blog/how-tesla-autopilot-works/" rel="noreferrer">HydraNet</a>&quot; that has a single backbone that splits into several heads for lane lines, ego path, free space, segmentation, objects, and 3D. The outputs of these heads are then passed to the deep planner.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/AutoSeg--1-.jpg" class="kg-image" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/10/AutoSeg--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/10/AutoSeg--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/10/AutoSeg--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/10/AutoSeg--1-.jpg 1920w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A look at AutoSeg, the HydraNet used by Autoware</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F500;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interested in End-To-End?</strong></b> Autoware has published a detailed PDF about their transition to End-To-End; you can download it on <a href="https://autoware.org/" rel="noreferrer">this page</a>.</div></div><h3 id="summary-next-steps">Summary &amp; Next Steps</h3><ul><li><strong>Autoware is an open source self-driving car organization</strong> that builds self-driving car software used all over the world by thousands of engineers and teams</li><li><strong>Autoware is the solution I recommend</strong> to get started in self-driving cars; rather than building software from scratch, get Autoware working quickly, and then finetune and customize it for your applications.</li><li><strong>Autoware is transitioning</strong> from a robotic architecture to an End-To-End Learning architecture, and there are 3 highlights from it:<ul><li>It won&apos;t reach Level 5, but a <strong>Level 4+</strong> that can drive almost anywhere</li><li>The transition will happen in <strong>4 steps</strong>, adding planning, perception, then turning into a monolithic architecture, and finally hybrid.</li><li>The algorithms and modular logic have already been implemented and are working, such as <strong>CenterPoint</strong>, <strong>AutoSeg</strong>, or <strong>Diffusion</strong> <strong>Planner</strong>.</li></ul></li></ul><h3 id="next-steps">Next steps</h3><p><strong>Interested in getting access to our Autoware Fragment? </strong>It&apos;s going to be very cool, featuring several things, such as:</p><ul><li>The Full-Length interview with Samet on Autoware</li><li>An even deeper dive on Autoware&apos;s End-To-End Transition (this was just a 1-minute video; in the Fragment, we go through the full section on End-To-End).</li><li>A complete breakdown on many algorithms used by Autoware, and a near plug &amp; play solution to start running Autoware&apos;s software on your computer by tonight</li></ul><p>Interested? 
<div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F500;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interested in End-To-End?</strong></b> Autoware has published a detailed PDF about their transition to End-To-End; you can download it on <a href="https://autoware.org/" rel="noreferrer">this page</a>.</div></div><h3 id="summary-next-steps">Summary &amp; Next Steps</h3><ul><li><strong>Autoware is an open source self-driving car organization</strong> that builds self-driving car software used all over the world by thousands of engineers and teams</li><li><strong>Autoware is the solution I recommend</strong> to get started in self-driving cars; rather than building software from scratch, get Autoware working quickly, and then fine-tune and customize it for your applications.</li><li><strong>Autoware is transitioning</strong> from a robotic architecture to an End-To-End Learning architecture, and there are 3 highlights from it:<ul><li>It won&apos;t reach Level 5, but a <strong>Level 4+</strong> that can drive almost anywhere</li><li>The transition will happen in <strong>4 steps</strong>, adding planning, perception, then turning into a monolithic architecture, and finally hybrid.</li><li>The algorithms and modular logic have already been implemented and are working, such as <strong>CenterPoint</strong>, <strong>AutoSeg</strong>, or <strong>Diffusion</strong> <strong>Planner</strong>.</li></ul></li></ul><h3 id="next-steps">Next steps</h3><p><strong>Interested in getting access to our Autoware Fragment? </strong>It&apos;s going to be very cool and feature several things, such as:</p><ul><li>The Full-Length interview with Samet on Autoware</li><li>An even deeper dive into Autoware&apos;s End-To-End Transition (this post only included a 1-minute clip; the Fragment covers the full End-To-End section).</li><li>A complete breakdown of many algorithms used by Autoware, and a near plug &amp; play solution to start running Autoware&apos;s software on your computer by tonight</li></ul><p>Interested? Apply to the Edgeneer&apos;s Land at the button below <u>before October 25, 2025</u> (use the direct link if you&apos;re already a pre-approved client).</p><div class="kg-card kg-product-card">
            <div class="kg-product-card-container">
                <img src="https://www.thinkautonomous.ai/blog/content/images/2025/10/ChatGPT-Image-21-oct.-2025--17_21_08--1-.jpg" width="1024" height="1024" class="kg-product-card-image" loading="lazy" alt="3 Insights from Autoware&apos;s Transition to End-To-End Learning with Samet K&#xFC;t&#xFC;k">
                <div class="kg-product-card-title-container">
                    <h4 class="kg-product-card-title"><span style="white-space: pre-wrap;">Fragment #13: Open Source AV Secrets</span></h4>
                </div>
                
                    <div class="kg-product-card-rating">
                        <span class="kg-product-card-rating-active kg-product-card-rating-star"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M12.729,1.2l3.346,6.629,6.44.638a.805.805,0,0,1,.5,1.374l-5.3,5.253,1.965,7.138a.813.813,0,0,1-1.151.935L12,19.934,5.48,23.163a.813.813,0,0,1-1.151-.935L6.294,15.09.99,9.837a.805.805,0,0,1,.5-1.374l6.44-.638L11.271,1.2A.819.819,0,0,1,12.729,1.2Z"/></svg></span>
                        <span class="kg-product-card-rating-active kg-product-card-rating-star"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M12.729,1.2l3.346,6.629,6.44.638a.805.805,0,0,1,.5,1.374l-5.3,5.253,1.965,7.138a.813.813,0,0,1-1.151.935L12,19.934,5.48,23.163a.813.813,0,0,1-1.151-.935L6.294,15.09.99,9.837a.805.805,0,0,1,.5-1.374l6.44-.638L11.271,1.2A.819.819,0,0,1,12.729,1.2Z"/></svg></span>
                        <span class="kg-product-card-rating-active kg-product-card-rating-star"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M12.729,1.2l3.346,6.629,6.44.638a.805.805,0,0,1,.5,1.374l-5.3,5.253,1.965,7.138a.813.813,0,0,1-1.151.935L12,19.934,5.48,23.163a.813.813,0,0,1-1.151-.935L6.294,15.09.99,9.837a.805.805,0,0,1,.5-1.374l6.44-.638L11.271,1.2A.819.819,0,0,1,12.729,1.2Z"/></svg></span>
                        <span class="kg-product-card-rating-active kg-product-card-rating-star"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M12.729,1.2l3.346,6.629,6.44.638a.805.805,0,0,1,.5,1.374l-5.3,5.253,1.965,7.138a.813.813,0,0,1-1.151.935L12,19.934,5.48,23.163a.813.813,0,0,1-1.151-.935L6.294,15.09.99,9.837a.805.805,0,0,1,.5-1.374l6.44-.638L11.271,1.2A.819.819,0,0,1,12.729,1.2Z"/></svg></span>
                        <span class="kg-product-card-rating-active kg-product-card-rating-star"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M12.729,1.2l3.346,6.629,6.44.638a.805.805,0,0,1,.5,1.374l-5.3,5.253,1.965,7.138a.813.813,0,0,1-1.151.935L12,19.934,5.48,23.163a.813.813,0,0,1-1.151-.935L6.294,15.09.99,9.837a.805.805,0,0,1,.5-1.374l6.44-.638L11.271,1.2A.819.819,0,0,1,12.729,1.2Z"/></svg></span>
                    </div>
                

                <div class="kg-product-card-description"><p><span style="white-space: pre-wrap;">Exclusive Interview with Autoware - Autoware E2E Transition Full Breakdown - Autoware Platform Run &amp; Algorithm Dive</span></p></div>
                
                    <a href="https://www.thinkautonomous.ai/fragment-13" class="kg-product-card-button kg-product-card-btn-accent" target="_blank" rel="noopener noreferrer"><span>Learn More about Fragment 13</span></a>
                
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know]]></title><description><![CDATA[Let's reveal it all: What are point clouds? What are 3 Ways to create them? How to process them? How do we detect 3D objects inside a point cloud?]]></description><link>https://www.thinkautonomous.ai/blog/point-clouds/</link><guid isPermaLink="false">640f9074fa7e0be47b3d9d33</guid><category><![CDATA[lidar]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Mon, 29 Sep 2025 15:40:00 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/09/point-clouds.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/point-clouds.jpg" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know"><p><strong>In September 1519, an expedition of five ships and 270 men,</strong> led by Ferdinand Magellan, left Spain to reach the Spice Islands by sailing west. At the time, maps were crude sketches, full of blank spaces, and sometimes decorated with warnings: <em>&#x201C;Here be dragons.&#x201D;</em> Yet Magellan pressed on, steering his fleet into the unknown, through storms and across oceans no European had crossed before.</p><p><strong>The challenge was harsher than anyone had imagined</strong>. Supplies ran out, men starved, and mutiny spread. One ship deserted, another wrecked. After nearly two years, Magellan reached the Philippines, where he was killed in the Battle of Mactan. His fleet, once five strong, was reduced to four&#x2026; then three&#x2026; then two.</p><p><strong>3 years later, only one ship returned to Spain.</strong> The Victoria carried just 18 survivors, but also one of the greatest accomplishments of the time. For the first time, humanity had proof that the Earth could be circumnavigated by sea, a discovery that forever reshaped navigation, trade, and commerce.</p><p><strong>For centuries, people believed the old maps. </strong>They trusted the flat drawings, the empty warnings, the <em>&#x201C;here be dragons&#x201D;</em>. All it took was one expedition to open a new world nobody could see. And today, I believe Computer Vision Engineers live in a similar situation.</p><p><strong>The world provides Computer Vision algorithms</strong>, image processing techniques, 2D object detectors, and segmentation approaches... yet, the world is a sphere, in 3D. And this is why, I think something of much greater importance should be mastered by Computer Vision and ALL robotics/autonomous tech engineers: <strong>Point Clouds</strong>.</p><p><strong>The goal of a point cloud is to create a 3D model</strong>. 3D points are a data representation used today in autonomous vehicles, robotics, AR/VR, and even in everyday objects like unlocking your phone with Face ID.</p><p>So what are point clouds? How do you get them? And how do you process them using AI? These are the 3 things I think most perception engineers should know, that we&apos;ll cover in this article.</p><p>Let&apos;s begin:</p><h2 id="9-examples-of-point-cloud-data">9 Examples of Point Cloud Data</h2><p>A Point Cloud&quot; is a set of points in 3D space &#x2014; a cloud of points. Inside, each point holds the 3D location of a surface in the real world. It can be a person, a wall, a tree, anything. You probably know what a point cloud looks like already, but you may not know the multiple types of point clouds... 
So let me introduce you to 9 of them!</p><h3 id="xyz-point-clouds">XYZ Point Clouds</h3><p><strong>In an XYZ point cloud, each point has a specific X, Y, and Z value</strong>. You could think of it as the equivalent of a pixel, but in 3D. Rather than just X and Y, we have X, Y, and Z (in most cases, because some point clouds are 2D, see this article).</p><p>Here&apos;s an example:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/xyz-point-cloud.png" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1528" height="842" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/xyz-point-cloud.png 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/xyz-point-cloud.png 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/xyz-point-cloud.png 1528w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">In this very basic point cloud, each point contains the X, Y, Z information</span></figcaption></figure><p>See? Each point has an XYZ value. But why are the colors different? Simply here because our visualizer is a gradient based on the height of the point (the Z dimension). The higher the Z value, the more red it&apos;ll be. On the above Waymo video, you could see a different visualization, based on the distance to the vehicle. So this is one type:</p><ul><li>Point clouds can contain the XYZ information</li></ul><p>Next:</p><h3 id="xyz-i-point-clouds">XYZ-I Point Clouds</h3><p>Now, this is just an example, but let me show you something else...</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-29-at-14.35.46--1-.jpg" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1510" height="862" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-29-at-14.35.46--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-29-at-14.35.46--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-29-at-14.35.46--1-.jpg 1510w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Intensity is computed by almost every modern LiDARs and can help process the cloud</span></figcaption></figure><p>This is another point cloud, but what do you notice about the colors? Yes, two things:</p><ul><li>It&apos;s all &quot;RED&quot;</li><li>But not all points have exactly the same &quot;red&quot; value. Some are brighter than others</li></ul><p>And this is because here, we are no longer visualizing the distance, but the &quot;intensity&quot; of the points. Point Clouds are often produced by LiDARs that send a ray and measure the time it takes to bounce back. This calculation measures the distance, but not all rays come back equal. 
Some are absorbed or scattered by trees, leaves, or dark surfaces, while others bounce back cleanly.</p><p>So, we now know another attribute of a point cloud:</p><ul><li>Point clouds can contain the XYZ information</li><li>Point clouds can also hold the intensity information!</li></ul><p>Any other?</p><h3 id="xyz-v-point-clouds">XYZ-V Point Clouds</h3><p>Now, let&apos;s take it one step further, and look at this video:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/ezgif.com-resize--1-.webp" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="560" height="315"><figcaption><span style="white-space: pre-wrap;">XYZ-Velocity point clouds are usually produced by FMCW LiDARs</span></figcaption></figure><p><strong>Okay, can you explain what is happening here? </strong>Everything is grey except the vehicles. So, is that... Class? Labels? Or, wait a minute, why are the forward vehicles in red, the parked cars in grey, and the left approaching vehicles in blue? That&apos;s because this visualization shows not the class but the velocity information!</p><p>This video was made by <a href="https://www.aeva.com" rel="noreferrer">Aeva</a>, an <a href="https://www.thinkautonomous.ai/blog/fmcw-lidar/" rel="noopener noreferrer">FMCW LiDAR</a> producer &#x2014;&#xA0;and inside, you can see that receding points are in red, and approaching points are in blue. We now know a third possibility!</p><ul><li>Point Clouds can contain XYZ</li><li>Or XYZ-Intensity</li><li>Or XYZ-Velocity</li></ul><h3 id="lets-see-9-types-of-point-clouds">Let&apos;s see 9 types of Point Clouds</h3><p>Are there any more than intensity or velocity? Yes, in fact - each point can contain a lot of information. 
Let&apos;s see:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/point-cloud-visualization.001.jpeg" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/point-cloud-visualization.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/point-cloud-visualization.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/point-cloud-visualization.001.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/point-cloud-visualization.001.jpeg 1920w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">How many of these did you know?</span></figcaption></figure><ul><li><strong>Intensity</strong> - how strong the return signal of each point is</li><li><strong>Range</strong> - the distance of the point, computed from X, Y, and Z</li><li><strong>Color</strong> - the RGB color of the points (often for RGB-D cameras or 3D reconstruction)</li><li><strong>Class/Label </strong>- added after an object detector or segmentation tool has processed the cloud</li><li><strong>Infrared</strong> - the wavelength of the point cloud signal</li><li><strong>Ring/Channel</strong> - which laser channel of the sensor collected the point</li><li><strong>Velocity</strong> - the speed of each point (calculated by RADARs or FMCW LiDARs)</li><li><strong>Reflectivity</strong> - how reflective the surface of the point is</li><li><strong>Temperature</strong> - how hot a point is</li></ul><p>Okay, but concretely, how does it work? Is there a TXT file where we store the points? Kinda, let&apos;s take a look...</p><h3 id="point-cloud-formats-files">Point Cloud Formats &amp; Files</h3><p>There are usually two types of files: ASCII and Binary. One is easier to read, the other is more suited to real-time/embedded. Take a look at the beginning of both files:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2023/03/f566c5c6-da39-4ff7-8587-6f271a8bd981.png" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1726" height="730" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2023/03/f566c5c6-da39-4ff7-8587-6f271a8bd981.png 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2023/03/f566c5c6-da39-4ff7-8587-6f271a8bd981.png 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2023/03/f566c5c6-da39-4ff7-8587-6f271a8bd981.png 1600w, https://www.thinkautonomous.ai/blog/content/images/2023/03/f566c5c6-da39-4ff7-8587-6f271a8bd981.png 1726w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Two types of files: ASCII and Binary</span></figcaption></figure><p><strong>See? On the left, the PLY file contains X, Y, Z as floats, followed by a list of point coordinates.</strong> This is the point cloud! On the right, you can see a header describing the point cloud, in format XYZ-Intensity, and then the points themselves, which are not human-readable.</p>
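<p>If you want to poke at these files yourself, here is a minimal Python sketch. The file names and the XYZ-Intensity binary layout are assumptions made for the example; real formats (PLY, PCD, LAS, KITTI .bin) each define their own headers.</p><pre><code class="language-python"># Minimal sketch: loading an ASCII vs a binary point cloud with NumPy.
# "cloud.txt" and "cloud.bin" are hypothetical files; the binary one is assumed
# to be a flat list of float32 values laid out as x, y, z, intensity per point
# (a KITTI-style convention, not a universal rule).
import numpy as np

# ASCII: one point per line, e.g. "1.23 4.56 0.78 0.21"
ascii_points = np.loadtxt("cloud.txt", dtype=np.float32)   # (N, 4) if 4 columns

# Binary: read the raw floats, then reshape into rows of 4 values
binary_points = np.fromfile("cloud.bin", dtype=np.float32).reshape(-1, 4)

xyz = binary_points[:, :3]        # the 3D coordinates
intensity = binary_points[:, 3]   # the extra per-point attribute
print(ascii_points.shape, xyz.shape, intensity.min(), intensity.max())
</code></pre>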
<h2 id="2-main-ways-to-create-a-point-cloud">2 Main Ways to create a point cloud</h2><p><strong>There are basically 2 types of approaches, <u>active</u> and <u>passive</u></strong>. Active techniques actively emit signals like light or sound to measure distances and create point clouds, such as LiDAR and structured light systems. In contrast, passive techniques rely on capturing existing environmental data, like photogrammetry, which reconstructs 3D points from multiple camera images without emitting any signals.</p><h3 id="active-techniques-lidars-rgb-d-radars">Active Techniques: LiDARs, RGB-D &amp; RADARs</h3><p>In the first case, point clouds come from sensors built to create them. When a camera takes a picture, it aims to get pixels. Well, when a LiDAR makes a measurement, its aim is to create a point cloud. Let&apos;s see 3 ways to do it:</p><h4 id="1-how-to-get-point-clouds-using-structured-light-rgb-d-systems">1) How to get point clouds using Structured Light RGB-D systems</h4><p><strong>Ever played with the Microsoft Kinect? I can&apos;t say that I have. </strong>I was a Wii player all the way when they were competing. Yet, I&apos;ve always been impressed by how the Kinect produced point clouds using its RGB-D camera, working with the <strong><u>Structured Light Principle.</u></strong></p><p><strong>The Kinect shines a special pattern of light around,</strong> then uses an infrared camera to take a picture of how that light bounces back. By seeing how the pattern changes, the camera can figure out how far away things are. It combines this distance information with the colors it sees to create a 3D image: the final point cloud.</p><p>In robotics, you probably know the Intel Realsense 435i, or other equivalents. Their goal is to build a Depth Map, which is then turned into a point cloud.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-29-at-15.49.43--1-.jpg" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1134" height="744" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-29-at-15.49.43--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-29-at-15.49.43--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-29-at-15.49.43--1-.jpg 1134w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">What an RGB-D camera produces</span></figcaption></figure><h4 id="2-how-lidar-point-clouds-are-produced">2) How LiDAR point clouds are produced</h4><p><strong>The most common and popular technique is to use a LiDAR (Light Detection And Ranging). </strong>There are many types of LiDARs around, but let&apos;s focus on the simple <u>Time-Of-Flight principle</u>. In this setup, a laser scanner sends a light beam and measures the time it takes to reflect and come back to the receiver. Similar to this image:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/ChatGPT-Image-29-sept.-2025--17_14_09--1-.jpg" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1536" height="1024" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/ChatGPT-Image-29-sept.-2025--17_14_09--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/ChatGPT-Image-29-sept.-2025--17_14_09--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/ChatGPT-Image-29-sept.-2025--17_14_09--1-.jpg 1536w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">LiDAR scanners send a wave and measure the time it takes to come back</span></figcaption></figure>
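<p>In numbers, the math behind a single Time-Of-Flight return is tiny: the pulse travels to the surface and back at the speed of light, so the distance is half of (speed of light &#xD7; round-trip time). A minimal sketch, with made-up round-trip times:</p><pre><code class="language-python"># Minimal Time-Of-Flight sketch: convert round-trip pulse times into distances.
# The times below are made-up values, not real sensor output.
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def tof_distance(round_trip_seconds):
    # The pulse travels to the target AND back, hence the division by 2
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# Round trips of roughly 67 ns, 334 ns and 2 us, i.e. about 10 m, 50 m, 300 m
for t in (66.7e-9, 333.6e-9, 2.0e-6):
    print(f"{t:.1e} s  ->  {tof_distance(t):7.2f} m")
</code></pre>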
<p><strong>LiDAR scanners produce raw data of the world up to 300-400 meters away in the automotive industry.</strong> Each scan can generate millions of points in three-dimensional space. I highly recommend checking out my article on the <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noopener noreferrer">types of LiDARs</a> to learn more.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">WAIT</strong></b>! This blog post doesn&apos;t have to be the only thing you read from me. I post daily through<a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer"> my daily emails</a>, and I talk about LiDARs, Computer Vision, and more cutting-edge AI Applications.<a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer"> You can join my emails here.</a></div></div><h4 id="3-radar-point-clouds">3) RADAR Point Clouds</h4><p>The third technique is to use not a LiDAR but a <a href="https://www.thinkautonomous.ai/blog/how-radars-work/" rel="noopener noreferrer">RADAR</a> to create the point cloud data. This is not very straightforward to do. RADARs usually return signal information based on Doppler (velocity), Range (distance), and Azimuth (direction/angle). Using these, we can do some calculations to retrieve the point cloud data.</p><p>Here is an example on a very low-quality RADAR:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/unnamed.gif" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="586" height="247"><figcaption><span style="white-space: pre-wrap;">How RADAR heatmaps get converted to point clouds</span></figcaption></figure><p>Today, we can use Imaging RADARs to get 3D point clouds. I invite you to <a href="https://www.thinkautonomous.ai/blog/imaging-radar/" rel="noopener noreferrer">check out my Imaging RADAR article to learn more about it.</a></p>
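<p>For the curious, the geometry that turns a RADAR return into a 3D point is mostly a spherical-to-Cartesian conversion. Here is a minimal sketch; the detections are made-up numbers, and a real pipeline would first have to extract them from the raw heatmap:</p><pre><code class="language-python"># Minimal sketch: turning RADAR returns (range, azimuth, elevation, Doppler)
# into XYZ-Velocity points. The detections below are made-up numbers.
import numpy as np

# One row per detection: range (m), azimuth (rad), elevation (rad), doppler (m/s)
detections = np.array([
    [12.0,  0.10, 0.02, -3.5],
    [35.0, -0.25, 0.00,  0.0],
    [80.0,  0.05, 0.01, 12.1],
])

rng, azimuth, elevation, doppler = detections.T
x = rng * np.cos(elevation) * np.cos(azimuth)   # forward
y = rng * np.cos(elevation) * np.sin(azimuth)   # left / right
z = rng * np.sin(elevation)                     # up
points_xyzv = np.stack([x, y, z, doppler], axis=1)
print(points_xyzv.round(2))
</code></pre>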
<p>Now that we&apos;ve seen the Active ways, using dedicated sensors, I&apos;d like to take a minute to talk about the passive ways.</p><h3 id="passive-point-clouds-generation-photogrammetry-3d-reconstruction">Passive Point Clouds Generation: Photogrammetry &amp; 3D Reconstruction</h3><p><strong>The idea of passive techniques is that your sensors do not emit anything to measure depth. </strong>The main way to do this is by leveraging 3D Reconstruction. Ideas like Structure From Motion, Multi-View Stereo, NeRFs, Gaussian Splatting, or others are used.</p><p>The idea? To convert 2 or more images to a 3D point cloud using triangulation, geometry, and depth maps.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/unnamed.jpg" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="1600" height="756" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/unnamed.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/unnamed.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/unnamed.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Stereo Vision is a powerful technique to retrieve 3D models</span></figcaption></figure><p>If you&apos;re interested in this, I highly recommend reading my 3-article series on the <a href="https://pyimagesearch.com/2024/10/14/photogrammetry-explained-from-multi-view-stereo-to-structure-from-motion/" rel="noopener noreferrer">PyImageSearch blog</a>, or my article on Pseudo-LiDARs.</p><p>Alright, so you now know all about the point cloud types, and the ways to get them. One thing remains...</p><h2 id="how-to-process-point-cloud-data">How to Process Point Cloud Data?</h2><p>Do you remember in the point cloud types when I showed the &quot;label/class&quot; of each point? This is not something sensors can measure; it&apos;s built by algorithms. There are 3 things that really matter here:</p><ol><li>Understanding the main libraries/tools to work with</li><li>Understanding the core algorithms to use on raw point cloud data</li><li>Being able to use them in the applications</li></ol><h3 id="libraries-open3d-and-point-cloud-library-pcl">Libraries: Open3D and Point Cloud Library (PCL)</h3><p>There are many libraries used to process point clouds. These implement the algorithms. For example, the Point Cloud Library is one of the most popular to work with. Open3D is also a very common one; it contains fewer algorithms, but is easier to use thanks to its Python interface. I would recommend getting started with this one.</p>
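<p>To give you a feel for the Open3D side, here is a minimal sketch that loads a cloud, downsamples it, and segments the ground plane with RANSAC. The file name is hypothetical and the thresholds are just reasonable defaults, assuming you have installed the library with <code>pip install open3d</code>:</p><pre><code class="language-python"># Minimal Open3D sketch: load a cloud, thin it out, and split ground vs obstacles.
# "scene.pcd" is a hypothetical file; PLY, PCD, XYZ files all work the same way.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")
pcd = pcd.voxel_down_sample(voxel_size=0.1)   # keep roughly one point per 10 cm voxel

# RANSAC plane fit: returns the plane coefficients and the inlier indices
plane_model, inliers = pcd.segment_plane(distance_threshold=0.2,
                                         ransac_n=3,
                                         num_iterations=1000)
ground = pcd.select_by_index(inliers)
obstacles = pcd.select_by_index(inliers, invert=True)

print("plane:", plane_model, "| ground points:", len(ground.points))
o3d.visualization.draw_geometries([obstacles])   # everything but the ground
</code></pre>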
<p>On a similar topic, you may want to know at least one point cloud dataset. I would recommend you <a href="https://www.thinkautonomous.ai/blog/lidar-datasets/" rel="noopener noreferrer">check out this article</a>.</p><h3 id="which-algorithms-can-be-used-to-process-point-clouds">Which algorithms can be used to process point clouds?</h3><p><strong>In point cloud processing, you can either go with traditional algorithms or 3D Deep Learning.</strong> The split, I would say, depends on the application. When companies want to detect objects in 3D to get bounding boxes, they usually use <a href="https://www.thinkautonomous.ai/blog/voxel-vs-points/" rel="noopener noreferrer">3D Deep Learning algorithms</a> like PointPillars or VoxelNet. Let&apos;s see an example:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/ezgif.com-optimize--1-.gif" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="420" height="336"><figcaption><span style="white-space: pre-wrap;">LiDAR Object Detection - </span><a href="https://courses.thinkautonomous.ai/deep-point-clouds" rel="noreferrer"><span style="white-space: pre-wrap;">taken from my Deep Point Clouds course</span></a></figcaption></figure><p><strong>Outside of 3D Object Detection and 3D Segmentation, the entire world runs on traditional processing approaches</strong>. Since you have points, you can create tons of automated pipelines to process them. For example, you can do plane segmentation, clustering, outlier removal, normal estimation, point data cropping, surface reconstruction, filtering of unwanted data points, and so on, all with traditional approaches.</p><p>For example, you could calculate the surface normals and filter the points that do or don&apos;t belong to the street.</p><p><strong>Another technique can involve </strong><a href="https://www.thinkautonomous.ai/blog/point-cloud-registration/" rel="noopener noreferrer"><strong>point cloud registration and alignment</strong></a><strong>.</strong> When you have multiple point clouds, for example coming from 2 LiDARs, you can align them together into a single object. In the example below, from one of my LiDAR courses, notice how we start with 2 point clouds, a blue and a red, and we end up aligning them perfectly. The merged cloud is more useful than either raw scan alone.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/unnamed--1-.gif" class="kg-image" alt="Point Clouds in Self-Driving Cars: 3 Things Perception Engineers Need to Know" loading="lazy" width="729" height="355" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/unnamed--1-.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/unnamed--1-.gif 729w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">LiDAR Stitching - </span><a href="https://courses.thinkautonomous.ai/point-clouds" rel="noreferrer"><span style="white-space: pre-wrap;">taken from the DLC of my Point Clouds Conqueror course</span></a></figcaption></figure><p>In the algorithm category, there are countless applications. Everything related to SLAM or Odometry is also heavily used today.</p><h3 id="applications-which-jobs-can-you-target-with-point-cloud-skills">Applications: Which jobs can you target with Point Cloud skills?</h3><p>Regarding the applications, we could write an entire article. Yet, let me give you a few core jobs you can target with point cloud processing skills:</p><ul><li><strong>Perception Engineer, Autonomous Vehicles: </strong>Process LiDARs and RADARs to find objects in the 3D space. Use Sensor Fusion to mix the output with Computer Vision. 
Build autonomous vehicles, shuttles, delivery robots, and create the future.</li><li><strong>Nuclear SLAM Engineer, Robotics</strong>: Use point cloud processing techniques inside robots that explore caves or regions humans can&apos;t go to, such as nuclear sites, and build maps of the world.</li><li><strong>BIM Engineer, Architecture</strong>: Create digital models of buildings and structures by processing raw point cloud data captured from laser scanners or photogrammetry. These models help architects and engineers visualize object properties, plan renovations, and ensure precise construction. The role often involves using processing software to convert points into computer-aided design (CAD) models, performing manual correction to refine the data, and integrating the results into architectural workflows for improved design and quality inspection.</li><li><strong>Medical Imaging Engineer</strong>: Apply point cloud techniques to CT-Scans, MRIs, and other 3D data types to detect diseases and save lives. There is both a commercial and research use.</li><li><strong>Drone Engineer, Agriculture</strong>: Process Camera and RADAR/LiDAR information to navigate, and help drones speed up agricultural work and meet population needs.</li><li>and many, many more...</li></ul><p>Alright, now let&apos;s see a summary...</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>A point cloud is a series of 2D or 3D points.</strong> A point cloud is to the LiDAR what a pixel is to a camera.<br>(Re-read that one.)</li><li><strong>Each point of a cloud usually contains at least the XYZ information,</strong> but many sensors or techniques also provide Intensity, Reflectivity, Velocity, Ring/Channel, Color, Temperature, Infrared, and more...</li><li><strong>A point cloud file comes in one of 2 formats: ASCII or Binary.</strong> An ASCII file is more readable for humans; Binary is more readable for robots. Each file is a list of points and their information.</li><li><strong>There are 2 ways to build a point cloud: Active and Passive</strong>. Active techniques involve sensors like LiDARs, RADARs, or RGB-D cameras, while passive techniques use photogrammetry and 3D reconstruction to retrieve 3D models.</li><li><strong>Point Cloud Processing typically involves 3 stages</strong>: the tools/libraries, the algorithms, and the applications. Tools are libraries like Open3D or PCL, algorithms are either traditional or deep learning, and applications go from self-driving cars to robotics, drones, augmented reality, the architecture industry, agriculture, and beyond.</li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text">If you want to learn more about point clouds, I highly recommend you read my other posts, and <a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer">join my daily emails</a>, where I often talk about LiDARs, Computer Vision, and more cutting-edge AI Applications.<a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer"> You can read them here.</a></div></div>]]></content:encoded></item><item><title><![CDATA[Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?]]></title><description><![CDATA[Tesla vs Waymo: Is this worth making another comparison? 
Well, I think they are not really comparable, yes, one of them has a better map to Level 5, and if you'd like my expert opinion on who, I invite you to read!]]></description><link>https://www.thinkautonomous.ai/blog/tesla-vs-waymo-two-opposite-visions/</link><guid isPermaLink="false">62a25f550f1a5e26a580b870</guid><category><![CDATA[startups]]></category><category><![CDATA[self-driving cars]]></category><category><![CDATA[tesla]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Wed, 10 Sep 2025 15:57:00 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/09/tesla-vs-waymo-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/tesla-vs-waymo-1.jpg" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?"><p><strong>Where do you think humans come from?</strong> Growing up, I wrestled with two conflicting ideas about it. One teacher taught me Darwin&#x2019;s theory of evolution: a gradual process of adaptation, rooted in <u>science</u> but riddled with gaps and errors. The other taught me the Bible&#x2019;s Old Testament: the story of divine creation, in which even though nothing was ever proven false, this isn&apos;t built on any &quot;proof&quot;.</p><p><strong>Since then, I realized they were trying to answer the same question</strong>, but operated in entirely <u>different realms</u>, each with its own logic and purpose. One belonged to science, the other to faith. Making head-to-head comparison was almost meaningless.</p><p><strong>And I think this contradiction also exists in self-driving cars,</strong> especially when opposing 2 giants: Tesla and Waymo. Both seem to chase the same prize of &quot;Level 5&quot; autonomy, but when you look closer, their paths are so distinct they&#x2019;re barely comparable.</p><p><strong>In this article, we&apos;re going to try and understand who has what I call the best &quot;<em>Map to Level 5</em>&quot;</strong>, we&apos;ll take a side-by-side comparison, and I&apos;ll give my opinion on 3 aspects:</p><ul><li>The sensor suite</li><li>The algorithms</li><li>The &quot;map&quot;, meaning strategy, vision, and more...</li></ul><p>Let&apos;s begin with the sensors...</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">People often compare Tesla and Waymo using irrelevant criteria such as LiDAR vs camera. I have come up with a comprehensive comparison video using research papers and more generally algorithms. Interested? Click <a href="https://edgeneers.thinkautonomous.ai/posts/content-library-updates-tesla-vs-waymo-algorithmic-view" rel="noreferrer"><b><strong style="white-space: pre-wrap;">here</strong></b></a>! (In case you do not have an account yet, you can sign up for one or visit <a href="https://www.thinkautonomous.ai/sdc-app">https://www.thinkautonomous.ai/sdc-app</a>)</div></div><h2 id="tesla-vs-waymo-who-has-the-best-sensor-suite">Tesla vs Waymo: Who has the best Sensor Suite?</h2><p><strong>Back when I was studying driverless cars,</strong> it was around 2017, when I heard an interview with Sebastian Thrun, godfather of self-driving cars, talking about who was ahead in the race. 
I vividly remember his words: &quot;<em>Nissan is doing pretty good, but I think the company who is ahead of everyone is actually Tesla</em>&quot;.</p><p><strong>It was 2017, and I remember feeling surprised by this comment</strong>, because at the time, Tesla only had a light ADAS feature working with mobileye, and companies like Waymo, Mercedes, Nissan, and others seemed to be covered everywhere in the media, have &quot;real&quot; self-driving car abilities, and more potential.</p><p><strong>What about today? </strong>Who is ahead? Closer to remove human drivers? Who has the better vision? The better algorithms? The better sensors? Is it Waymo, or Tesla... or someone else?</p><p><strong>In this first part, I want to answer it from a sensor angle</strong>. And to do so, I&apos;m going to start by a screenshot of a very popular X (tweet?) from Elon Musk about RADARs, LiDARs, and cameras from August 2025.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-10.20.27-1.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1406" height="808" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-03-at-10.20.27-1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-03-at-10.20.27-1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-10.20.27-1.jpg 1406w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Elon Musk&apos;s comment on X (</span><a href="https://x.com/elonmusk/status/1959831831668228450" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>This comment raised an army of furious engineers and fusion experts</strong>, mentioning Kalman Filters, and Redundancy, and <a href="https://www.thinkautonomous.ai/blog/9-types-of-sensor-fusion-algorithms/" rel="noopener noreferrer">Sensor Fusion</a>. So before we dive into the exactness of this comment, I would like to describe what each company is doing...</p><h3 id="waymo-29-cameras-6-radars-5-lidars">Waymo: 29 Cameras, 6 RADARs, 5 LiDARs</h3><p><strong>If you look at a Waymo car, you&apos;re going to see exactly the opposite of Tesla: tons of sensors all over the place</strong>. There are RADAR sensors on the front, side and rear, there&apos;s 29 cameras, and 5 LiDARs. The question we can ask is... &quot;Is Waymo trying to kill a fly with a bazooka?&quot;.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/tesla-vs-waymo-sensors.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="1008" height="567" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/tesla-vs-waymo-sensors.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/tesla-vs-waymo-sensors.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/tesla-vs-waymo-sensors.jpg 1008w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo&apos;s sensor stack</span></figcaption></figure><p><strong>Just in terms of calibration, it must be an absolute <u>nightmare</u> for engineers.</strong> Calibrating a camera with a LiDAR is already a long task, but 29 cameras with 5 LiDARs? Then there is the fusion of all of these sensors together, and this is ONLY the &quot;robotaxi&quot; version, because if you look at their Zeekr shuttles, they also have their own types of sensors, with different generation codes. Their stack is therefore always evolving, and depends on the vehicle they drive on.</p><h4 id="what-type-of-lidar-camera-and-radar-is-waymo-using">What type of LiDAR, Camera, and RADAR is Waymo using?</h4><p>Let&apos;s take a brief look at what each sensor sees:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/ScreenRecording2025-09-10at14.29.22-ezgif.com-optimize.gif" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="640" height="226" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/ScreenRecording2025-09-10at14.29.22-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/ScreenRecording2025-09-10at14.29.22-ezgif.com-optimize.gif 640w"><figcaption><span style="white-space: pre-wrap;">What Waymo&apos;s sensors see (</span><a href="https://waymo.com/" rel="noreferrer"><span style="white-space: pre-wrap;">Waymo</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>Waymo uses different types of cameras. </strong>They work with high-res long-range cameras (color, telephoto lenses) for object detection far down the road, wide-angle cameras for close-range coverage (pedestrians, cyclists, intersections), and near-infrared cameras for night vision / low-light perception.</p><p><strong>Regarding LiDARs, they&apos;re using their own sensors called &quot;Laser Bear Honeycomb&quot;</strong>. There is one forward LiDAR, 2 side LiDARs, and one at the rear. But these are short range, solid-state LiDARs. They are excellent for blind spots and front facing vehicles, but complex to drive on highways because they don&apos;t see far. This is why there is the roof LiDAR, which is mechanical, and sees several hundred meters away. </p><p>On the animation below, you can see LiDARs both in point clouds format and in range-view &#x2014;&#xA0;and  <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noreferrer">you can learn about types of LiDARs here</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/waymoslidar-ezgif.com-optimize--1--1.gif" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="640" height="360" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/waymoslidar-ezgif.com-optimize--1--1.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/waymoslidar-ezgif.com-optimize--1--1.gif 640w"><figcaption><span style="white-space: pre-wrap;">How Waymo&apos;s sensor complement eachother (source: </span><a href="https://portal.thinkautonomous.ai/self-driving-cars" rel="noreferrer"><span style="white-space: pre-wrap;">THE SELF-DRIVING CAR ENGINEER SYSTEM</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>Regarding RADARs, Waymo uses their own line of </strong><a href="https://www.thinkautonomous.ai/blog/imaging-radar/" rel="noreferrer"><strong>Imaging RADARs</strong></a>. Imaging RADARs are what I call <a href="https://www.thinkautonomous.ai/blog/fmcw-lidars-vs-imaging-radars/" rel="noreferrer">4D RADARs</a>. Unlike normal RADARs who see in 2D and measure the velocity, these ones see in 3D and measure the velocity.</p><p>Do you want to see what all of this looks like together? Okay, here it is:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/ScreenRecording2025-09-10at14.39.44-ezgif.com-optimize.gif" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="480" height="272"><figcaption><span style="white-space: pre-wrap;">The orange represents the point clouds and detections &#x2014; the blue represents the Imaging RADAR signatures &#x2014;&#xA0;the cameras are at the bottom row</span></figcaption></figure><p>Waymo uses a powerful array of sensors allowing them to see every possible object. We&apos;ll come back to the utility of LiDARs and RADARs, but for now, let&apos;s look at cameras...</p><h3 id="2-teslas-sensor-design-8-cameras-thats-it">2. Tesla&apos;s Sensor Design: 8 Cameras, that&apos;s it </h3><p>Unlike Waymo vehicles, Tesla&apos;s approach relies only on cameras. Tesla&apos;s autopilot aims to solve autonomous driving using vision-only.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.07.29--1--1.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1614" height="750" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-03-at-11.07.29--1--1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-03-at-11.07.29--1--1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/Screenshot-2025-09-03-at-11.07.29--1--1.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.07.29--1--1.jpg 1614w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla&apos;s sensor stack used 8 cameras (red) and 12 ultrasonics (orange)</span></figcaption></figure><p><strong>On this illustration, you can see a very simple design that almost never changed. </strong>In red, you can count 8 outside cameras, and in orange, you can see 12 ultrasonic sensors used to detect static objects when parking. Among the 8 cameras, there 2 on the windshield, used for stereo vision, one on the front bumper, one on the rear bumper, 2 on the doors, and 2 on the wheels. See the difference? 
I can&apos;t even begin to count Waymo&apos;s cameras, but I can easily show you Tesla&apos;s sensor stack.</p><p><strong>So let&apos;s visualize what the cameras see:</strong></p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.15.53--1--1.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1560" height="1110" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-03-at-11.15.53--1--1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-03-at-11.15.53--1--1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.15.53--1--1.jpg 1560w" sizes="(min-width: 720px) 720px"></figure><p>Interesting, or not. Now, let&apos;s try and understand, who has the best sensor suite?</p><h3 id="3-who-has-the-best-sensor-suite-tesla-or-waymo">3. Who has the best sensor suite... Tesla or Waymo?</h3><p><strong>I am going to show you an image</strong>, and I would like you to ONLY look at the left part. Ignore the right for now. Can you tell me what you see? You see a car, don&apos;t you?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-14.41.15.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="2000" height="639" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-14.41.15.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-14.41.15.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/Screenshot-2025-09-10-at-14.41.15.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-14.41.15.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Seeing under fog or difficult conditions is the limiting point of a camera only architecture, isn&apos;t it?</span></figcaption></figure><p><strong>But did you notice the pedestrian?</strong> This is one of the limits of the vision-only approach. When you look at Tesla&apos;s miles driven without disengagement reports, Tesla FSD clearly shows a limit on bad weather. They don&apos;t drive well on cloudy foggy, rainy, or snowy scenes, and they don&apos;t drive at all during storm and sleet (when ice falls from the sky).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-14.57.04.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="1048" height="620" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-14.57.04.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-14.57.04.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-14.57.04.jpg 1048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla can&apos;t drive autonomously in regions like snow, storm, sleet, fog, and even heavy rains (</span><a href="https://teslafsdtracker.com/Main" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>Reports show that FSD13 significantly improved driving by night,</strong> but there is still a <u>physical</u> limitation to driving with cameras only that is not solvable with better algorithms. In a robotaxi situation in which you&apos;d sit in the passenger seat, you would be stuck. The way humans drive involves more than cameras, we hear street sound, we sense people, and we don&apos;t use wide-angle cameras to detect other cars.</p><p><strong>If we were to come back to Elon Musk&apos;s comment now, do you remember the &quot;If LiDARs/RADARs disagree with cameras, which one wins???&quot;</strong>. Waymo shows that redundancy is key to safety. If a camera misses something, your RADAR may not. And with Kalman Filters, you can certainly develop a powerful fusion module to account for disagreements.</p><p><strong>When we consider the physical ability of a LiDAR to generate point clouds, you can understand how powerful having them is</strong>. Other than seeing through night or other situations, they physically build <a href="https://www.thinkautonomous.ai/blog/point-clouds/" rel="noreferrer">point clouds</a>. Recently, a YouTube video has shown a Tesla vs LiDAR-equipped car driving on a wall resembling a street. The Tesla FSD crashed on the wall, confusing it with the highway; but the LiDAR-equipped car stopped.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-15.19.51--1-.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1872" height="802" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-15.19.51--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-15.19.51--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/Screenshot-2025-09-10-at-15.19.51--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-15.19.51--1-.jpg 1872w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla vs LiDAR (</span><a href="https://www.youtube.com/watch?v=IQJL3htsDyQ" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>You may wonder... Okay, but we&apos;ll never see fake walls in real life, so what&apos;s the point? The point is that vision only has its limitations. We&apos;ve seen Tesla confuse the moon with objects, miss a red light, do phantom breaks, and even miss truck trailers. 
<strong>For all of these reasons, I would say that the Waymo approach including Camera+LiDAR+Imaging RADARs is a better choice. They get the #1 point.</strong></p><p>One caveat is that the algorithms, processing power, and energy required to operate this vehicle is insane. LiDAR sensors consume a lot of energy and record a lot of data. Tesla is much cleaner in that perspective.</p><p>Speaking of algorithms, let&apos;s now move to this second point.</p><h2 id="tesla-vs-waymo-who-has-the-best-algorithms">Tesla vs Waymo: Who has the best algorithms?</h2><p>If we now look at the algorithms, who is closer to build self-driving cars? Tesla and Waymo started off with very different architectures, but now seem to converge towards End-To-End Learning. So let&apos;s take a look...</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Hey, I have to make a confession: I couldn&#x2019;t dive deeper into the algorithm side here, but I recorded a full comparison of Tesla and Waymo&#x2019;s architectures. You can get access <a href="https://www.thinkautonomous.ai/blog/de7911199f91436b9edad2fc7bdce9b9?pvs=25" rel="noopener noreferrer"><b><strong style="white-space: pre-wrap;">here</strong></b></a>.</div></div><h3 id="1-tesla-fsd-algorithms-hydranets-occupancy-networks-and-end-to-end-learning">1. Tesla FSD Algorithms: HydraNets, Occupancy Networks, and End-To-End Learning</h3><p>In my <a href="https://www.thinkautonomous.ai/blog/tesla-end-to-end-deep-learning/" rel="noreferrer">article breakdown on Tesla</a>, I&apos;m doing a full deep dive on the Tesla&apos;s algorithm, and how they work; so I won&apos;t do that here, but I will still show you the overview of how they built their <a href="https://www.thinkautonomous.ai/blog/autonomous-vehicle-architecture/" rel="noreferrer">autonomous vehicle architecture</a>. Note that it&apos;s according to the Tesla data; which moved to private in 2023.</p><p>You can see 3 main blocks:</p><ul><li><strong>Lane &amp; Object HydraNet: </strong>The lane and object <a href="https://www.thinkautonomous.ai/blog/how-tesla-autopilot-works/" rel="noreferrer">Hydranet</a> is a multi-task learning network that takes in the 8 cameras, learns features from each using a CNN, fuses them spatially and temporally via a Vision Transformer, and then outputs several heads. Heads are trained to detect objects, lanes, positions, and so on... You can read more details here.</li><li><strong>Occupancy Network</strong>: The <a href="https://www.thinkautonomous.ai/blog/occupancy-networks/" rel="noreferrer">Occupancy Network</a> is also processing all 8 cameras spatially and temporally, except that it&apos;s trained to leverage spatial data. This is a 3D network that aims to build voxels and assign a free/occupied state to each. You can read more details here.</li><li><strong>Planning &amp; Control</strong>: The Planning &amp; Control node used to be (in the drawing) done via a Monte-Carlo Tree Search. This is traditional artificial intelligence. In 2024, they replaced this with a Neural Network planner. 
While we don&apos;t have details on how it works, the &quot;End-To-End&quot; comes from this node moving to Deep Learning, making the entire network differentiable.</li></ul><p>To push the explanation even further, let me show you the typical visualizers on a Tesla, and see how they both refer to an algorithm:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.45.47--1--2.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1528" height="806" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-03-at-11.45.47--1--2.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-03-at-11.45.47--1--2.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.45.47--1--2.jpg 1528w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla&apos;s algorithm visualizations</span></figcaption></figure><p>Now, let&apos;s see Waymo...</p><h3 id="2-waymos-algorithms-3d-deep-learning-diffusion-planners-and-more">2. Waymo&apos;s Algorithms: 3D Deep Learning, Diffusion Planners, and more...</h3><p>It&apos;s a bit harder to fully track the state of Waymo&apos;s algorithms, because they continuously release <a href="https://waymo.com/research/" rel="noopener noreferrer">multiple research papers</a>, and we don&apos;t know which ones are in production, and which are just pure research. Still, according to my research, there are 3 core pillars Waymo relies on to drive...</p><ul><li>LiDARs</li><li>Prediction/Tracking</li><li>Imitation/End-To-End</li></ul><h4 id="lidars">LiDARs</h4><p><strong>Early on, Waymo pioneered work on LiDARs with 3D Object Detection algorithms like SW-Former</strong>. These algorithms process LiDAR point clouds and output bounding boxes in 3D. This architecture has been updated a few times, but now serves as the &quot;core&quot; detection algorithm of Waymo. From there, it&apos;s encapsulated into other pipelines, like the Late-To-Early Fusion:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.58.31--1--1.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="1724" height="784" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-03-at-11.58.31--1--1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-03-at-11.58.31--1--1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/Screenshot-2025-09-03-at-11.58.31--1--1.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-03-at-11.58.31--1--1.jpg 1724w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo&apos;s </span><a href="https://waymo.com/research/swformer-sparse-window-transformer-for-3d-object-detection-in-point-clouds/" rel="noreferrer"><span style="white-space: pre-wrap;">SW-Former</span></a><span style="white-space: pre-wrap;"> &amp; </span><a href="https://waymo.com/research/lef-late-to-early-temporal-fusion-for-lidar-3d-object-detection/" rel="noreferrer"><span style="white-space: pre-wrap;">Late-To-Early Fusion</span></a><span style="white-space: pre-wrap;"> algorithms</span></figcaption></figure><p><strong>See what&apos;s happening?</strong> We have a temporal fusion algorithm that processes LiDARs and boxes from t-1, t-2, and so on... and each of these go to SWFormer to output a final head. This is turning SWFormer into a temporal detector, and not just a frame-by-frame detector.</p><h4 id="prediction-tracking">Prediction &amp; Tracking</h4><p><strong>Going even further, we have Prediction &amp; Tracking. </strong>Waymo bets big on tracking, and one of the core algorithms I noticed there is an architecture recently released called Stateful Track Transformer which is doing exactly the job of tracking from SWFormer. Over the years, Waymo released TONS of prediction and tracking architectures.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/b94de5c6-8064-4b43-bd37-8ba9aa294877.jpeg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1594" height="644" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/b94de5c6-8064-4b43-bd37-8ba9aa294877.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/b94de5c6-8064-4b43-bd37-8ba9aa294877.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/b94de5c6-8064-4b43-bd37-8ba9aa294877.jpeg 1594w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo&apos;s Stateful Track Transformer (</span><a href="https://waymo.com/research/stt-stateful-tracking-with-transformers-for-autonomous-driving/" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h4 id="end-to-endimitation">End-To-End/Imitation</h4><p>Back in the early 2020s, I remember vividly Waymo mentioning an algorithm called ChauffeurNet, who was behaving exactly like Tesla&apos;s HydraNet, but was outputting trajectories. Since then, the approach evolved, and the later published papers and public talks mention the End-To-End architecture named EMMA, as well as Vision Language Models.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-15.30.42--1-.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="2000" height="731" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-15.30.42--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-15.30.42--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/09/Screenshot-2025-09-10-at-15.30.42--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-15.30.42--1-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo&apos;s Encoder/Decoder Architecture (</span><a href="https://io.google/2025/explore/pa-keynote-22" rel="noreferrer"><span style="white-space: pre-wrap;">Google I/O 2025</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>As you can see, this turned in to an Encoder/Decoder architecture,</strong> where the encoder learns features from each sensors, then fuses spatially and temporally, to learn a compressed representation of the scenes. Then, the decoder is a generative part, built on VLMs to predict a trajectory.</p><p><strong>While many claimed Waymo has a &quot;modular&quot; approach,</strong> <strong>while Tesla has the advanced End-To-End approach; this is simply no longer the case.</strong> Although we don&apos;t know whether Waymo uses EMMA in production, or still relies on their traditional pipeline, we definitely know they&apos;re heading towards it.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Are you enjoying this part? I am doing a full coverage of all these algorithms </strong></b>&#x2014;&#xA0;via a 1h Tesla Masterclass &#x2014; and a detailed algorithmic comparison of Tesla &amp; Waymo in my platform.<br><br>It&apos;s reserved to my daily email readers, and if you&apos;d like to join us: <a href="https://www.thinkautonomous.ai/sdc-app" target="_blank" rel="noopener noreferrer"><b><strong style="white-space: pre-wrap;">You can sign up here for free and get the deep dives</strong></b></a><b><strong style="white-space: pre-wrap;">.</strong></b></div></div><h3 id="3-who-has-the-better-algorithms-waymo-or-tesla">3. Who has the better algorithms? Waymo or Tesla?</h3><p>Let me write 4 or 5 bullet points explaining what I think:</p><ul><li><strong>Both Tesla and Waymo seem to be headed towards End-To-End Learning, because the Modular approach has the car behave in a &quot;robotic&quot; fashion. </strong>This is not smooth, feels robotic, and rule based. So both go towards End-To-End...</li><li><strong>But to make End-To-End work well, you need LOTS of data</strong>. Tesla has that from their fleet of millions of cars all across the world, but Waymo has a small fleet driving only in very specific regions. Scaling via End-To-End will be extremely painful for them.</li><li><strong>On top of that, Tesla only needs to process camera data</strong>, which makes the algorithms likely faster, using less power and less time consuming.</li><li><strong>End-To-End&apos;s biggest problem is edge cases</strong>. Tesla has experience driving several million miles in construction zones, parking lots, or in a new city, and so on... 
Waymo, on the other hand, is stuck with HD Maps and can have issues moving towards End-To-End...</li><li><strong>Tesla also has advanced techniques for Edge Case Detection and retraining,</strong> such as Trigger Classifiers (<a href="https://www.thinkautonomous.ai/blog/automotive-data-processing/" rel="noreferrer">see my detailed overview here</a>), as well as Dojo and Self-Supervised Learning. From my perspective, they seem far more prepared for End-To-End Learning than Waymo. It&apos;s as if Tesla had paved the way for this for a decade, while Waymo pivoted at the last minute.</li></ul><p>Given all this, and under the assumption that the prize here is End-To-End, I would give my points to Tesla.</p><h2 id="waymo-vs-tesla-who-has-the-better-map-to-level-5">Waymo vs Tesla: Who has the better Map to Level 5?</h2><p>In this last point, I would like to step away from sensors and techniques, and look at these companies primarily as businesses. Their goal is to sell self-driving cars, or autonomous rides, and thus... who&apos;s leading in that sense?</p><h3 id="1-strategy">1. Strategy</h3><p>First of all, something very important to understand:</p><ul><li>Tesla sells self-driving cars</li><li>Waymo rents autonomous transportation services</li></ul><p><strong>This is essential, because this means their software, sensor stack</strong>, philosophies, business models, and even strategies to reach Level 5 are totally opposed. This is very clear when you see the graph below, which shows Waymo starting with a very capable vehicle, but only in ONE geo-fenced area, with ONE car, while Tesla starts with millions of cars, but none of them are autonomous. They are not scaling the same thing:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-11.59.59-1.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1392" height="870" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-11.59.59-1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-11.59.59-1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-11.59.59-1.jpg 1392w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The strategies of Tesla and Waymo are not the same</span></figcaption></figure><p><strong>Tesla&apos;s goal is to make millions of cars at a 25,000$ price point</strong>. Imagine Tesla in 2016, being told to integrate LiDAR technology, which cost over 50k USD/unit. Wouldn&apos;t they be better off starting with a light camera + RADAR Level 2, and gradually improving it? Of course they would, if they believe it&apos;s possible.</p><p><strong>On the other hand, Waymo had a fleet of maybe 20 vehicles at the time.</strong> With 29 cameras, 6 RADARs, and 5 LiDARs, a Waymo car costs significantly more than a Tesla car. In fact, each car was estimated to cost around 250,000$ a few years ago. Since then, the LiDAR price dropped, Waymo grew its fleet to over 2,000 cars, and each is now estimated around 150,000$. Can you see how the cost is less and less of a problem for them over time?</p><p><strong>Waymo&apos;s LiDARs certainly cost a lot, but this cost gets absorbed as they do more paid rides</strong>, <strong>until it &apos;supposedly&apos; becomes profitable.</strong> Supposedly, because there is maintenance, replacement, and growth of the fleet, which can make the cost a never-ending fight. Even there, Waymo has more leverage to afford the LiDARs, by raising the ride cost for millions of riders from 7$ to maybe 13$ (made up numbers, terribly off &#x2014;&#xA0;just making a point).</p>
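<p>To make that intuition concrete, here is a tiny back-of-the-envelope sketch in Python. The numbers are as made up as the ones above; the only point is to show how a fixed sensor cost gets diluted over paid rides:</p><pre><code class="language-python"># Back-of-the-envelope sketch with made-up numbers (these are NOT real figures):
# how many paid rides does it take to absorb the extra sensor cost of one car?
extra_sensor_cost_per_car = 20_000   # hypothetical LiDAR + RADAR premium, in $
extra_margin_per_ride = 6            # hypothetical price increase per ride, in $ (7$ to 13$)

rides_to_break_even = extra_sensor_cost_per_car / extra_margin_per_ride
print(f"Roughly {rides_to_break_even:,.0f} rides to absorb the sensor cost of one car")
</code></pre><p>The exact values don&apos;t matter; what matters is that the more rides a car completes, the less the sensor stack weighs on the economics.</p>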
<p><strong>There are great statistics on </strong><a href="https://www.01core.com/p/driverless-car-costs-have-gotten" rel="noopener noreferrer"><strong>this blog post</strong></a><strong> from Ben Buchanan</strong> that show how Waymo&apos;s cars get more and more affordable over time. A LiDAR costs 500-1,000$ today, which completely challenges the vision-only philosophy. Unlike Tesla, time is on Waymo&apos;s side.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-12.45.58.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1124" height="676" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-12.45.58.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-12.45.58.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-12.45.58.jpg 1124w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo has time on their end. The more rides they make, the more they get their investment back</span></figcaption></figure><p><strong>You can therefore understand how each one wins</strong>:</p><ul><li><strong>Tesla starts with TONS of disengagements</strong> and tries to decrease that number to 0. <u>The more capable the algorithm, the more cars they will sell.</u></li><li><strong>Waymo starts with very few disengagements </strong>and tries to scale the number of rides and regions without increasing this number. <u>The more regions they cover, the more rides they will sell.</u></li></ul><h3 id="2-hd-maps"><strong>2. HD Maps</strong></h3><p><strong>Waymo is betting on a serious &quot;HD Map&quot; strategy that Tesla refused to adopt</strong>. According to Tesla, the car should be able to drive anywhere in the US, so they only use standard maps like Google Maps or OpenStreetMap. It does not mean they don&apos;t use HD Maps; they do (see screenshot below), but they don&apos;t <u>require</u> them to drive. If the car ends up in a parking lot with no map, it should still be able to drive.</p><p><strong>On the other hand, Waymo maps every square inch of every place they drive in. </strong>This means every traffic sign, every speed bump, every crossroad, every traffic light, every roadwork, lane lines, speed limit... <u>everything</u> is continuously mapped and updated. On the image below, you can see Waymo&apos;s HD Maps and a screenshot of Tesla&apos;s HD Maps on a car in debug mode.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-12.22.53--1-.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" 
loading="lazy" width="1434" height="620" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-12.22.53--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-12.22.53--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-12.22.53--1-.jpg 1434w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla vs Waymo&apos;s HD Map Game</span></figcaption></figure><p>From what I read, Tesla performs poorly on regions where they don&apos;t know the maps. While this doesn&apos;t block them, this certainly causes disengagements. They are therefore equal in this sense.</p><h3 id="3-miles-driven-disengagements">3. Miles Driven &amp; Disengagements</h3><p><strong>We can&apos;t conclude this article without first looking at who slams the brakes the most</strong>. Yet, I feel you already know the answer, from the first two points. It&apos;s obviously Tesla, because they have many more cars, miles driven (5B for Tesla vs 100M for Waymo), and different situations. So how can we really vote who has the better map to Level 5?</p><p><strong>Let&apos;s first look at disengagements. </strong>What is a disengagement? Is overtaking a stuck vehicle a disengagement? Is accelerating? What is the definition, and do Tesla and Waymo use the same? Well, Waymo uses safety drivers, who obey specific instructions. Tesla drivers disengage for virtually any reason, even if they simply feel like it. This is why <a href="https://teslafsdtracker.com/Main" rel="noopener noreferrer">Tesla FSD Tracker</a> shows, as of September 25, <u>24% of FSD drives have a disengagement, and 3% have critical disengagements (US).</u></p><p><strong>Continuing with more stats:</strong> Tesla drives on average 213 miles before a disengagement on a highway, and most disengagements are caused by Lane Issues. Notice the jump in miles driven without a disengagement with FSD <strong>12.6</strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-13.00.53.jpg" class="kg-image" alt="Waymo vs Tesla: Who is closer to Level 5 Autonomous Driving?" loading="lazy" width="1548" height="676" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/09/Screenshot-2025-09-10-at-13.00.53.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/09/Screenshot-2025-09-10-at-13.00.53.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/09/Screenshot-2025-09-10-at-13.00.53.jpg 1548w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Disengagement Reports of Tesla (</span><a href="https://teslafsdtracker.com/Main" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>For Waymo, it&apos;s a different story</strong>. Waymo has a driver permit in California, which had them recently release the <a href="https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/" rel="noopener noreferrer">2024 disengagement report</a> to California DMV. It clearly showed a number of 9,793 miles driven before disengagement. You may ask... How is Tesla at 200 but Waymo at 9,700 miles driven without disengagement? 
It&apos;s because, as we said, the definition, reason for disengagement, and cities, are extremely different.</p><p>If a company wants to drive on a straight line for 500 million miles and show 0 disengagement, they technically can. This is why I think, we can&apos;t really trust any of these numbers &#x2014;&#xA0;only their relative evolution on the same conditions as before &#x1F937;&#x1F3FB;&#x200D;&#x2642;&#xFE0F;</p><p><strong>So who has the better Map to Level 5?</strong> I will tell you, but only after we have made a quick summary of what we&apos;ve seen...</p><h2 id="summary-who-is-ahead-waymo-or-tesla">Summary: Who is ahead, Waymo or Tesla?</h2><ul><li><strong>Tesla and Waymo both have different sensor stacks</strong>. While Waymo relies on a stack of 29 cameras, 6 radars, and 5 LiDARs, Tesla takes a different route and relies on 8 cameras only.</li><li><strong>A sensor setup with LiDARs and RADAR redundancy is safer and gives more reliability than a vision-only setup. </strong>LiDARs can detect objects and brake the car even if no object is identified by the camera or an algorithm. Waymo also drives in more weather conditions than Tesla, who is physically limited by the lack of other sensors.</li><li><strong>Waymo uses an architecture based on LiDARs,</strong> with algorithms like SW-Former, Prediction/Tracking, and EMMA as an End-To-End system.</li><li><strong>Tesla uses an End-To-End approach</strong> involving HydraNets, Occupancy Networks, and Deep Planning. This approach is reinforced by powerful Trigger classifiers, Self-Supervised Learning Dojo, a powerful data fleet &#x2014;&#xA0;making them win on the algorithm aspect.</li><li><strong>Waymo depends heavily on detailed HD Maps and can only work on these mapped, geo-fenced areas</strong>. Tesla&#x2019;s system is designed to drive anywhere without necessarily relying on HD maps. In practice, they drive much better when they have maps.</li><li><strong>Waymo boasts a much better disengagement rate, </strong>with safety drivers rarely needing to take control compared to Tesla&#x2019;s more frequent interventions; but the conditions they drive in and disengage are 100% under <u>their</u> control.</li><li><strong>The two companies have very different business models:</strong> Waymo sells autonomous ride services, Tesla sells cars with self-driving features. As a result, their map to level 5 is not the same.</li></ul><p>And now, you&apos;re all caught up. So, the Map to Level 5?</p><p></p><h3 id="the-map-to-level-5">The Map to Level 5</h3><p>Here is what I think:</p><p><strong>To reach Level 5, Tesla will need to <u>reduce</u> the number of disengagements. </strong>To me, they will have to include LiDARs or RADARs at some point. At today&apos;s cost, that would probably be feasible. In fact, I&apos;m pretty sure Tesla is so competent FSD would be solved right now if they didn&apos;t chose to play the game on hard more. But now, after all this time, can they really afford to do this? There&apos;s ego, brand image, and the &quot;FSD capable&quot; computers they already sold. How can they? Tesla is in the camera game for good.<br></p><p><strong>To reach Level 5, Waymo has to <u>increase</u> the number of areas they drive <u>without increasing</u> the disengagement rate</strong>. This means adapting the &quot;HD Map&quot; strategy, which still relies on them, and making algorithms capable to adapt to a new region faster. To make autonomous driving a reality for the entire world, Waymo will need to go faster. 
Their autonomous vehicles surely are capable, but their current scaling strategy takes too long.</p><h2 id="next-steps">Next Steps</h2><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">I hope you enjoyed this article and it taught you a lot! </strong></b>If you want to go to the next steps, I have a platform for my daily email readers which contains 60 min+ of videos explaining Tesla&apos;s algorithms, and comparing them to Waymo&apos;s architecture. This is a more technical deep dive, and I&apos;m sure you&apos;ll love it. Interested? <a href="https://www.thinkautonomous.ai/sdc-app" target="_blank" rel="noopener noreferrer"><b><strong style="white-space: pre-wrap;">You can sign up here for free and get the deep dives</strong></b></a><b><strong style="white-space: pre-wrap;">.</strong></b></div></div>]]></content:encoded></item><item><title><![CDATA[How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)]]></title><description><![CDATA[Self-driving cars collect Tb of videos every day... but is that really needed? (spoiler: No) 

In this article, you'll discover how to collect data in the AV 2.0 age; from Tesla's Trigger Classifiers to Heex Event Management solutions, you'll learn the different ways to do automotive data processing.]]></description><link>https://www.thinkautonomous.ai/blog/automotive-data-processing/</link><guid isPermaLink="false">685277c9c8f3bf93bd18732c</guid><category><![CDATA[self-driving cars]]></category><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 29 Jul 2025 08:16:16 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/07/automotive-data-processing.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/07/automotive-data-processing.jpg" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)"><p><strong>Have you ever heard the story of the iPod?</strong> It started in January 2001, right after Apple announced a loss of $195 million, and had missed the shift to digital music. The company was lost, and had one last chance of survival: building an MP3 player to catch up with the competitors.</p><p><strong>If you are old enough to remember the MP3 players back then,</strong> they were&#xA0;confusing to use, overloaded with buttons and menus, and made the experience painful for customers. Apple was looking for a solution for months, but had no clue how to make it better.<br><br><strong>Until one day, when Apple&apos;s Head of Marketing Phil Schiller suggested using a scroll wheel</strong>. Wheels already existed in computer mice and dial phones, but had never been used in music players. With this, he suggested that the menus should scroll faster the longer the wheel is turned,&#xA0;a stroke of genius&#xA0;that would distinguish the iPod from the agony of using competing players.<br><br>The rest is history: Apple developed the iPod in the greatest secrecy, launched it, and changed the world with &quot;1000 songs in your pocket&quot;.<br><br><strong>What made it so successful?</strong>&#xA0;It&apos;s not that it looked good, or had buttons, or could store more songs. No, the genius was in the <strong><u>smarter</u></strong>&#xA0;scroll-wheel experience.</p><p><strong>If you&apos;re in the autonomous vehicles market, you&apos;ve probably witnessed a similar pattern:</strong> companies have been collecting more and more data endlessly, building data centers, simulators, hiring people to analyze the data generated, and so on... 
Until some companies came up with smarter ways, not involving just &quot;collecting more data&quot;, but rethinking the experience to focus on events instead.</p><p>In this article, I would like to tell you about the way automotive data processing works nowadays, and how the AI revolution is going to reshape it.</p><p>We are going to learn about 3 ideas:</p><ol><li><strong>The first part is going to focus on the Manual Era</strong> (where we collect and process it all) and the <strong>Cloud</strong> <strong>Era</strong> (where we use DataLakes)</li><li><strong>The second part will be a case-study provided by an autonomous tech startup, </strong>revealing the 10 biggest problems of the Cloud Era.</li><li><strong>The last part will show you the Edge Intelligence Era &amp; the Autonomous Era</strong>, which, as you&apos;ll see, is a far more intelligent way to work.</li></ol><p>Let&apos;s begin with point #1.</p><h2 id="1-data-management-how-self-driving-car-companies-collect-and-process-data-in-the-cloud-era">1. Data Management: How self-driving car companies collect and process data in the Cloud Era</h2><p><strong>One of the things we heard the most this past decade was that Data is king. </strong>And for a long time, collecting as much data as you can in order to train heavy machine learning models has been the only way to operate. Let&apos;s talk about data collection, and then processing.</p><h3 id="how-do-autonomous-vehicles-collect-data">How do autonomous vehicles collect data?</h3><p>We know that when a self-driving car drives, all the data (sensors, images, messages, hardware status, algorithm decisions, ...) is being recorded. Here is how.</p><p><strong>The process is simple, and looks like this:</strong></p><ol><li>You <strong>plug</strong> <strong>your</strong> <strong>sensors</strong> into your system (for example, the Robot Operating System/ROS)</li><li>You<strong> </strong>press<strong> record</strong></li></ol><p><strong>I&apos;m sure somebody out there worked hard to find a more complex process, </strong>but if you&apos;re using a tool like ROS, recording data is as simple as running a single command.</p>
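<p>To give you an idea of how little code this takes, here is a minimal sketch using the ROS 1 Python API. The topic names and the output file are hypothetical, and in practice the built-in recording tool does all of this for you:</p><pre><code class="language-python"># Minimal recording sketch (ROS 1 Python API). Topic names and file name are
# hypothetical; the stock recording tool does the same job out of the box.
import rospy
import rosbag
from sensor_msgs.msg import Image, PointCloud2

bag = rosbag.Bag("drive_recording.bag", "w")

def save(topic):
    # Return a callback that appends every incoming message to the bag
    def callback(msg):
        bag.write(topic, msg)
    return callback

rospy.init_node("simple_recorder")
rospy.Subscriber("/camera/image_raw", Image, save("/camera/image_raw"))
rospy.Subscriber("/lidar/points", PointCloud2, save("/lidar/points"))

try:
    rospy.spin()   # keep recording until the node is shut down (Ctrl+C)
finally:
    bag.close()
</code></pre>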
<p>In the video below, you can see me recording LiDAR point clouds, camera images, GPS positions, algorithm outputs, and most of the messages passing through the self-driving car while we drive...</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/road_data-ezgif.com-optimize.gif" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="480" height="260"><figcaption><span style="white-space: pre-wrap;">Visualizing the live sensor streams of a self-driving car</span></figcaption></figure><p>When I&apos;m done recording, the output is a file with the .<strong><em>bag</em></strong> extension (for ROS 1) that can vary from a few Gb to Terabytes of data. Let me show you an example below from the <a href="https://github.com/TIERS/tiers-lidars-dataset" rel="noreferrer">TIERS dataset</a>. Notice the durations and sizes of the recordings below &#x2014; the last one is just <strong>8 minutes long</strong>, and yet weighs <strong>200Gb</strong>. That is roughly 25Gb/minute!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-14.03.21_1.jpg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1080" height="746" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/Screenshot-2025-06-27-at-14.03.21_1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-27-at-14.03.21_1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-14.03.21_1.jpg 1080w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The size of a single ROS Bag is huge.</span></figcaption></figure><p>The bag is the first element. Then comes what we do with it.</p><h3 id="the-manual-era-how-do-we-process-and-analyze-data">The Manual Era: How do we process and analyze data?</h3><p><strong>The first &quot;era&quot; I&apos;d like to tell you about is the 1.0 era</strong>. Back when I worked on autonomous shuttles, each of our fully autonomous vehicles was driving and collecting data to SSD drives. When the day was over, we had hundreds of Gb to process. So we started coming up with file naming conventions, involving the date, event, and so on...</p><p><strong>Then, back at the office, we could replay our algorithms on it,</strong> train our models on the data, and so on... Below is an example of a <a href="https://www.thinkautonomous.ai/blog/image-segmentation-use-cases/" rel="noopener noreferrer">drivable area segmentation</a> algorithm I&apos;ve been training on the data collected:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/ezgif.com-optimize--11-.gif" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="480" height="270"><figcaption><span style="white-space: pre-wrap;">Example of a &quot;Replay&quot; of a Drivable Area Segmentation Algorithm</span></figcaption></figure><p>This was fine for a small startup of 8 people, and it&apos;s probably still okay for small companies that don&apos;t need extensive processing, but most autonomous vehicle companies have turned to the cloud...</p><h3 id="the-cloud-era-how-advanced-driver-assistance-systems-adas-most-of-the-automotive-industry-is-using-data-lakes">The Cloud Era: How Advanced Driver Assistance Systems (ADAS) &amp; most of the Automotive Industry is using Data Lakes</h3><p><strong>If you record data every day, and each recording is hours long, you&apos;re never going to find the events you need</strong>. 
This is why I&apos;m showing you a more sophisticated, let&apos;s say &apos;1.5&apos; version, which makes data collection part of a pipeline.</p><p>It looks like this:</p><ol><li>You <strong>record</strong> the data</li><li>You <strong>upload</strong> it to AWS/Azure</li><li>The R&amp;D team then <strong>processes</strong> it weeks later, <strong>replaying</strong> all the events, <strong>searching</strong> for 10% possibly interesting scenarios, or events, and so on...</li></ol><p>If you&apos;d like to see real-world concepts, you can see AWS and Azure Data Lakes:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-18-at-12.14.07.jpg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1870" height="628" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/Screenshot-2025-06-18-at-12.14.07.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-18-at-12.14.07.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-18-at-12.14.07.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-18-at-12.14.07.jpg 1870w" sizes="(min-width: 1200px) 1200px"><figcaption><a href="https://www.thinkautonomous.ai/blog/medical-image-segmentation/" rel="noreferrer"><span style="white-space: pre-wrap;">AWS Data Lake</span></a><span style="white-space: pre-wrap;"> vs </span><a href="https://learn.microsoft.com/en-us/industry/mobility/architecture/avops-architecture-content" rel="noreferrer"><span style="white-space: pre-wrap;">Azure Data Lakes</span></a></figcaption></figure><p>A lot of companies in the self-driving car market use these &quot;data lakes&quot;. Let&apos;s look at the Azure Data Lake in a simplified view:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/data-lake.001.jpeg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1280" height="720" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/data-lake.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/data-lake.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/06/data-lake.001.jpeg 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 4 Horsemen of Data Processing: The Bag Recording is just a tiny step</span></figcaption></figure><p>ADAS &amp; fully autonomous cars use it. After recording, the 4 key blocks are:</p><ol><li><strong>DataOps</strong>: Where we analyze data, clean it, label it, augment it, tag it, and so on... Notice the interaction with external labellers; that idea is called &quot;human-in-the-loop&quot;.</li><li><strong>MLOps</strong>: The machine learning algorithms, training, testing, and so on...</li><li><strong>ValidationOps</strong>: The validation part, involving visualization, scenario, and simulation.</li><li><strong>MetaData</strong>: After the DataOps tagged the data, we can search for it.</li></ol><p>You can see how it&apos;s placing data at an element in the chain.</p><p>So what are the problems of this? 
<p>So what are the problems with this? Before telling you about the 2.0 Autonomous Era, let&apos;s look at a case study with real ADAS and artificial intelligence companies using it...</p><h2 id="2-case-study-adas-actors-reveals-their-10-biggest-problems-with-data-driven-approaches">2. [Case Study] ADAS Actors reveal their 10 biggest problems with Data-Driven Approaches</h2><p>In this section, before talking about the &apos;2.0&apos; approach, I would like to tell you about the core problems reported by companies that process large volumes of data.</p><p><strong>Before writing this article, I got the opportunity to talk to </strong><a href="https://www.heex.io/en-gb/smarter-data-faster-decisions" rel="noreferrer"><strong>Heex Technologies</strong></a>, a French startup specialized in Event Based Data Management... and I asked them &quot;Which problem do you solve?&quot;. To answer, they shared a 20 page PDF listing all the problems reported by their biggest Advanced Driver Assistance Systems (ADAS), autonomous driving, and robotics clients from the automotive industry.</p><p>In the PDF, I spotted a lot of interesting problems. Let me share the main ones with you:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/heex-data-processing.001.jpeg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/heex-data-processing.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/heex-data-processing.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/06/heex-data-processing.001.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/06/heex-data-processing.001.jpeg 1920w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Case Study reveals tons of time and money issues in classical data processing</span></figcaption></figure><p><strong>If I were to list the 10 main problems, you&apos;d see:</strong> <u>Slow</u> access to critical events, <u>Manual</u> data processing, <u>Exploding</u> Cloud costs, <u>Fragmented</u> data, <u>Delayed</u> visualization (no real-time), <u>Manual</u> Extraction of scenarios, <u>Useless</u> streaming data, <u>Physical</u> SSD Extraction, <u>Blind</u> Debugging, and <u>Inefficient</u> ROS Bag Processing.</p><p>Notice all these terms I underlined? These are the problems of today&apos;s data management systems.</p><p>Let&apos;s take some examples...</p><ul><li><strong>If you collect the data on Day 1, and process it on Day 3, </strong>you have slow/delayed access to critical events, like a missed pedestrian. So you&apos;re driving, notice something wrong, but you have to wait 2 days to even look for the data, and start searching for that event you noticed...</li><li><strong>Similarly, can you see how the &apos;fragmented&apos; data processing is a problem? </strong>Especially when you work as a team. Engineer A grabs bag A, and makes decisions based on it... Engineer B grabs bag B and makes a different decision based on it... The entire decision cycle happens in <u>silos</u>.</li><li><strong>The Physical SSD extraction is a problem too.</strong> In May 2025, I was at the Stuttgart ADAS &amp; AV Expo, and I met a company that invented a &quot;swap&quot; disk system... 
All of this is great, but that&apos;s still the same problem of storing, copy/pasting data, etc... to a system.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/IMG_3688-ezgif.com-optimize.gif" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="560" height="315"><figcaption><span style="white-space: pre-wrap;">How to &quot;swap&quot; SSD hard drives in self-driving cars (B-PLUS Demo)</span></figcaption></figure><p><strong>In each of these problems, I noticed a <u>time</u> and <u>money</u> waste</strong>.</p><p>For example, the client reporting: &quot;<em>Engineers waited several days to weeks to access specific events due to the <strong><u>time-intensive</u></strong> process of uploading, filtering, and classifying raw data in the cloud.</em>&quot; is clearly facing a <strong><u>time</u></strong> problem, reviewing large amount of data... The other client who mentioned: &quot;<em>The full data pipeline we built &#x2014;from data capture to processing and storage&#x2014;incurred <strong><u>high cloud costs</u></strong> and consumed engineering resources</em>&quot; faces a <strong><u>money</u></strong> problem...</p><p>In this same report shared by Heex, all the companies reported improvement in their pipeline. Whether it was better decision making, more time freed, or money saved. This is why the next part is so important, so let&apos;s now focus on it: Event Driven Data Processing for autonomous cars.</p><h2 id="3-event-driven-data-management-for-autonomous-cars">3. Event Driven Data Management for autonomous cars</h2><p><strong>Back when I started learning autonomous driving algorithms</strong>, I listened to an interview from Sebastian Thrun, acknowledged as te godfather of self-driving cars, who at some point, said something that marked me: [paraphrased]: &quot;<em>With a team of 2/3, you can build a self-driving car that drives 90% of scenarios in a weekend. Then to get to 95%, it takes a few weeks, and to complete these last 5%, it takes years.</em>&quot;</p><p>This idea is called the &quot;long-tail&quot; problem.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-15.31.15--1-.jpg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1080" height="736" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/Screenshot-2025-06-27-at-15.31.15--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-27-at-15.31.15--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-15.31.15--1-.jpg 1080w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">If you drive 10 minutes, you&apos;ll see 90% of the events. In order to find edge cases, you must drive and record hours and hours of data</span></figcaption></figure><p><strong>When looking at traffic accidents involving autonomous vehicles</strong>, you often see rare events or edge cases at the root cause. 
The person wearing a stop sign t-shirt, the truck with a donkey on the trailer, the traffic sign burned during Parisian riots... all of these are unusual scenes, totally different from the empty highways cars are used to.</p><p>Some companies solve it with data generation, others with simulation, or with End-To-End Learning. Yet, the root of all evil here is data, and thus, this is what we have to change.</p><p><strong>A decade ago, the term &quot;data&quot; became king, and everybody became a Data Scientist</strong>, Data Engineer, Data Ops, Data Something. That was the case until recently, when the data revolution passed, and breakthrough innovations happened not thanks to more data, but thanks to smarter training systems (like self-supervised learning), or more powerful architectures (like transformers). &quot;More data&quot; was ultimately not the solution, and thus, we have to switch our thinking...</p><h3 id="the-edge-intelligence-era-from-data-management-to-event-management">The Edge Intelligence Era: From Data Management to Event Management</h3><p><strong>After companies have recorded a few laps of the neighborhood they drive in</strong>, recording more of this same scene doesn&apos;t make sense. Companies record more and more, just to spot the 1% of long-tail events. What if we worked on these events only, from the beginning?</p><p>It can be done, by setting up &quot;triggers&quot; in your system that act as a filter and only capture the scene when interesting events happen, such as:</p><ul><li><strong>Objects Missed</strong>: If one camera misses an object that another sensor sees</li><li><strong>Near Pedestrian Collision</strong>: If pedestrians are within 2 meters of our car, and we drive over 30km/h</li><li><strong>Human Intervention</strong>: If a human driver manually took over</li><li><strong>Shakes</strong>: If the camera physically moved due to a bump or small shock</li><li><strong>Ego Collision</strong>: If a collision with the ego vehicle happened</li><li>and so on...</li></ul><p>All of these are valid events we&apos;d like to record. The rest? When it&apos;s all smooth? Well, we already have millions of miles of it.</p>
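<p>To make the idea tangible, here is a minimal sketch of what such a trigger could look like in code. The thresholds and the frame fields are hypothetical, and this is not any specific vendor&apos;s API:</p><pre><code class="language-python"># Minimal sketch of rule-based event "triggers" (hypothetical thresholds and fields,
# not any specific vendor's API): record only when something interesting happens.
from dataclasses import dataclass

@dataclass
class Frame:
    speed_kmh: float             # ego vehicle speed
    nearest_pedestrian_m: float  # distance to the closest detected pedestrian
    driver_took_over: bool       # human intervention flag

def near_pedestrian(frame):
    # "Pedestrian within 2 meters while driving over 30 km/h"
    return frame.nearest_pedestrian_m &lt; 2.0 and frame.speed_kmh > 30.0

def human_intervention(frame):
    return frame.driver_took_over

TRIGGERS = [near_pedestrian, human_intervention]

def should_record(frame):
    # Keep the frame only if at least one trigger fires, instead of recording 100%
    return any(trigger(frame) for trigger in TRIGGERS)

print(should_record(Frame(speed_kmh=42.0, nearest_pedestrian_m=1.4, driver_took_over=False)))  # True
</code></pre>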
<p>Here is an example from Heex Technologies and their platform, which lets you set triggers:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://mintlify.s3.us-west-1.amazonaws.com/heextechnologies/public/img/welcome-to-heex-smart-data-platform/triggers.png" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="3204" height="1808"><figcaption><span style="white-space: pre-wrap;">The Heex Technology platform, allowing you to set &quot;triggers&quot;, such as collision, hard brake, and so on...</span></figcaption></figure><p>If I were to show you the 2.0 process, it&apos;d look like this:</p><ul><li>You have the <strong>same</strong> <strong>car</strong> with LiDARs generating the same 10Gb/h of data</li><li>Rather than recording all the data available, you <strong>define</strong> <strong>triggers</strong>.</li><li>You <strong>intelligently</strong> record the events, like the near pedestrian collision, and not all the data</li><li>You get <strong>instant notifications</strong>, <strong>labels</strong>, and can do real-time decision making</li></ul><p>Seems smarter, doesn&apos;t it?</p><p>Now that you have this in mind, I&apos;d like to show you the last era...</p><h3 id="the-autonomous-era-ai-does-it-for-you">The Autonomous Era: AI does it for you</h3><p><strong>The next step is to create algorithms that do it automatically for us.</strong> For example, Tesla patented<a href="https://xilhylujaogys6v6dwfuqa5wrtivfkprrhmf6w7eh7zo46hdgvmq.arweave.net/uhZ8LokDjYl6vh2LSAO2jNFSqfGJ2F9b5D_y7njjNVk" rel="noopener noreferrer"><strong> a concept called trigger classifiers</strong></a><strong>. </strong>The idea is to train their<strong> </strong><a href="https://www.thinkautonomous.ai/blog/how-tesla-autopilot-works/" rel="noopener noreferrer"><strong>HydraNet</strong></a> backbone to classify whether the general scene it&apos;s looking at contains unusual events or not. If it does, let&apos;s say above a certain confidence score, then the model triggers a warning.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-12.19.34.jpg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1698" height="1232" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/Screenshot-2025-06-27-at-12.19.34.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-27-at-12.19.34.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-27-at-12.19.34.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/06/Screenshot-2025-06-27-at-12.19.34.jpg 1698w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tesla&apos;s Trigger Classifiers: The Backbone (which focuses on the general scene) outputs a classification of the nature of the scene it&apos;s looking at</span></figcaption></figure>
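<p>Here is a minimal sketch of that idea; this is my own illustration rather than Tesla&apos;s actual implementation. A small classification head sits on top of the backbone features, and a confidence threshold decides whether the clip gets flagged:</p><pre><code class="language-python"># Minimal sketch of a learned "trigger classifier" (my own illustration, not Tesla's
# actual implementation): a small head on top of backbone features flags unusual scenes.
import torch
import torch.nn as nn

class TriggerClassifier(nn.Module):
    def __init__(self, feature_dim=512, num_event_types=4):
        super().__init__()
        # num_event_types is hypothetical (e.g. construction zone, debris, cut-in, other)
        self.head = nn.Linear(feature_dim, num_event_types)

    def forward(self, backbone_features):
        # One confidence score per event type
        return torch.sigmoid(self.head(backbone_features))

classifier = TriggerClassifier()
features = torch.randn(1, 512)        # stand-in for features from the perception backbone
confidences = classifier(features)
if confidences.max() > 0.8:           # hypothetical confidence threshold
    print("Unusual scene: flag this clip for upload and labeling")
</code></pre>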
<p><strong>Whether you&apos;re working on spreadsheets or building autonomous vehicle technology, automating manual tasks like labelling or searching for data makes sense.</strong> In this case, just like in the Edge Intelligence Era, the events are captured live while driving, and not after.</p><h4 id="the-20-vision">The 2.0 Vision</h4><p><strong>This goes with the &quot;2.0&quot; vision of self-driving cars that companies now define</strong>. A vision driven by Deep Learning first, where data matters, but where more data isn&apos;t the solution. In the 2.0 vision, quality is better than quantity; contextual intelligence is needed, learning should be real-time, and training should be done on relevant data.</p><p>If the 1.0 vision involved heavy test vehicles and modular architectures, the 2.0 vision is about AI &amp; efficiency.</p><p>Now, let&apos;s see an example of a company specialized in this...</p><h2 id="example-how-heex-technologies-turns-data-into-event-management">Example: How Heex Technologies turns Data into Event Management</h2><figure class="kg-card kg-image-card"><img src="https://heex.cdn.prismic.io/heex/65cd1d149be9a5b998b5d409_heex-light.svg?rect=0%2C0%2C100%2C36&amp;w=256&amp;fit=max" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="100" height="36"></figure><p><strong>One of the companies that captured this vision the best is </strong><a href="https://www.heex.io/en-gb/smarter-data-faster-decisions" rel="noreferrer"><strong>Heex Technologies</strong></a>. They built a SaaS platform that implements exactly these ideas of &quot;triggers&quot; &#x2014;&#xA0;their motto is that rather than focusing on the data, they focus on events. I already showed you the triggers; here they are in action:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/analytics--1-.jpg" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1080" height="834" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/analytics--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/analytics--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/06/analytics--1-.jpg 1080w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Heex&apos;s Visualization Platform shows you the critical events happening, where they happened, and gives you full power to solve the long tail problem</span></figcaption></figure><p>Let&apos;s look at their pipeline, which you&apos;ll notice also works backwards &#x2014; once the bag is generated:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/06/Z9LeRRsAHJWomftg_Screenshot2025-03-13at13.webp" class="kg-image" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)" loading="lazy" width="1312" height="739" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/06/Z9LeRRsAHJWomftg_Screenshot2025-03-13at13.webp 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/06/Z9LeRRsAHJWomftg_Screenshot2025-03-13at13.webp 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/06/Z9LeRRsAHJWomftg_Screenshot2025-03-13at13.webp 1312w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">In this example, we take an existing &quot;dumb&quot; ROS Bag and turn it into a &quot;smart&quot; bag</span></figcaption></figure><p><strong>From a heavy bag, we get a smart bag</strong>. The data is definitely smarter when it is automatically annotated, categorized, and when relevant events are flagged. We can then re-inject this data into the training pipeline, without having to worry about the rest of the dataset.</p>
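<p>Conceptually, turning a &quot;dumb&quot; bag into a &quot;smart&quot; one can be as simple as keeping only the messages around flagged events. Here is a minimal sketch using the ROS 1 Python API; the event timestamps, window size, and file names are hypothetical, and this is my own illustration rather than Heex&apos;s actual pipeline:</p><pre><code class="language-python"># Minimal sketch: keep only the messages around flagged events. Event timestamps,
# window and file names are hypothetical; this is not Heex's actual pipeline.
import rosbag

EVENT_TIMES = [120.0, 842.5]   # seconds from the start of the recording
WINDOW = 10.0                  # keep +/- 10 s of context around each event

with rosbag.Bag("full_drive.bag") as src, rosbag.Bag("smart_drive.bag", "w") as dst:
    start = src.get_start_time()
    for topic, msg, t in src.read_messages():
        elapsed = t.to_sec() - start
        if any(abs(elapsed - event) &lt;= WINDOW for event in EVENT_TIMES):
            dst.write(topic, msg, t)
</code></pre>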
<p><strong>As an entrepreneur myself, I can only admire the focus on one specific and painful problem like this one</strong>. When you can anticipate customer needs, and enable automakers and automotive engineers to move away from a complex process to focus on their core job (<a href="https://courses.thinkautonomous.ai/self-driving-cars" rel="noopener noreferrer">self driving technology</a>)... you win!</p><p>Alright, let&apos;s do a summary and see what to do next:</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>Collecting data is essential to train AI models and develop self-driving vehicles</strong>. Yet, &quot;more data&quot; is not the solution to create breakthroughs and solve the &quot;long tail&quot; problem, which remains a significant challenge.</li><li><strong>The first Era of data processing is the Manual Era,</strong> in which we record and process everything manually. (1.0)</li><li><strong>The second era (1.5) is the cloud version</strong>, in which you work with data lakes and build a real &quot;chain&quot; that contains DataOps, MLOps, ValidationOps, and so on...</li><li><strong>The third era moves to the 2.0.</strong> It&apos;s where we stop obsessing over the data, and focus on events. We can use triggers and platforms like Heex Technologies to do it.</li><li><strong>The fourth era is the AI Era</strong>. (2+) This is where we have AI automatically find events, and train itself continuously on these.</li></ul><p>Which solution is right for you? In reality, they can all work. A small startup can work manually until they find their hard problem to solve and have a budget to invest in data lakes... Companies can work with data lakes, but for bigger fleets, it&apos;d make much more sense to think in terms of events instead.</p><h3 id="next-steps">Next Steps</h3><p><strong>&#xA0;If you realise you have these problems of recording everything</strong>, having your data stay a bit &#xAB;&#xA0;dumb&#xA0;&#xBB;, and would like to know exactly how<u> to stop recording everything</u> by this afternoon (without losing the important information)... </p><p><strong>... Then I&#x2019;d recommend checking out Heex&#x2019;s free discovery quiz</strong>, which will tell you exactly what you&#x2019;re doing wrong today, and (based on your answers) show you what to do this afternoon to save hours of recording, data processing, etc...</p><p>It&#x2019;s free, and you can get access below:</p><div class="kg-card kg-product-card">
            <div class="kg-product-card-container">
                <img src="https://www.thinkautonomous.ai/blog/content/images/2025/07/Screenshot-2025-07-29-at-09.59.36.jpg" width="2720" height="632" class="kg-product-card-image" loading="lazy" alt="How to stop recording 100% of what self-driving cars sees (Introduction to Event Driven Automotive Data Processing)">
                <div class="kg-product-card-title-container">
                    <h4 class="kg-product-card-title"><span style="white-space: pre-wrap;">Heex Free Discovery Quiz</span></h4>
                </div>
                

                <div class="kg-product-card-description"><p><span style="white-space: pre-wrap;">Is your data strategy a silent obstacle?</span></p></div>
                
                    <a href="https://forms.gle/LimFvrHtb5zjqJUv7" class="kg-product-card-button kg-product-card-btn-accent" target="_blank" rel="noopener noreferrer"><span>Take the Quiz</span></a>
                
            </div>
        </div><p>You can also take a look at Heex&apos;s product here: <a href="https://www.heex.io/en-gb/smarter-data-faster-decisions">https://www.heex.io/en-gb/smarter-data-faster-decisions</a></p>]]></content:encoded></item><item><title><![CDATA[Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones]]></title><description><![CDATA[Discover an exclusive excerpt from my interview with Shield AI, a US-based company in the autonomous defense industry. You'll learn about infiltration drones, visual SLAM, ViDARs, and V-BAT VTOL systems.]]></description><link>https://www.thinkautonomous.ai/blog/shield-ai/</link><guid isPermaLink="false">690b2bf9bad329532556f25e</guid><category><![CDATA[field interviews]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Wed, 23 Jul 2025 22:00:00 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/11/shield-ai.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/shield-ai.jpg" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"><p><strong>How much impact do you believe your job has?</strong> Is your job saving someone time? Or money? Or... their life? Well, what if you worked on projects that saved people&apos;s lives? Such companies exist, in self-driving cars, in healthcare, and in the case of this article... in <strong>Autonomous Defense!</strong></p><p>This summer, I interviewed Vibhav Ganesh from&#xA0;<a href="https://www.shield.ai/" rel="noreferrer"><strong>Shield</strong> <strong>AI</strong></a>, a U.S.-based defense technology company that develops autonomous systems for military and government use. </p><p>I would write a big paragraph here, but let me instead show you a quick sample from the interview...</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2712;&#xFE0F;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Vibhav Ganesh is the&#xA0;Director of Engineering</strong></b>, past Chief of Staff to the CTO, and Employee #20 of Shield AI.<br><br><b><strong style="white-space: pre-wrap;">Vibhav has played a pivotal role in the company&apos;s growth and innovation. </strong></b>With a background in visual inertial odometry and SLAM, he has been at the forefront of developing autonomous systems like the Nova 2 quadcopter.</div></div><p>Let&apos;s read his intro to Shield AI and to their core products: the V-BAT and the ViDAR.</p>
<!--kg-card-begin: html-->
<iframe src="https://www.linkedin.com/embed/feed/update/urn:li:ugcPost:7353401032464293888?collapsed=1" height="550" width="504" frameborder="0" allowfullscreen title="Embedded post"></iframe>
<!--kg-card-end: html-->
<p>There are a lot of things to note about their products: <strong>the V-BAT can last 10 hours,</strong> which is a technological achievement itself, thanks to V-TOL (vertical takeoff and landing)... <strong>ViDAR</strong> is also a very interesting product, which stands for Visual Detection And Ranging... and HiveMind (not shown here) is their AI, or as they call it, &quot;The World&apos;s Best AI Pilot&quot;.</p><p>Let me take you to the v-BAT first, as it&apos;s the core product, by showing you this LinkedIn post we did together, where Vibhav Ganesh introduces us to Shield AI.</p><p><strong>Together, we recorded an exclusive Fragment of&#xA0;</strong><a href="https://www.thinkautonomous.ai/the-edgeneers-land" rel="noreferrer"><strong>The Edgeneer&apos;s Land</strong></a><strong>, </strong>my community membership experience, in which he takes us through Shield AI.&#xA0;What is autonomous defense? What are the main technologies involved? What is the range of products?</p><p><strong>In this post, I&apos;d like to give you a small sample of that interview</strong>, highlighting a very interesting moment where Vibhav talked about infiltration drones.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Before we begin, do you like field interviews?</strong></b> I am bringing new guests to my membership every single month, and when you join my daily emails, you can not only be aware of when these interviews get released, you can also get the opportunity to access the complete training we build for them inside our membership.<br><br>If you&apos;d like to get started, <a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer">you can receive the emails here</a>.</div></div><hr><h2 id="inside-shield-ais-tactical-infiltration-drones">Inside Shield AI&apos;s Tactical Infiltration Drones</h2>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1133789552?badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="Shield AI Tactical Infiltration Drones"></iframe>
<!--kg-card-end: html-->
<div class="kg-card kg-toggle-card" data-kg-toggle-state="close">
            <div class="kg-toggle-heading">
                <h4 class="kg-toggle-heading-text"><span style="white-space: pre-wrap;">Read the transcript</span></h4>
                <button class="kg-toggle-card-icon" aria-label="Expand toggle to read content">
                    <svg id="Regular" xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                        <path class="cls-1" d="M23.25,7.311,12.53,18.03a.749.749,0,0,1-1.06,0L.75,7.311"/>
                    </svg>
                </button>
            </div>
            <div class="kg-toggle-content"><p><b><strong style="white-space: pre-wrap;">JEREMY</strong></b><span style="white-space: pre-wrap;">: Okay Vibhav. I&apos;d like to start with the quadcopter. What you call Nova 2. Can you give us an overview of how it works?</span><br><br><b><strong style="white-space: pre-wrap;">VIBHAV</strong></b><span style="white-space: pre-wrap;">:&#xA0;Yeah, I&apos;d love to. So just to kind of understand where we&apos;re coming from, I&apos;ll give a little backstory of Shield, and talk a little bit about how I evolved in it, and then how Shield has evolved over that that time as well.</span><br><br><b><strong style="white-space: pre-wrap;">So the entire existence of Shield,&#xA0;our mission has been to protect serve, members and civilians using intelligent systems. </strong></b><span style="white-space: pre-wrap;">And we do that by providing&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">platforms</span></u><span style="white-space: pre-wrap;">&#xA0;that are capable of operating&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">at the edge</span></u><span style="white-space: pre-wrap;">, providing&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">software</span></u><span style="white-space: pre-wrap;">&#xA0;that allows different kinds of platforms to be resilient to comms and GPS denial and operate in a really, really sticky and dangerous environments.</span><br><br><b><strong style="white-space: pre-wrap;">And we believe the greatest victory requires no war, </strong></b><span style="white-space: pre-wrap;">and we achieve this by equipping the US and its allies with the ability to see and act anywhere at any time.</span><br><br><span style="white-space: pre-wrap;">That started back in 2016/2015 in very niche ConOps, specifically indoor ConOps. There, we&apos;re focused on kind of building&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">clearance.</span></u><br><br><b><strong style="white-space: pre-wrap;">What our founder, Brandon came back from his deployments and saw as kind of lack of technology really servicing the members that were protecting us</strong></b><span style="white-space: pre-wrap;">, and&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">particularly in areas where they were going in kind of blind to buildings</span></u><span style="white-space: pre-wrap;">, if you can imagine you had, you know, in the Middle East conflicts, there were just these buildings that were there. You had no idea what&apos;s happening inside them.</span><br><br><b><strong style="white-space: pre-wrap;">And in order to kind of under make sure the city was safe</strong></b><span style="white-space: pre-wrap;">, you have to go inside and verify that there was no explosives or militants in there. 
And what they used to do was send people through this because there was no robots or technology capable to do that.</span><br><br><span style="white-space: pre-wrap;">And if you can imagine yourself doing that, it&apos;s extremely scary going in blind, not knowing what&apos;s going to happen, what&apos;s going to be on the other side of that door.</span><br><br><span style="white-space: pre-wrap;">And so what he wanted to create is a system that could do that for the operator, instead of having the person go do that.</span></p><p><b><strong style="white-space: pre-wrap;">And so the quadcopter Nova 1 was born out of that idea of, how do you provide information before you send a person through?</strong></b><br><br><span style="white-space: pre-wrap;">And the goal there wasn&apos;t necessarily to build a quadcopter, but it was just the first apple. Just the first application of autonomy in the defense space that was very, very tangible and very easy for us to apply ourselves to. And so we just designed and built a state of the art indoor autonomous surveillance device.</span><br><br><b><strong style="white-space: pre-wrap;">Nova one was ahead of its league in many different areas.</strong></b><span style="white-space: pre-wrap;"> One of the things that really stuck out to me from coming from academia, before I was doing a master&apos;s in robotics at CMU, and we saw a lot of really cool applications of autonomy there too, but the hardware systems were not that capable.&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">Like the flight time was 3 to 5 minutes.</span></u><span style="white-space: pre-wrap;">&#xA0;The processing was very slow.&#xA0;You could operate very slowly.&#xA0;</span><u><span class="underline" style="white-space: pre-wrap;">Most of these videos you were seeing back then were sped up by 8x or 7x just to make sure they look compelling</span></u><span style="white-space: pre-wrap;">.</span><br><br><span style="white-space: pre-wrap;">But what Shield had accomplished was real-time exploration at staggering speeds.&#xA0;At one point, we did a comparison of how fast can a quadcopter clear an environment compared to six Navy SEALs, and the quadcopter actually finished in a third of the time compared to those.</span><br><br><b><strong style="white-space: pre-wrap;">JEREMY</strong></b><span style="white-space: pre-wrap;">: Wow! Okay, I see!</span><br><br><b><strong style="white-space: pre-wrap;">VIBHAV</strong></b><span style="white-space: pre-wrap;">: Isn&apos;t that crazy, just how fast this thing was operating. And back then we were, you know, at a limited, limited sensor suite. So we had a&#xA0;2D scan LiDAR, we had a&#xA0;camera, we had some&#xA0;sonars&#xA0;and an&#xA0;Intel Neural Compute Stick. So it was very limited hardware back then, because it&apos;s 2017 but was able to actually accomplish this mission.</span><br><br><b><strong style="white-space: pre-wrap;">So as long as there was a window or door for to fly in</strong></b><span style="white-space: pre-wrap;">, a human operator which would enter, it would enter the vicinity and kind of say, this is the building I want to enter. And from then on, it would be fully autonomous, no comps required. It would find an entrance.</span></p></div>
        </div><p><strong>Impressive, isn&apos;t it? </strong>What I really love about it is that it&apos;s <u>down to earth.</u><strong> </strong>I could see myself assembling a drone kit,&#xA0;adding a camera and a 2D LiDAR, and starting to experiment with Visual SLAM projects to map a room. This is basically what Shield AI did when they got started. Except that their drone was (1) targeted at a specific client and (2) better than all the competition.</p><p>There are a few insights I&apos;d like to share with you, from Vibhav:</p><h3 id="1-self-driving-car-autonomous-transfer-doesnt-work-as-wed-think">1) Self-Driving Car &gt; Drone transfer doesn&apos;t work as we&apos;d think</h3><p>Now, here is something important to note:</p><blockquote class="kg-blockquote-alt"><strong>A lot of what you learn in autonomous robots CANNOT simply be transferred to drones.</strong></blockquote><p><strong>I did think that it was a matter of copy and paste</strong>. But I understood I got it wrong while making this episode, especially when Vibhav Ganesh told me that their&#xA0;drones don&apos;t have LiDARs, and fly over seas, deserts, no man&apos;s land, between mountains, across open countryside, with 3D constraints, and no map!</p><blockquote>I thought about it for a minute, and I realized...&#xA0;<strong>&quot;Wait, it&apos;s absolutely NOT like autonomous cars!&quot;</strong></blockquote><p><strong>And in fact, when you start looking into autonomous drone architectures</strong>, they absolutely don&apos;t look like self-driving car architectures! For example, at Shield AI, they have a Control Station, an RTOS, a ViDAR, but also a Flight Controller powered by frameworks like PX4 and Maven. This is an entire set of libraries to learn.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/unnamed.jpg" class="kg-image" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones" loading="lazy" width="720" height="405" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/11/unnamed.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/11/unnamed.jpg 720w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The external architecture of Shield AI&apos;s components</span></figcaption></figure><p>From there, here is a second insight:</p><h3 id="2-visual-slam-is-mostly-used">2) Visual SLAM is mostly used</h3><p>Coming back to the idea that drones are NOT like self-driving cars... the other main difference is that they use no map. So without a map, and with just a camera, they have no choice but to use...<strong>Visual SLAM!</strong> </p><p>And Vibhav explains really well what kind of SLAM they&apos;re using, how they implement the mapping even though there is no starting point, and so on. Here is a sample of a vSLAM project I&apos;ve tested with drones:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/68747470733a2f2f692e696d6775722e636f6d2f554b4c7444374c2e676966-ezgif.com-optimize.gif" class="kg-image" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones" loading="lazy" width="600" height="337" srcset="https://www.thinkautonomous.ai/blog/content/images/2025/11/68747470733a2f2f692e696d6775722e636f6d2f554b4c7444374c2e676966-ezgif.com-optimize.gif 600w"></figure><p>This is the nitty-gritty of Shield AI&apos;s work. 
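<p>To make that concrete: in a drone stack, the map coming out of vSLAM feeds a motion planner, and the planner&apos;s waypoints are sent to the flight controller. Here is a toy Python sketch of that loop, purely illustrative and nothing to do with Shield AI&apos;s actual code; the grid, the planner, and the &quot;commands&quot; are all made up for the example:</p><pre><code class="language-python"># Toy sketch: a 2D occupancy grid standing in for the SLAM map,
# a breadth-first planner, and a fake "flight controller" that prints waypoints.
from collections import deque

import numpy as np

def plan_path(occupancy, start, goal):
    """Return a list of grid cells from start to goal (4-connected BFS)."""
    rows, cols = occupancy.shape
    parents = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nr in range(rows) and nc in range(cols) \
                    and occupancy[nr, nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    path, cell = [], goal
    while cell is not None:          # walk back from goal to start
        path.append(cell)
        cell = parents[cell]
    return path[::-1]

# "SLAM map": 0 = free space, 1 = obstacle (in reality this comes from vSLAM)
grid = np.zeros((5, 5), dtype=int)
grid[2, 1:4] = 1                     # a wall in the middle of the room

for waypoint in plan_path(grid, start=(0, 0), goal=(4, 4)):
    print("fly to", waypoint)        # a real stack would send this to PX4
</code></pre>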
Once you build a SLAM MAP, you can then feed that map to the Motion Planner, which sends a flight order to the drone. If you&apos;d like more insights on this technology, I highly recommend my <a href="https://www.thinkautonomous.ai/blog/visual-slam/" rel="noreferrer">vSLAM</a> article.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.thinkautonomous.ai/blog/visual-slam"><div class="kg-bookmark-content"><div class="kg-bookmark-title">The 6 Components of a Visual SLAM Algorithm</div><div class="kg-bookmark-description">How does Visual SLAM work? How is it different from normal SLAM? What are the 6 main steps of a Visual SLAM system? Let&#x2019;s find out!</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.thinkautonomous.ai/blog/content/images/size/w256h256/2023/01/favicon.png" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"><span class="kg-bookmark-author">Read from the most advanced autonomous tech blog</span><span class="kg-bookmark-publisher">Jeremy Cohen</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.thinkautonomous.ai/blog/content/images/2024/03/visual-slam.jpg" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"></div></a></figure><p>Okay, would you like to see some samples?</p><h2 id="shield-ai-in-action">Shield AI in Action</h2><p>Let&apos;s take a look at 3 samples here:</p><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/ScreenRecording2025-07-24at00.09.54-ezgif.com-optimize.gif" width="496" height="294" loading="lazy" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"></div><div class="kg-gallery-image"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/ScreenRecording2025-07-22at12.46.33-ezgif.com-optimize-1.gif" width="400" height="225" loading="lazy" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"></div><div class="kg-gallery-image"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/11/f10teaser-ezgif.com-optimize.gif" width="480" height="270" loading="lazy" alt="Shield AI: ViDAR, V-BAT, and Tactical Infiltration Drones"></div></div></div><figcaption><p><span style="white-space: pre-wrap;">courtesy of </span><a href="https://www.shield.ai/" rel="noreferrer"><b><strong style="white-space: pre-wrap;">Shield</strong></b> <b><strong style="white-space: pre-wrap;">AI</strong></b></a></p></figcaption></figure><ul><li><strong>On the left, you can see drones being launched</strong>. These drones cannot do the &quot;vertical takeoff and landing&quot;. They are projected to the air by launchers and then fly like a plane. </li><li><strong>In the middle, you can see the tactical quadcopters we discussed</strong>. Notice how they use vSLAM at the end of the shot.</li><li><strong>On the right, you can see a mission of the v-BAT </strong>searching for a vessel in a sea canal.</li></ul><p>This is really cutting-edge, and totally applied what we are building in the autonomous tech space. Alright, time to wrap up!</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>The defense industry is extremely active</strong>. 
Hundreds of companies work on new generations of autonomous drones, anti-missile detectors, infiltration equipment, RADARs, and more...</li><li><strong>Shield AI is an active player in the defense space,</strong> with a range of products, such as the v-BAT, the ViDAR, and HiveMind.</li><li><strong>Shield AI started with NOVA 1,</strong> a quadcopter that could infiltrate buildings, build maps, and survey them, without the need to send humans inside. This helped prevent human losses due to buildings collapsing or people being trapped.</li><li><strong>Besides being safer, infiltration drones are also more efficient</strong>. Shield AI tested their drone against 6 Navy SEALs clearing a building, and it finished in a third of the time.</li><li><strong>The transfer of autonomous car/robot technology to autonomous drones isn&apos;t as simple as we&apos;d think</strong>. Architectures are different, products are different, regions/environments are different, and even the underlying technologies and algorithms change.</li><li> On the other hand, some technologies really do apply well to autonomous drones, such as <a href="https://www.thinkautonomous.ai/blog/visual-slam/" rel="noreferrer">Visual SLAM.</a></li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E8;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interested in these interviews?</strong></b> I am bringing new guests to my membership every single month, and when you join my daily emails, you will not only know when these interviews get released, you will also get the opportunity to access the complete training we build for them inside our membership.<br><br>If you&apos;d like to get started, <a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer">you can receive the emails here</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)]]></title><description><![CDATA[Medical Image Segmentation is one of the most important applications of Deep Learning in healthcare. Yet, most people only know 2D chest X-ray segmentation. What about the 3D Scans? What about Foundation Models?

In this article, we're going to dive into it!]]></description><link>https://www.thinkautonomous.ai/blog/medical-image-segmentation/</link><guid isPermaLink="false">67d05c36c8f3bf93bd1872ad</guid><category><![CDATA[deep learning]]></category><category><![CDATA[computer vision]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Wed, 12 Mar 2025 11:58:06 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/03/medical-image-segmentation-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/medical-image-segmentation-1.webp" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)"><p><strong>On September 23, 1999, NASA&#x2019;s Mars Climate Orbiter&#x2014;a $125 million spacecraft</strong>&#x2014;was set to enter Mars&apos; orbit to study its climate and atmosphere. But just as it approached the planet, something went terribly wrong. Instead of entering a stable orbit, the spacecraft plunged into Mars&#x2019; atmosphere and was destroyed.</p><p><strong>After analysis, NASA found a unit mismatch: their Jet Propulser used metric units (newtons), while the spacecraft they got from Lockheed Martin used imperial units (pound-force). </strong>This caused navigation errors, making the spacecraft descend far too low into the Martian atmosphere; and causing a 125m$ loss.</p><p><strong>Human errors happen every day in all sorts of domains</strong>. In 2016, an alarming report from Johns Hopkins estimated that medical errors (including misdiagnoses) cause over 250,000 deaths annually in the U.S., making them the third leading cause of death.<strong> </strong>Many are due to errors in analysis of medical images, such as MRIs, X-Rays, CT Stans, and more.</p><p>In this article, I would like to show you how Medical <a href="https://www.thinkautonomous.ai/blog/image-segmentation-use-cases/" rel="noopener noreferrer">Image Segmentation</a> can be used to counter this problem, and I&apos;ll do it in 3 points:</p><ol><li>2D Medical Image Segmentation</li><li>3D Medical Image Segmentation</li><li>Examples/Demo</li></ol><p>Let&apos;s get started...</p><h2 id="intro-to-2d-medical-image-segmentation">Intro to 2D Medical Image Segmentation</h2><p><strong>In 2019, I hosted the biggest AI Healthcare hackathon ever held</strong>,<strong> happening simultaneously over 20 cities!</strong> The goal at the time was to mix companies, healthcare groups, and engineers to build healthcare solutions using Deep Learning. After the 48 hours of coding, the winning team would win <strong>10,000 USD</strong>, the second <strong>4,000 USD</strong>, and then team 3, 4, 5, and 6 would win <strong>2,500 USD each</strong>!</p><p><strong>Great computer vision projects happened, </strong>and in fact, Paris (my city) finished the competition #2 via <a href="https://www.spotimplant.com/en/" rel="noopener noreferrer"><strong>Spot Implant</strong></a><strong>, </strong>a Shazam for Tooth Implants project that then became a startup. At the time, everybody was working on 2D Images. We had projects like Skin Melanoma detection, X-Ray segmentation, Brain Segmentation, and more...</p><p>Let me show you a few <u>tasks</u> in Medical Image Segmentation, and then we&apos;ll look at <u>algorithms</u>.</p><h3 id="2d-medical-image-segmentation-tasks">2D Medical Image Segmentation Tasks</h3><h4 id="x-ray-the-most-common">X-Ray (the most common)</h4><p><strong>First, we have X-Rays. 
X-Rays are the 2D representation of a body. </strong>We often see bones and organs there, and it&apos;s the most common image you&apos;ll find in Deep Learning x Healthcare. Using medical image segmentation, we can assist doctors in finding <u>bone fractures,</u> <u>lung diseases</u>, and other abnormalities. It can also help in screening large volumes of X-rays for <u>tuberculosis</u>, which is particularly useful in low-income countries with limited access to radiologists.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/74554810-8960-4331-a72a-44b6265653dc--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1182" height="384" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/74554810-8960-4331-a72a-44b6265653dc--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/74554810-8960-4331-a72a-44b6265653dc--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/03/74554810-8960-4331-a72a-44b6265653dc--1-.jpg 1182w" sizes="(min-width: 720px) 720px"></figure><p>This really is the most known among Deep Learning Engineers. I would like to show you other applications of segmentation...</p><h4 id="dermoscopy-segmentation-skin-lesion-segmentation">Dermoscopy Segmentation (skin lesion segmentation)</h4><p><strong>Dermoscopy segmentation was the health hackathon&apos;s top pick</strong>. It&apos;s all about using medical image segmentation to spot and separate skin lesions in dermoscopic images. By applying deep learning on medical images, we can quickly and accurately detect skin conditions like melanoma. This helps dermatologists diagnose and treat patients faster and manage large amounts of data more efficiently.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/23add975-8106-4eba-88b0-d36dc40790ea.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1182" height="384" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/23add975-8106-4eba-88b0-d36dc40790ea.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/23add975-8106-4eba-88b0-d36dc40790ea.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/03/23add975-8106-4eba-88b0-d36dc40790ea.jpg 1182w" sizes="(min-width: 720px) 720px"></figure><p>Let&apos;s see one or two more...</p><h4 id="mammography-segmentation">Mammography Segmentation</h4><p><strong>Mammograms are specialized X-ray images designed to reveal the inner structure of breast tissue. </strong>These images typically come in a flat, 2D format, capturing the breast from multiple angles to ensure a comprehensive view. The details in mammograms can show everything from dense tissue patterns to potential abnormalities like lumps or calcifications.</p><p><strong>Look at the image below: see how the role of a doctor/radiologist is to find these highlighted areas</strong>. 
The role of image segmentation is to assist the doctor, so he&apos;s not alone doing that high stake task of spotting problems (of course, it goes without saying that doctors also do much more than spotting, from understanding how bad a calcification can be, to finding the treatment, and so on...).</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/6726c6b3-6a6b-44b4-9566-de42cdf1c1f6.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="909" height="427" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/6726c6b3-6a6b-44b4-9566-de42cdf1c1f6.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/6726c6b3-6a6b-44b4-9566-de42cdf1c1f6.jpg 909w" sizes="(min-width: 720px) 720px"></figure><h4 id="other-types-ultrasound-%F0%9F%91%B6%F0%9F%8F%BD-endoscopy-%F0%9F%A4%A2-and-more">Other Types: Ultrasound &#x1F476;&#x1F3FD;, Endoscopy &#x1F922;, and more...</h4><p>We just saw 3 types: X-Rays, Dermoscopy, and Mammography. There are other types, such as ultrasound images (baby for examples), which can be 2D or 3D; or endoscopy, and more... The image below shows many 2D segmentation applications:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/298777510-a8d94b4d-0221-4d09-a43a-1251842487ee1-ezgif.com-optimize.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="800" height="435" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/298777510-a8d94b4d-0221-4d09-a43a-1251842487ee1-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/298777510-a8d94b4d-0221-4d09-a43a-1251842487ee1-ezgif.com-optimize.gif 800w" sizes="(min-width: 720px) 720px"></figure><p>So how do you build the segmentation results? What do you use? Let&apos;s take a look...</p><h3 id="2d-medical-image-segmentation-models">2D Medical Image Segmentation Models</h3><p><strong>Ever heard of UNet? </strong>You know, that 2015 model subtitled &quot;Convolutional Networks for Biomedical Image Segmentation&quot;. Well, it may be from 2015, but it&apos;s a great way to start! 
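<p>If you have never implemented that &quot;U&quot;, the core idea is small enough to sketch: an encoder that downsamples, a decoder that upsamples, and a skip connection between the two. Here is a deliberately tiny PyTorch sketch, just one level deep (a real UNet stacks several levels and many more channels), to show the mechanics:</p><pre><code class="language-python"># Minimal one-level "U": encoder, bottleneck, decoder with a skip connection.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 32)            # encoder level
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = double_conv(64, 32)               # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        skip = self.enc(x)                           # (B, 32, H, W)
        x = self.bottleneck(self.pool(skip))         # (B, 64, H/2, W/2)
        x = self.up(x)                               # back to (B, 32, H, W)
        x = self.dec(torch.cat([x, skip], dim=1))    # the skip connection
        return self.head(x)                          # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 1, 128, 128))
print(logits.shape)                                  # torch.Size([1, 2, 128, 128])
</code></pre>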
<p>In fact, there have been lots of improvements on <a href="https://arxiv.org/pdf/1505.04597" rel="noopener noreferrer"><strong>UNet</strong></a>, from <a href="https://arxiv.org/pdf/1807.10165v1" rel="noopener noreferrer">UNet++,</a> to <a href="https://arxiv.org/pdf/2102.04306" rel="noopener noreferrer">Trans-UNet</a> and <a href="https://arxiv.org/pdf/2105.05537" rel="noopener noreferrer">Swin-UNet,</a> all keeping that &quot;U&quot; shape, but using different pattern recognition techniques like Swin Transformers, CNNs, etc...</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/UNEt-Family.001.jpeg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/UNEt-Family.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/UNEt-Family.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/03/UNEt-Family.001.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/UNEt-Family.001.jpeg 1920w" sizes="(min-width: 720px) 720px"></figure><p>This is one family of semantic image segmentation algorithms, and here is what the results look like:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/image--1---1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="966" height="770" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/image--1---1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/image--1---1-.jpg 966w" sizes="(min-width: 720px) 720px"></figure><p>To get true numbers, the Dice Similarity Coefficient (DSC) and the average Hausdorff Distance (HD) are used as evaluation metrics for these algorithms.</p>
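<p>The Dice score in particular is worth knowing by heart: it is just twice the overlap between the predicted mask and the ground-truth mask, divided by the sum of their sizes. A minimal NumPy sketch, assuming two binary masks of the same shape:</p><pre><code class="language-python"># Dice Similarity Coefficient between two binary masks (NumPy sketch).
import numpy as np

def dice_score(pred, target, eps=1e-7):
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((256, 256), dtype=np.uint8)
target = np.zeros((256, 256), dtype=np.uint8)
pred[50:150, 50:150] = 1     # predicted organ mask
target[60:160, 60:160] = 1   # ground-truth organ mask, shifted a bit
print(round(float(dice_score(pred, target)), 2))  # 0.81 for this overlap
</code></pre>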
<p><strong>These are great, but what happens when you don&apos;t have millions of labeled data?</strong> In healthcare, getting access to labeled, free-to-use data isn&apos;t easy; especially for certain types of diseases that are specific to certain hospitals, and so on... In these cases, you can use more &quot;foundational&quot; semantic segmentation models such as <strong>SAM (Segment Anything) or SAM2</strong>. These have been trained using Self-Supervised Learning on &quot;the entire internet&quot;, and are thus supposed to generalize better.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1324" height="846" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 1324w" sizes="(min-width: 720px) 720px"></figure><p><strong>For example, </strong><a href="https://github.com/bowang-lab/MedSAM" rel="noopener noreferrer"><strong>MedSAM</strong></a><strong>, a medical version of SAM (Segment Anything), is what I used for the images above</strong>. It&apos;s the regular SAM, but tweaked for medical image segmentation, to boost segmentation performance. The model&apos;s performance is quite high, and we get a top-notch <a href="https://www.thinkautonomous.ai/blog/computer-vision-applications-in-self-driving-cars/" rel="noopener noreferrer">Computer Vision</a> project using image segmentation... It can even take a prompt, such as a region-of-interest bounding box, and return the segmented masks:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/298777510-a8d94b4d-0221-4d09-a43a-1251842487ee-ezgif.com-optimize.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="259" height="244"></figure><p>So this is for the first part on 2D images... Now what are 3D images?</p><h2 id="3d-medical-image-segmentation-ct-scans-mris">3D Medical Image Segmentation: CT Scans &amp; MRIs</h2><p>Now come 3D images! For this part, I&apos;ll talk about the two use cases (CT Scans &amp; MRIs) and discuss the algorithms together.</p><h3 id="ct-scans-use-cases-algorithms">CT Scans: Use Cases &amp; Algorithms</h3><h4 id="use-cases-for-ct-scans-3d-representation">Use Cases for CT Scans &amp; 3D Representation</h4><p>In the <a href="https://flare22.grand-challenge.org/Dataset/" rel="noopener noreferrer"><strong>FLARE 2022 dataset</strong></a> (Fast and Low-resource semi-supervised Abdominal oRgan sEgmentation), we get access to a few hundred labeled and unlabeled cases with liver, kidney, spleen, or pancreas diseases, as well as examples of uterine corpus endometrial, urothelial bladder, stomach, sarcomas, or ovarian diseases.</p><p>Hey, relax. I&apos;m just scaring you. I didn&apos;t have a clue what that meant either. Except that:</p><p><strong>These are <u>CT SCANS </u>(Computed Tomography Scans)</strong>. A CT scan uses X-rays to create detailed, cross-sectional (slice-by-slice) images of the inside of the body. They&apos;re more detailed than traditional X-rays because they produce 3D images by taking multiple X-ray images from different angles and combining them using a computer.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/20220309-FLARE22-Pictures-2.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1254" height="780" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/20220309-FLARE22-Pictures-2.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/20220309-FLARE22-Pictures-2.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/03/20220309-FLARE22-Pictures-2.jpg 1254w" sizes="(min-width: 720px) 720px"></figure><p><strong>So what&apos;s the &quot;3D&quot; output like? </strong><a href="https://www.thinkautonomous.ai/blog/voxel-vs-points/" rel="noopener noreferrer"><strong>Voxels</strong></a><strong>? </strong><a href="https://www.thinkautonomous.ai/blog/point-clouds/" rel="noopener noreferrer"><strong>Point Clouds</strong></a><strong>? </strong>Not exactly. As I said, these are images made of multiple &quot;layers&quot; (slices). So your input image dimension isn&apos;t (512, 512, 3) but (512, 512, 129) or something like this. 
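<p>In practice, these volumes usually ship as NIfTI files (.nii or .nii.gz). Here is a quick sketch with the nibabel library showing that shape and how you would walk through the slices; the file name is just a placeholder, and the segmentation model call is left as a comment:</p><pre><code class="language-python"># Load a CT volume and iterate over its 2D slices (nibabel sketch).
import nibabel as nib
import numpy as np

volume = nib.load("abdomen_ct.nii.gz").get_fdata()    # placeholder file name
print(volume.shape)                                   # e.g. (512, 512, 129)

masks = []
for z in range(volume.shape[2]):
    ct_slice = volume[:, :, z]                        # one 2D slice
    windowed = np.clip(ct_slice, -1000, 1000)         # crude intensity windowing
    # masks.append(my_2d_model(windowed))             # run your 2D segmenter here
masks = np.stack(masks, axis=2) if masks else None    # stack back into a 3D mask
</code></pre>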
You have a multi-dimensional image on which you can apply image segmentation to each of the 2D slices:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/5a9fac4a-e2c4-43dd-82b4-87b760384634.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="836" height="418" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/5a9fac4a-e2c4-43dd-82b4-87b760384634.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/5a9fac4a-e2c4-43dd-82b4-87b760384634.jpg 836w" sizes="(min-width: 720px) 720px"></figure><p><strong>In this example, I used MedSAM to process individual 2D images.</strong> If you do it on the entire 3D CT Scan, you get something like this:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/ScreenRecording2025-03-11at16.48.13-ezgif.com-optimize.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="800" height="396" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/ScreenRecording2025-03-11at16.48.13-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/ScreenRecording2025-03-11at16.48.13-ezgif.com-optimize.gif 800w" sizes="(min-width: 720px) 720px"></figure><p>If you get it, you understand that from these images, we can put that into a software that is going to reconstruct the scan to 3D:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-11-at-19.22.02--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1638" height="1088" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/Screenshot-2025-03-11-at-19.22.02--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/Screenshot-2025-03-11-at-19.22.02--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/03/Screenshot-2025-03-11-at-19.22.02--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-11-at-19.22.02--1-.jpg 1638w" sizes="(min-width: 720px) 720px"></figure><p>From there, people go absolutely nuts and even try to make it into a point cloud (I&apos;m not sure why, but this is cool, shoutout to <a href="https://www.youtube.com/watch?v=3apDWJWe_jg" rel="noopener noreferrer">Beau Seymour&apos;s video</a>).</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/ScreenRecording2025-03-11at19.17.27-ezgif.com-optimize.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="560" height="401"></figure><h3 id="mri-scans-advanced-medical-image-computing">MRI Scans: Advanced Medical Image Computing</h3><p><strong>Magnetic Resonance Imaging (MRI) Scans are another powerful tool in medical imaging.</strong> Unlike CT scans, MRIs use powerful magnets and radio waves to create detailed images of organs and tissues within the body. This technique is particularly great for soft tissue contrast, making it ideal for brain, spinal cord, and joint imaging. 
By leveraging medical image segmentation, MRI scans can aid in the precise identification of tumors, neurological disorders, and musculoskeletal issues.</p><p><strong>Here is an example of an MRI scan and its segmentation task:</strong></p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/ScreenRecording2025-03-11at16.42.07-ezgif.com-optimize.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="800" height="591" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/ScreenRecording2025-03-11at16.42.07-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/ScreenRecording2025-03-11at16.42.07-ezgif.com-optimize.gif 800w" sizes="(min-width: 720px) 720px"></figure><p>So now, let&apos;s see how to process that...</p><h3 id="algorithms-in-the-3d-medical-image-segmentation-domain">Algorithms in the 3D Medical Image Segmentation Domain</h3><p>We already discussed SAM (Segment Anything) and how it can work on individual slices. The reality is, medical image segmentation involves a lot of complex domain knowledge, and it would probably be better to use a specialized artificial intelligence model for optimal performance. Today, in AI, we have two types of models:</p><ul><li>Foundation Models, which are very general and know a bit of everything</li><li>Specific &amp; Labeled Models, which can only process the kind of images they&apos;ve been trained on</li></ul><p>I would like to show you two models, one of each type: TotalSegmentator &amp; VISTA-3D.</p><h4 id="total-segmentator-a-specific-model-for-2d-and-3d-segmentation">Total Segmentator: A specific model for 2D and 3D Segmentation</h4><p>Perhaps one of the most used and well-known &quot;frameworks&quot; for image segmentation of both 2D and 3D data is<strong> </strong><a href="https://arxiv.org/pdf/2208.05868" rel="noopener noreferrer"><strong>TotalSegmentator</strong></a>. 
Rather than being a simple machine learning model, it&apos;s a complete framework that does the automatic labelling.</p><p>The number of classes for CT and MRI data it can segment is gigantic:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/overview_classes_v2--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="2000" height="1127" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/overview_classes_v2--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/overview_classes_v2--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/03/overview_classes_v2--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/overview_classes_v2--1-.jpg 2388w" sizes="(min-width: 720px) 720px"></figure><p>And the model is based on the <a href="https://arxiv.org/pdf/1809.10486" rel="noopener noreferrer"><strong>nn-UNet architecture</strong></a><strong>,</strong> which is similar to UNet, but can also take in different medical imaging modalities.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/nnU-Net_overview--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1392" height="1065" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/nnU-Net_overview--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/nnU-Net_overview--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/03/nnU-Net_overview--1-.jpg 1392w" sizes="(min-width: 720px) 720px"></figure><h4 id="vista-3d-foundation-model-for-3d-medical-image-segmentation">VISTA-3D: Foundation Model for 3D Medical Image Segmentation</h4><p><strong>VISTA-3D</strong> is a 2024 &quot;Foundation model&quot; from Nvidia that works on the 3D patch directly. While being named &quot;foundation&quot; model, it&apos;s incredibly specific to the medical image segmentation tasks. Here, we are PURELY in <a href="https://www.thinkautonomous.ai/blog/voxel-vs-points/" rel="noopener noreferrer">3D Deep Learning</a>.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-11-at-19.33.31--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1234" height="622" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/Screenshot-2025-03-11-at-19.33.31--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/Screenshot-2025-03-11-at-19.33.31--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-11-at-19.33.31--1-.jpg 1234w" sizes="(min-width: 720px) 720px"></figure><p>So we&apos;ve seen a lot:</p><ul><li>2D Segmentation can be done with models like UNet, UNet++, etc... 
(specific), or SAM (foundation)</li><li>3D Segmentation can be done with models like nnUNet/TotalSegmentator (specific), or Vista-3D &amp; SAM (foundation)</li></ul><p>Let&apos;s see examples now...</p><h2 id="example-1-ct-scan-segmentation-with-vista-3d">Example 1: CT Scan Segmentation with Vista-3D</h2><p>In <a href="https://build.nvidia.com/nvidia/vista-3d" rel="noopener noreferrer"><strong>this platform</strong></a><strong> from Nvidia</strong>, I am able to select a CT Scan and call Vista-3D to process it.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/ezgif.com-optimize--1-.gif" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="800" height="453" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/ezgif.com-optimize--1-.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/ezgif.com-optimize--1-.gif 800w" sizes="(min-width: 720px) 720px"></figure><p>Notice how we can select an Abdomen, and then pick all the organs we want to segment. Finally, we can get the view from 3 different &quot;angles&quot; and process that too!</p><h2 id="example-2-ct-scan-segmentation-with-totalsegmentator">Example 2: CT Scan Segmentation with TotalSegmentator</h2><p>On <a href="https://totalsegmentator.com/" rel="noopener noreferrer"><strong>totalsegmentator.com,</strong></a> we can upload images and ask for a complete segmentation. Here, I am going to upload a scan from the FLARE 2022 dataset I mentioned above. The platform returns hundreds of organ masks, all in the weird &apos;nii.gz&apos; format:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-12-at-12.29.54--1-.jpg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="2000" height="1854" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/Screenshot-2025-03-12-at-12.29.54--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/Screenshot-2025-03-12-at-12.29.54--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/03/Screenshot-2025-03-12-at-12.29.54--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/Screenshot-2025-03-12-at-12.29.54--1-.jpg 2274w" sizes="(min-width: 720px) 720px"></figure><p>I can visualize some of these, and see what the output is like:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/03/ab321f88-54eb-445b-96a0-9da8254a2ed1.jpeg" class="kg-image" alt="The Ultimate Guide to Medical Image Segmentation with Deep Learning (2D and 3D)" loading="lazy" width="1800" height="400" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/03/ab321f88-54eb-445b-96a0-9da8254a2ed1.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/03/ab321f88-54eb-445b-96a0-9da8254a2ed1.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/03/ab321f88-54eb-445b-96a0-9da8254a2ed1.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/03/ab321f88-54eb-445b-96a0-9da8254a2ed1.jpeg 1800w" sizes="(min-width: 720px) 720px"></figure><p>Alright! So this is our second example, and both have playable demos! 
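<p>And if you&apos;d rather not upload medical data to a website, TotalSegmentator can also run locally (pip install totalsegmentator). The sketch below is from memory, so double-check the project&apos;s README for the exact API; the file names are placeholders:</p><pre><code class="language-python"># Rough sketch of running TotalSegmentator locally on a CT scan.
# Check the official repo for the exact function signature; this is indicative only.
from totalsegmentator.python_api import totalsegmentator

import nibabel as nib

totalsegmentator("abdomen_ct.nii.gz", "segmentations")   # input scan, output folder

# Each structure typically comes back as its own mask, e.g. segmentations/liver.nii.gz
liver_mask = nib.load("segmentations/liver.nii.gz").get_fdata()
print(liver_mask.shape, liver_mask.max())                # same grid as the input scan
</code></pre>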
Let&apos;s now do a summary...</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>Medical image segmentation helps reduce human errors</strong> by processing 2D and 3D medical images like MRIs and X-rays.</li><li><strong>2D medical image segmentation tasks </strong>include X-Rays, dermoscopy (skin lesion analysis), endoscopy, mammography segmentation (breast), and more...</li><li><strong>UNet and its variants are popular models for 2D medical image analysis,</strong> utilizing CNNs or Transformer approaches. Foundation models like SAM (Segment Anything Model) can also be fine-tuned on medical images, as with MedSAM.</li><li><strong>3D medical image segmentation involves CT (computed tomography) and MRI (magnetic resonance imaging) scans</strong>. They&apos;re called 3D images because they&apos;re made of multiple 2D slices of the same scan, stacked along a third dimension.</li><li><strong>MedSAM can process 2D slices of 3D scans</strong>, allowing individual segmentation of each slice. We can then feed those masks into software that reconstructs a complete 3D image.</li><li><strong>For 3D processing, TotalSegmentator and Vista-3D are solid solutions,</strong> being either specific or foundation based.</li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Next Step?</strong></b><br>Receive my Daily Emails, and get continuous training on Computer Vision &amp; Autonomous Tech. Each day, you&apos;ll receive one new email, sharing some information from the field, whether it&apos;s technical content, a story from the inside, or tips to break into this world; we got you.<br><br><a href="https://www.thinkautonomous.ai/lplb-cuttingedgeengineer" rel="noreferrer">You can receive the emails here</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Video Segmentation: Why the shift from image to video processing is essential in Computer Vision]]></title><description><![CDATA[<p><strong>In 1897, French police faced a difficult problem:</strong> a serial killer named Joseph Vacher was attacking and murdering shepherds, and remained impossible to catch. Every time he was arrested, he gave a different name, changed his appearance, used fake mustaches, wigs, and different clothing styles... and got to disappear without</p>]]></description><link>https://www.thinkautonomous.ai/blog/video-segmentation/</link><guid isPermaLink="false">67b46fe9eaa12c28321be825</guid><category><![CDATA[computer vision]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 18 Feb 2025 15:51:15 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/02/video-segmentation.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/video-segmentation.jpeg" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision"><p><strong>In 1897, French police faced a difficult problem:</strong> a serial killer named Joseph Vacher was attacking and murdering shepherds, and remained impossible to catch. Every time he was arrested, he gave a different name, changed his appearance, used fake mustaches, wigs, and different clothing styles... 
and got to disappear without the police realizing they just controlled France&apos;s most wanted man.</p><p><strong>At the time, France had no national ID system</strong>, <strong>and no way to prove that the man they caught today was the same man they arrested months ago</strong>. That was until an officer named Alphonse<strong> </strong>Bertillon introduced a revolutionary method: <u>anthropometry</u>. It&apos;s a system that labeled criminals based on of 12 unchangeable physical measurements like ear shapes, skull sizes, and limb lengths, that could not be faked.</p><p><strong>One day, Vacher was caught for attacking a woman, and this time, the police used Bertillon&apos;s system to compare his measurements to what they had in their records</strong>: they discovered they just caught France&apos;s most wanted criminal. This time, he could not escape with a warning, and got sent to... yeah &#x2014; the guillotine &#x1F937;&#x1F3FB;&#x200D;&#x2642;&#xFE0F;&#x1F1EB;&#x1F1F7;</p><p>What got Vacher executed wasn&#x2019;t just this one-time capture, but <strong>the ability to analyze a series of events and not just a one-time event.</strong> And this is exactly what this article is about: the shift from frame-by-frame to sequence processing, here in Computer Vision with videos. And this is done via something called <strong>video segmentation.</strong></p><p>So let&apos;s get started:</p><h2 id="what-is-video-segmentation">What is Video Segmentation?</h2><p><strong>Most Computer Vision Engineers spend time learning about image processing,</strong> <strong>but never consider what happens when you use a video.</strong> Yet, tons of architectures today, whether in surveillance, retail, sports analysis, healthcare, or even robotics and self-driving cars &#x2014;&#xA0;now process videos instead of images. The sequence brings something individual images don&apos;t, just like the Vacher story, where he was able to get judged through all the murders he committed.</p><p><strong>So let&apos;s take a less deadly scene &#x2014;&#xA0;shoplifting detection in retail</strong>. 
There is a startup I once interviewed for named <a href="https://www.veesion.io" rel="noopener noreferrer"><strong>Veesion</strong></a> &#x2014;&#xA0;that has this amazing video on their homepage:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/ezgif.com-optiwebp.webp" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="322" height="322"><figcaption><span style="white-space: pre-wrap;">Shoplifting Demo by Veesion</span></figcaption></figure><p><strong>Can you see everything happening here?</strong></p><ul><li>We have the <a href="https://www.thinkautonomous.ai/blog/object-tracking/" rel="noopener noreferrer"><strong>object tracking</strong></a><strong> </strong>(the second man is moving from aisle 1 to aisle 2)</li><li>The <strong>event</strong> <strong>detection</strong> (at 00:03, a man puts an item in a pocket)</li><li>The <strong>action</strong> <strong>classification</strong> (of putting something in a pocket)</li><li>The <strong>video</strong> <strong>decomposition</strong> (shoplifting from 00:02 to 00:03 &#x2014;&#xA0;standing from 00:03 to 00:06)</li><li>The <strong>people</strong> <strong>counting</strong> (2 people in the video, one is obstructing the other)</li><li>And more...</li></ul><p>Among these, there is the idea of &quot;<u>segmenting</u>&quot; the scene to track the shoplifters through the video. You can see the hands being in red, consistently from frame to frame. So this is the idea of Video Processing, and Video Segmentation is a sub-branch of it focus on the task of segmenting a scene.</p><p><strong>There are two types of Video Segmentation tasks:</strong></p><ul><li>Video <strong><u>Object</u></strong> Segmentation (VOS)</li><li>Video <strong><u>Semantic</u></strong> Segmentation (VSS)</li></ul><h3 id="video-object-segmentation">Video Object Segmentation</h3><p><strong>In Video Object Segmentation, we are doing exactly what I did in this video</strong>. I define an object to track, send the video to the model, which tracks the object consistently across frames. It&apos;s purely &quot;object&quot; based, and is NOT used in a supervised way. For example, you can use semi-supervised video object segmentation, where you define an object on Frame 1, and let the model track it across the next frames... Or you can use totally unsupervised video object segmentation, where you won&apos;t even mention the objects to track.</p><p>Let me show you an example where I am shoplifting (muahahah):</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/sam2_masked_video_1739879848204-ezgif.com-optiwebp.webp" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="800" height="450" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/sam2_masked_video_1739879848204-ezgif.com-optiwebp.webp 600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/sam2_masked_video_1739879848204-ezgif.com-optiwebp.webp 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Video Object Segmentation tracks objects consistently over time</span></figcaption></figure><p>See? We are able to track my head &amp; hands in blue, and the phone in yellow! That is the idea we&apos;re interested in... 
And even more when we can do this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/sam2_masked_video_1739879741562-ezgif.com-optiwebp.webp" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="800" height="450" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/sam2_masked_video_1739879741562-ezgif.com-optiwebp.webp 600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/sam2_masked_video_1739879741562-ezgif.com-optiwebp.webp 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Floating head demo for no other purpose than getting you out of your exhausting boredom at work</span></figcaption></figure><p>Now, to be fair, the floating head experiment may NOT be the most useful thing in this example, but the stolen phone is. Now think of everything we can do: keep track of cells in health-related videos, keep track of a player when analysing a football match, and a lot more...</p><h3 id="video-semantic-segmentation">Video Semantic Segmentation</h3><p><strong>In Video Semantic Segmentation, we really go down to the pixel level, and rather than focusing on segmenting objects, we focus on the scene</strong>. The output is going to look extremely similar to a normal image segmentation task.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/ScreenRecording2025-02-18at13.47.29-ezgif.com-optimize.gif" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="560" height="315"><figcaption><span style="white-space: pre-wrap;">Video Semantic Segmentation</span></figcaption></figure><p>Just like image segmentation, you can also use video <a href="https://www.thinkautonomous.ai/blog/instance-segmentation/" rel="noopener noreferrer"><strong>instance segmentation</strong></a>, video panoptic segmentation, video semantic segmentation, and so on... And of course, there is the benefit of doing background extraction, to then process only what&apos;s been segmented, for example in a case like <a href="https://www.thinkautonomous.ai/blog/lane-detection/" rel="noopener noreferrer"><strong>lane detection</strong></a><strong> </strong>in self-driving cars:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/ScreenRecording2025-02-18at14.04.01-ezgif.com-optimize.gif" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="632" height="302" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/ScreenRecording2025-02-18at14.04.01-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/ScreenRecording2025-02-18at14.04.01-ezgif.com-optimize.gif 632w"><figcaption><span style="white-space: pre-wrap;">What could we do with this lane line information? Or these cars?</span></figcaption></figure><p>But by now, you may have a question:</p><h2 id="how-is-video-segmentation-different-than-image-segmentation">How is Video Segmentation different from Image Segmentation?</h2><p><strong>I mean, is it really different?</strong> It kinda looks similar to image segmentation, right? 
And yes, while it may be the case for some examples, like the one I just gave with video semantic segmentation, most of the tasks will be different and give different outputs.</p><p><strong>To put it simply: Video Segmentation is about processing videos.</strong> You don&apos;t process image per image, you process video frames immediately. And this has several advantages:</p><ul><li><strong>The model can track multiple objects</strong> even though they&apos;re occluded (similar to what object tracking would do, but using video sequences)</li><li><strong>The model can segment specific scenes</strong> you&apos;re looking for (a blood cell changing sizes, a car entering a scene, a man stealing something)</li><li><strong>It ensures temporal consistency</strong>, meaning an object that appears in one frame keeps the same identity/color across the entire video, enabling tracking at the same time.</li><li><strong>It understands object motion</strong>, meaning it can predict where an object will be in the next frame instead of treating every frame as an isolated image (thanks mainly to video instance segmentation)</li><li><strong>For some models, it can be more efficient</strong>, since instead of running image segmentation on each frame separately, the model processes a video sequence, leveraging temporal information to process frames together, reducing redundant computations.</li></ul><p><strong>So, how does that work?</strong> What type of model does this? I do NOT have a specific &quot;do this do that&quot; template to share with you, but by studying examples, we could probably understand what&apos;s required to make a Video Segmentation algorithm work...</p><h2 id="example-1-vistr-video-instance-segmentation-transformer">Example 1: <strong>VisTR (Video Instance Segmentation Transformer)</strong></h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.26.17.jpg" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="1114" height="504" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.26.17.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.26.17.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.26.17.jpg 1114w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Video Instance Segmentation with Transformers (</em></i><a href="https://openaccess.thecvf.com//content/CVPR2021/papers/Wang_End-to-End_Video_Instance_Segmentation_With_Transformers_CVPR_2021_paper.pdf" target="_blank" rel="noopener noreferrer"><i><b><strong class="italic" style="white-space: pre-wrap;">source</strong></b></i></a><i><em class="italic" style="white-space: pre-wrap;">)</em></i></figcaption></figure><p>The first paper looks terribly simple. Let&apos;s try to understand the different blocks:</p><ul><li><strong>Input</strong>: First, we process raw video data, it&apos;s purely a sequence of images sent to the CNN</li><li><strong>Backbone</strong>: Then a normal 2D CNN processes each frame independently before concatenating the feature maps</li><li><strong>Video Processing:</strong> This is fed to a Transformer, known to process sequences quite well. 
However, we modify this transformer a bit to not just receive a positional encoding, but also a <em>temporal encoding</em>.</li><li><strong>Output:</strong> Finally, the output of the decoder predicts instances for each pixel, with a sequence matching strategy</li></ul><p>The training is done after obtaining labeled data from the <a href="https://youtube-vos.org/dataset/vis/" rel="noopener noreferrer"><strong>YoutubeVIS dataset</strong></a>, and the backbone is initialized with the weights of<strong> </strong><a href="https://arxiv.org/abs/2005.12872" rel="noopener noreferrer"><strong>DETR</strong></a>.</p><p>The detailed version looks like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.36.50.jpg" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="1756" height="614" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.36.50.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.36.50.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-18-at-15.36.50.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.36.50.jpg 1756w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">VisTR detailed</span></figcaption></figure><p>As you can see, we have a video processing pipeline, where the transformer is actually aware of the frames. The segmentation process ends by matching pixels with instances. This is done using Bipartite Matching (<a href="https://www.thinkautonomous.ai/blog/hungarian-algorithm/" rel="noopener noreferrer"><strong>the Hungarian Algorithm</strong></a>). More subtle blocks exist, and I invite you to read the paper for more...</p><h2 id="example-2-sam-2-segment-anything-2">Example 2: SAM 2 (Segment Anything 2)</h2><p>If you didn&apos;t live in a cave around 2023, you probably heard of Segment Anything&#xA0;&#x2014; the segmentation model that could find <strong><em>any</em></strong> object in an image. Recently, it got an upgraded version called <a href="https://scontent.fcdg3-1.fna.fbcdn.net/v/t39.2365-6/464917098_581932941165933_4465312900778079623_n.pdf?_nc_cat=105&amp;ccb=1-7&amp;_nc_sid=3c67a6&amp;_nc_ohc=Mn0M6N9O9K4Q7kNvgHsDXZ8&amp;_nc_oc=AdiskhA1_LoHfyJs-eCrqi0Ff4_AhWlmF71ArIj0MOtfkVFvl0S3CBlghheMqNnFj7A&amp;_nc_zt=14&amp;_nc_ht=scontent.fcdg3-1.fna&amp;_nc_gid=AowO5fmUshA8NSDSOp9SkAs&amp;oh=00_AYDPLXLOi0edVnOB48aBIjiWzvYPIFrIwWkimA0rxel2Dg&amp;oe=67BA6932" rel="noopener noreferrer"><strong>SAM2</strong></a>, which is designed to process videos. 
Let&apos;s take a look:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="1324" height="846" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.50.11.jpg 1324w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Image vs Video Segment Anything</span></figcaption></figure><p><strong>As you can see, SAM 2 differs from SAM by the addition of a <u>memory block</u>,</strong> made of a memory attention module, a memory encoder, and a memory bank that stores the past frames, and helps with temporal consistency.</p><p>If you play with <a href="https://sam2.metademolab.com/demo" rel="noopener noreferrer"><strong>the online demo</strong></a>, you will find that the model starts by asking you to click on an object, so it can keep tracking it. So at frame 0, you click the object you want to track, and then the model tracks it across the entire sequence...</p><p><strong>This is called &quot;Promptable&quot; Visual Segmentation.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.44.04.jpg" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="1616" height="614" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.44.04.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.44.04.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-18-at-15.44.04.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.44.04.jpg 1616w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">On frame 1, we click on the dog&apos;s tongue. On the next frame, the tongue is tracked consistently. When the model fails, we manually click on it to restart the tracking</span></figcaption></figure><p>This is no different from the original SAM model, and in fact, it&apos;s using the same &quot;prompt encoder&quot;.
So let&apos;s see the details of the model:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.53.42.jpg" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="2000" height="701" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-18-at-15.53.42.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-18-at-15.53.42.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-18-at-15.53.42.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-18-at-15.53.42.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Detailed SAM2 graph</span></figcaption></figure><ul><li><strong>Prompt Encoder</strong>: As expected, we begin by clicking objects, which generates a &quot;prompt&quot;, and send it to the same encoder as Segment Anything to track the object across each image</li><li><strong>Image Encoder</strong>: We then send the entire video to the image encoder, which is a masked autoencoder</li><li><strong>Memory Attention</strong>: Uses vanilla attention to condition the current frame features on the past frames&apos; features and predictions as well as on any new prompts</li><li><strong>Memory Bank: </strong>It retains information about past predictions for the target object in the video by maintaining a FIFO (first in first out) queue of memories of up to N recent frames.</li><li><strong>Mask Decoder (prediction)</strong>: Similar to SAM, but accounting for previous memory information</li></ul><p>So, you saw a second way to build a video segmentation algorithm. The first way was fully transformer based; this second way adds the somewhat robotic &quot;memory bank&quot;, because this model is a &quot;hybrid&quot; between 100% video processing and frame-by-frame processing.</p>
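<p><em>To make this concrete, here is a minimal, hypothetical Python sketch of that hybrid loop: a click prompt on frame 0, then every new frame conditioned on a FIFO memory bank of recent predictions. The function names, shapes and memory size are placeholders for illustration, not SAM 2&apos;s actual API:</em></p><pre><code class="language-python">
# Toy sketch of the SAM 2-style loop (not Meta's actual API): one click prompt
# on frame 0, then each new frame is conditioned on a FIFO memory of past frames.
from collections import deque
import numpy as np

N_MEMORY = 7  # hypothetical memory size (number of past frames kept)

def encode_image(frame):            # stand-in for the image encoder
    return frame.mean(axis=-1)

def encode_prompt(click_xy):        # stand-in for the prompt encoder
    return np.array(click_xy, dtype=np.float32)

def memory_attention(feat, memory): # stand-in: condition features on memory
    return feat if not memory else feat + sum(memory) / len(memory)

def decode_mask(feat, prompt):      # stand-in for the mask decoder
    return feat &gt; feat.mean()

def segment_video(frames, click_xy):
    prompt = encode_prompt(click_xy)          # click given on frame 0 only
    memory = deque(maxlen=N_MEMORY)           # FIFO memory bank
    masks = []
    for frame in frames:
        feat = encode_image(frame)
        feat = memory_attention(feat, memory) # use past frames + predictions
        mask = decode_mask(feat, prompt)
        memory.append(feat * mask)            # push a "memory" of this prediction
        masks.append(mask)
    return masks

video = [np.random.rand(64, 64, 3) for _ in range(30)]
masks = segment_video(video, click_xy=(32, 32))
print(len(masks), masks[0].shape)             # 30 (64, 64)
</code></pre>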
<h2 id="image-vs-video-segmentation-worth-the-trouble">Image vs Video Segmentation: Worth the trouble?</h2><p>I would say yes, especially considering all the use cases that can benefit from video segmentation. For example, <strong><em>surveillance with massive occlusions </em></strong>(in a crowd, with walls, trees, ...) where standard object tracking would be limited, <strong><em>video editing</em></strong>, where for example, we want to remove an object not from one frame, but from an entire scene, <strong><em>sports analytics</em></strong>, entirely based on motion, <strong><em>cell tracking</em></strong> (for example, division of cells, which can only be seen via videos), <em><strong>shoplifting</strong> <strong>detection</strong></em> (which can&apos;t really be seen in an image), <strong><em>fire spreading</em></strong>, and more...</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/top-ezgif.com-optiwebp.webp" class="kg-image" alt="Video Segmentation: Why the shift from image to video processing is essential in Computer Vision" loading="lazy" width="294" height="224"><figcaption><span style="white-space: pre-wrap;">Examples of Video Segmentation when image isn&apos;t sufficient</span></figcaption></figure><p>You can see this article for the normal <a href="https://www.thinkautonomous.ai/blog/image-segmentation-use-cases/" rel="noopener noreferrer"><strong>image segmentation use cases</strong></a>, and I highly recommend you augment it in your mind with these video examples I provided. So as a rule:</p><ul><li>For most cases, don&apos;t replace all your image segmentation pipelines with video pipelines</li><li>But for the cases where segmentation fails because you need to understand video, do it!</li></ul><p>Alright, we&apos;ve seen a lot, let&apos;s do a summary...</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><p>Congratulations on getting this far! Let&apos;s summarize what we learned:</p><ul><li><strong>In many cases, analyzing a single image fails</strong>. When video is essential, you have to use Video Computer Vision models.</li><li><strong>Video segmentation is segmentation applied to video processing</strong>, it&apos;s used in various fields like surveillance, retail, sports analysis, shoplifting detection (or detecting suspicious behavior of any kind) and healthcare.</li><li><strong>Video Segmentation splits into two categories</strong>: Video Object Segmentation and Video Semantic Segmentation.</li><li><strong>Video Object Segmentation (VOS) focuses on tracking defined objects across video frames.</strong> Many applications like SAM2 are semi-supervised, because you give the model a prompt and an initial object to track.</li><li><strong>Video Semantic Segmentation focuses on pixel-level scene segmentation</strong>, it can also be instance or panoptic based, and the output may resemble that of standard image segmentation.</li><li><strong>Some models like VisTR can be 100% video processing based</strong>. This model uses transformers for video instance segmentation.</li><li><strong>Other models can process frames one by one</strong>, but rely on a memory bank.
In the case of SAM2, frames are processed both as a video and one by one (to keep track of the same object)</li></ul><h3 id="next-steps">Next Steps</h3><p>A few articles you can read:</p><ul><li><a href="https://www.thinkautonomous.ai/blog/computer-vision-from-image-to-video-analysis/" rel="noreferrer"><strong>Introduction to Video Processing</strong></a> &#x2014;&#xA0;an old post (you can see my writing style is much different) but a good overview of Video Processing.</li><li><a href="https://www.thinkautonomous.ai/blog/object-tracking/" rel="noreferrer"><strong>A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars</strong></a><strong> -</strong> very related to video object processing, but without the segmentation part (bounding boxes).</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">If you want to learn more about video computer vision</strong></b>... I have an App full of Computer Vision models and videos. Inside, I&apos;m showing you how to do lane detection, how Waymo&apos;s algorithms work (self-driving cars), and a lot more!<br><br><a href="https://www.thinkautonomous.ai/sdc-app/" rel="noreferrer">It&apos;s all in my App, along with 5+ hours of advanced Computer Vision content &#x2014; available when you join my daily emails. Here is where you can learn more.</a></div></div>]]></content:encoded></item><item><title><![CDATA[Functional Safety Engineer: The Job that 'certifies' self-driving cars]]></title><description><![CDATA[What is functional safety in self-driving cars? What does a functional safety engineer do? In this post, we'll try to understand how to certify self-driving car code, and make it safe to drive in the streets]]></description><link>https://www.thinkautonomous.ai/blog/functional-safety/</link><guid isPermaLink="false">67a0a0a55b2944097abedb32</guid><category><![CDATA[self-driving cars]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 04 Feb 2025 19:46:04 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/02/functional-safety.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/functional-safety.webp" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars"><p><strong>In 2019, I was an Autonomous Shuttle Engineer, working for a company that got a thrilling opportunity: </strong>to equip Paris&apos; transportation system with our autonomous shuttles. This was a golden opportunity many don&apos;t have, but the client was known to be a ruthless selector. Many others perished while trying to be &quot;approved&quot;.</p><p><strong>With high hopes, our team prepared for the demo day for months.</strong> We meticulously reviewed the client&apos;s 100+-point checklist, ensuring our shuttle met all requirements from real-time operations to autonomy measures. One day, a team of 5 was called to begin the process in a secret underground site. It was going to begin.</p><p><strong>The experimentation lasted days, in which each of the items was reviewed. Came the final test: Cyber-Security.</strong> The client made a phone call, and within 30 seconds, an engineer with a thinkpad came and entered the shuttle. &quot;Oh great! We can charge our phones!&quot; he said, amused. &quot;What a mistake!&quot;
My colleagues were <u>sweating</u>, horrified at the thought of what this young man could do... and they were right: In just five minutes, using only a USB stick, he had taken control of the vehicle and made it drive all across the room. The room went silent, as everyone realized our chance had slipped away.</p><p>Checkmate.</p><p><strong>Many engineers join the self-driving car world for the same reasons I did</strong>: it&apos;s exciting, it&apos;s interesting, it&apos;s a passion, it&apos;s impactful, it&apos;s just... wow. Yet, nearly all the engineers who are in the &quot;learning&quot; group and have never joined a real self-driving car company have absolutely zero visibility into what it takes to certify a vehicle. Cyber security is one part of it, but there is also the automotive level, the software level, and more...</p><p>So in this post, I will try to introduce you to the concept of safety, from an autonomous tech engineer&apos;s point of view. This means&#xA0;&#x2014; this post won&apos;t be for expert functional safety engineers, but for those who want an introduction.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text">Speaking of safety, one of the most vital elements in a safety system is <b><strong style="white-space: pre-wrap;">redundancy</strong></b>. This article does not focus on it, but I built a full video breaking down <b><strong style="white-space: pre-wrap;">Mobileye&apos;s redundancy system to achieve functional safety</strong></b>. It&apos;s only for those subscribed to my private emails. <b><strong style="white-space: pre-wrap;">Go </strong></b><a href="https://edgeneers.thinkautonomous.ai/posts/content-library-updates-mobileyes-true-redundancy-system" target="_blank" rel="noopener noreferrer"><u><b><strong class="underline" style="white-space: pre-wrap;">here</strong></b></u></a><b><strong style="white-space: pre-wrap;"> to get access!</strong></b></div></div><p>Let&apos;s begin with the fundamentals:</p><h2 id="what-is-functional-safety">What is Functional Safety?</h2><p><strong>Functional Safety is about making sure machines and systems stay safe, even if something goes wrong</strong>. For example, in self-driving cars, it means making sure the car can still drive safely if a part stops working. It can mean verifying that an algorithm works under all conditions, but also that it&apos;s never going to crash, and that if it does, the system has a backup.</p><p>To make it work, we use functional safety standards that determine what is safe to include in a self-driving car by evaluating the potential risks associated with each function and scenario.
You can therefore understand the entire point of functional safety:</p><p><strong><u>To reduce risk to an acceptable level</u>.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-14.26.16--1-.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="1706" height="780" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-04-at-14.26.16--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-04-at-14.26.16--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-04-at-14.26.16--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-14.26.16--1-.jpg 1706w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The goal of functional safety is to make sure autonomous cars are at an acceptable level of risk</span></figcaption></figure><p>Okay, but this shouldn&apos;t be your job, right? It&apos;s someone else&apos;s problem! So you may wonder...</p><h2 id="why-should-i-bother-learning-about-functional-safety">Why should I bother learning about Functional Safety?</h2><p><strong>Let&apos;s say you decide to build an autonomous tech startup and run your algorithms</strong>. Some are open source, some are designed by you. You decide that these are good algorithms, the accuracy is near perfect, and you&apos;re a brutal C++ coder. There is no way you missed anything. Let&apos;s even pretend you really ARE a super-hero and really, the system is perfect...</p><p><strong>You convinced me... but can you convince recruiters? Or your management? Or the suits giving your startup a self-driving permit? </strong>Hey &#x2014; you can&apos;t test without the permit. No matter how good your system looks, you will need to convince the state to deliver you a permit. It can be the State of California, or the Ministry of Transport, or whoever delivers authorizations.</p><p><strong>The problem?</strong> They are NOT experts in safety or self-driving cars. So they will ask you to go via independent organizations, who run functional safety certification programs. Organisms like <em>T&#xDC;V Rheinland</em> and<em> T&#xDC;V SUD</em> (Germany) are the ones &apos;certifying&apos; you. They&apos;re verifying your safety functions, even the safety critical functions (emergency braking), and doing all kinds of silly tests before issuing you the certification.</p><p>Their job is to verify you are compliant with the industry norms.</p><p>But which norms are we talking about?</p><h2 id="what-are-the-different-functional-safety-norms-used">What are the different functional safety norms used?</h2><p><strong>When we say we want to &quot;reduce risk to an acceptable level&quot;... What is an acceptable level?</strong> Are you the one defining it? If an object detector works at 95%... is this okay? No? Yes? Who defines it? If your blinkers fail once every 300,000 miles... is this fine? Or is it every 3 millions miles?</p><p><strong>You can&apos;t be the deciding entity, this is what norms and industry standards are for. </strong>For example, ISO 26262 is a norm. It&apos;s focusing on <u>electronics</u> (buttons, A/C, windows, sensors, computers, ...), and defines a complete process to develop &amp; test your cars. 
It also tells you how to test scenarios, how to grade the risk of any event, and how to reduce that risk.<br><br>Let me share some norms we use in the industry:</p><ul><li>&#x2705; <strong>ISO-26262 is the norm that focuses on <u>failures</u> in electronic and software systems.</strong>&#xA0;It&apos;s going to deal with the question &quot;What happens if the object detector crashes mid-drive? Is there any backup?&quot;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B; Based on how your system is implemented, you will comply more or less with the norm.</li><li>&#x2705; <strong>ISO-21448 <u>verifies</u> the&#xA0;Safety of the Intended Function (SOTIF)</strong>. It ensures perception systems like <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noopener noreferrer"><strong>LiDAR</strong></a>, cameras, and <a href="https://www.thinkautonomous.ai/blog/faster-rcnn/" rel="noopener noreferrer"><strong>object detection</strong></a> perform safely in all conditions<strong>.</strong>&#xA0;&quot;Is your object detector working on all pedestrians? Really? Even in the dark?&quot;</li><li>&#x2705; <strong>ISO-21434</strong> <strong>is the norm focused on <u>cyber-security</u> of the system</strong>. It solves my USB-stick story. And it tells you everything you need to do to ensure your model is free from cyber attacks.</li><li>&#x2705; <strong>A-SPICE is focused on how your project is <u>coded</u>, tested, and maintained.</strong>&#xA0;This means the requirements, the modular and maintainable code, the coding standards &amp; reviews, the software testing, software versions and revisions, bug fixing, lifecycle of the product, etc...</li><li>&#x2705; <strong>UNECE WP.29 Regulations is the <u>compliance</u> with EU autonomous driving laws</strong>. You need at least this one to be allowed to drive autonomously.</li><li>and more... depending on what you want to certify.</li></ul><p>While these are not mandatory, the more of these norms you check, the safer you&apos;ll look. </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4F1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">If you want to learn more about self-driving cars in production</strong></b>... I am doing a full breakdown of Mobileye&apos;s True Redundancy System. Inside, I&apos;m showing you all the different algorithms they test, how their safety guardian fallback works, and discuss their End-To-End algorithm.<br><br><a href="https://www.thinkautonomous.ai/sdc-app/" rel="noreferrer">It&apos;s all in my App, along with 5+ hours of self-driving car content &#x2014; available when you join my daily emails. Here is where you can learn more.</a></div></div><p>So comes a question:</p><h2 id="how-to-know-if-your-robot-complies-with-functional-safety-norms">How to know if your robot complies with Functional Safety norms?</h2><p>There are TONS of ways to do this, and it&apos;s really a profession, but let me share with you 2 important functional safety concepts:</p><ol><li>The V-Model</li><li>The Functional Safety Process to &quot;certify&quot; a function</li></ol><h3 id="the-v-model">The V-Model</h3><p><strong>The V-Model is a widely used framework in functional safety management and software development</strong>. You will find it when trying to comply with ISO26262, but also A-SPICE for example. 
It is structured like a &quot;V,&quot; where the left side represents the concepts/requirements/design phase, the bottom part is the coding phase, and the right side corresponds to the validation/integration/testing phase.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/image-3--1-.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="1912" height="1286" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/image-3--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/image-3--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/image-3--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/image-3--1-.jpg 1912w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The V-Model is heavily used across all industries in software</span></figcaption></figure><p><strong>You can see it as a continuous process,</strong> where you continuously verify that your system behaves as intended in the concept phase. If not, you rework it. It&apos;s evolving, it&apos;s alive, promoting a systematic approach to achieving functional safety in safety related systems.</p><p>In most companies that seriously want to comply with the ISO norms and get the functional safety accreditation, using the V-Model is the best starting point.</p><p>Next:</p><h3 id="the-functional-safety-process-to-certify-a-function">The Functional Safety Process to &quot;certify&quot; a function</h3><p>As we said, we have ISO26262 focusing on electronics, SOTIF focusing on algorithms, and A-SPICE focusing on code/software. Each of these is using the V-Model. Then, to comply with these norms, you&apos;ll need a &quot;process&quot;. This means defining clearly what each of these phases are.</p><p>Here is a 7-Step process:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-15.04.07--1-.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="1878" height="812" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-04-at-15.04.07--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-04-at-15.04.07--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-04-at-15.04.07--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-15.04.07--1-.jpg 1878w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 7 Steps to make a system compliant to ISO norms</span></figcaption></figure><p><strong>The job of a functional safety engineer is to implement this.</strong> This is the &quot;bridge&quot; between systems and production I was telling you about earlier.</p><p>Let me briefly define each element: (credit to a client of Think Autonomous named <a href="https://www.linkedin.com/in/mayur-waghchoure-a5aba5ab/" rel="noopener noreferrer"><strong>Mayur Wagchoure</strong></a> for helping me write this one)</p><h4 id="1-define-the-system"><strong>1. Define the System</strong></h4><p>First, we want to define the system we&apos;re testing. 
For example, <a href="https://www.thinkautonomous.ai/blog/lane-detection/" rel="noopener noreferrer"><strong>lane detection</strong></a>. We want to define the purpose, the scope, the dependencies, and even the normal and edge cases.</p><h4 id="2-hara-hazard-analysis-and-risk-assessment"><strong>2. HARA: Hazard Analysis and Risk Assessment</strong></h4><p>The second point is HARA, in which we want to do:</p><ul><li><strong>HA &#x2014;&#xA0;H</strong>azard <strong>A</strong>nalysis (what could go wrong?)</li><li><strong>RA &#xA0;&#x2014; R</strong>isk <strong>A</strong>ssessment<strong> </strong>(how bad would that be if it went wrong?)</li></ul><p><em>Hazard Analysis</em></p><p>If you want to comply with functional safety standards, the first thing you&apos;ll need to do is account for the different scenarios. I see them into 4 main sections: <strong><em>Car Status</em></strong>, <strong><em>Scenario</em></strong>, <strong><em>Environment</em></strong>,<strong><em> Driving Status.</em></strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/image--2-.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="2000" height="976" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/image--2-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/image--2-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/image--2-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/image--2-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of all possible environments your car may be in (this may vary based on your testing site)</span></figcaption></figure><p>Your car could be turned on, driving in a country road, with rainy conditions, and driving at low speed. Or you could drive at high speed, and accelerate. Or suddenly brake. Or drive in dry roads. Or wet roads. Putting categories into each of these is a way to avoid the summer/winter rookie mistake.</p><p><em>Risk Assessment</em></p><p>To &quot;grade&quot; each function, you then use the formula defined by ISO26262: <strong>Risk = Severity * Exposure * Controllability.</strong></p><p>For example:</p><ul><li>I am testing the emergency braking function, and the risk that it doesn&apos;t activate (Severity = S3)</li><li>I&apos;m driving in urban environment, at 30-60 km/h, which happens all the time (Exposure = E4)</li><li>Urban areas have many pedestrians, it&apos;s very hard to control (Controllability = C3)</li></ul><p>Then what?</p><p><strong>The ISO26262 provides what&apos;s called the ASIL (Automotive Safety Integrity Level) Table</strong>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.aptiv.com/images/default-source/feature-stories/asil-diagram-v01.png?sfvrsn=d47cbf3e_4" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="2442" height="1537"><figcaption><span style="white-space: pre-wrap;">The ASIL Table &#x2014;&#xA0;This attributes a grading based on your Severity, Exposure, and Controllability. 
If you have C1, E1, S1, it means you don&apos;t need to go through millions of tests.</span></figcaption></figure><p>I am NOT going to describe how we do it in this article, but the &quot;RA&quot; phase is about assigning, for every single function and every single scenario, what&apos;s called an <em>ASIL</em> level. These can be A (safe), B (safe), C (risky), or D (risky). We&apos;re trying to see, for each function, is it risky or safe?</p><p>For example:</p><p>If you&apos;re testing an emergency braking system, in a highway scenario, with wet road, snow, and fog... you can imagine it&apos;s an ASIL-D score. Now if you&apos;re on the same scenario, but testing the radio, it&apos;s probably A or B.</p><h4 id="3-set-safety-goals"><strong>3. Set Safety Goals</strong></h4><p><strong>From every potential hazard and risk we have, we want to turn this into a safety goal.</strong> Basically, turn the failure into an opportunity to design a better system. If I have just one LiDAR, and it&apos;s working badly under snow, could I have a better <a href="https://www.thinkautonomous.ai/blog/lidar-and-camera-sensor-fusion-in-self-driving-cars/" rel="noopener noreferrer">LiDAR and a camera</a> instead?</p><p><strong>Here, we will create a list of requirements for the new system</strong>. It&apos;s still the &quot;concept&quot; phase, where we identify the breaking points, and turn this into a better solution. This is the <u>work</u> where you try to think about reducing risk to an acceptable level.</p><h4 id="4-functional-safety-analysis"><strong>4. Functional Safety Analysis</strong></h4><p>Then, we implement things like <strong>FMEA (Failure Mode and Effects Analysis)</strong>&#xA0;to assess potential failure causes, effects, and mitigation strategies. We can also run <strong>FTA (Fault Tree Analysis)</strong>&#xA0;to explore how faults propagate and lead to hazards. We want to identify all causes of errors.</p><h4 id="5-design-safety-mechanisms"><strong>5. Design Safety Mechanisms</strong></h4><p>Then, we&apos;re introducing mechanisms to detect, isolate, or prevent failures (e.g., redundancy, diagnostics, fail-safe systems). This can be watchdog timers, dual-channel systems, degraded operational modes, ...</p><p><strong>For example, one of the Functional Safety Methods is to implement redundancy.</strong> If you have an ASIL-D component (unsafe), you could turn it into 2 ASIL-B ones (somewhat safe).
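</p><p><em>To make the grading and this decomposition idea concrete, here is a toy Python sketch. It uses the common &quot;sum of S + E + C&quot; shortcut to approximate the ASIL table, plus a simplified decomposition rule; it is an illustration only, not the normative ISO 26262 tables:</em></p><pre><code class="language-python">
# Toy illustration of ASIL grading (S + E + C shortcut) and ASIL decomposition.
# Simplified for this article; the normative mapping lives in ISO 26262 itself.

def asil(severity: int, exposure: int, controllability: int) -&gt; str:
    """severity: 1-3 (S), exposure: 1-4 (E), controllability: 1-3 (C)."""
    total = severity + exposure + controllability
    table = {10: "ASIL-D", 9: "ASIL-C", 8: "ASIL-B", 7: "ASIL-A"}
    return table.get(total, "QM")  # below ASIL-A: quality management only

# Emergency braking example from above: S3, E4, C3
print(asil(3, 4, 3))   # ASIL-D

# The S1, E1, C1 case from the ASIL table figure: no heavy testing needed
print(asil(1, 1, 1))   # QM

# ASIL decomposition (simplified view): one ASIL-D requirement can be split
# over two sufficiently independent channels, e.g. D becomes B + B.
DECOMPOSITION = {"ASIL-D": ("ASIL-B", "ASIL-B"), "ASIL-C": ("ASIL-B", "ASIL-A")}
print(DECOMPOSITION["ASIL-D"])   # ('ASIL-B', 'ASIL-B')
</code></pre><p>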
This way, your overall ASIL score is better, and you become compliant.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-16.01.48.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="1610" height="704" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-04-at-16.01.48.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-04-at-16.01.48.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-04-at-16.01.48.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-16.01.48.jpg 1610w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A Functional Safety task called ASIL Decomposition used to decrease risk</span></figcaption></figure><p>In this example, we could imagine that the second LiDAR is different, or that the algorithms behind it are more &quot;deterministic&quot;, don&apos;t use AI, and therefore are safer. The goal of functional safety is to try and reduce as many components to ASIL-A and ASIL-B as possible. &gt;&gt;&gt; This is the acceptable level.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F2;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">But how do the companies </strong></b><i><b><strong class="italic" style="white-space: pre-wrap;">that actually deploy vehicles</strong></b></i><b><strong style="white-space: pre-wrap;"> solve this?</strong></b> I interviewed LOXO, a Swiss startup deploying fully autonomous delivery robots powered by End-to-End Learning. Interested? <b><strong style="white-space: pre-wrap;">It&apos;s in this </strong></b><a href="https://www.linkedin.com/posts/jeremycohen2626_selfdrivingcars-robotics-deeplearning-activity-7295048405435727872-ULjK/?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAA1gjMgB2UeumB1uFo-it1cN7J4OxYZJIDI" target="_blank" rel="noopener noreferrer"><u><b><strong class="underline" style="white-space: pre-wrap;">post</strong></b></u></a><b><strong style="white-space: pre-wrap;">.</strong></b></div></div><h4 id="6-validation-and-verification"><strong>6. Validation and Verification</strong></h4><p>How do we test? This can be field tests, but also simulations, hardware-in-the-loop (HIL), and fault injection testing. You can also here test the Safety of Intended Functionality (SOtIF) &#x2014; how performant is your algorithm? Is it really THAT good?</p><p>Finally:</p><h4 id="7-iterate-validate-and-document"><strong>7. Iterate, Validate, and Document</strong></h4><p>You want to iterate, improve, and document your safety analysis results. 
In the end, it&apos;s a very technical job, but one that involves a lot of paperwork, documentation, diagrams, schematics, and grading, because these are the papers giving you authorizations.</p><p>We have now seen:</p><ul><li>What is functional safety?</li><li>What are the different norms we should comply with?</li><li>How do we comply with these norms (overview)</li></ul><p>Let&apos;s see an example:</p><h2 id="example-mobileyes-primary-guardian-fallback-true-redundancy-system">Example: Mobileye&apos;s Primary Guardian Fallback / &quot;True Redundancy&quot; System</h2><p><a href="https://www.thinkautonomous.ai/blog/mobileye-end-to-end/" rel="noopener noreferrer"><strong>Mobileye</strong></a><strong>, Intel&apos;s self-driving car company, has a very strong functional safety focus</strong>. Their algorithm has 3 distinct channels that are completely different:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-16.06.59--1-.jpg" class="kg-image" alt="Functional Safety Engineer: The Job that &apos;certifies&apos; self-driving cars" loading="lazy" width="1892" height="752" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/02/Screenshot-2025-02-04-at-16.06.59--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/02/Screenshot-2025-02-04-at-16.06.59--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/02/Screenshot-2025-02-04-at-16.06.59--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/02/Screenshot-2025-02-04-at-16.06.59--1-.jpg 1892w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Mobileye&apos;s True Redundancy System (</span><a href="https://www.thinkautonomous.ai/sdc-app" rel="noreferrer"><span style="white-space: pre-wrap;">you can learn more by watching the full video in my app &#x2014; available when you join my daily emails</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>The lane detection is the <u>main</u> channel used to find lane lines</strong>. This can work for example with <a href="https://www.thinkautonomous.ai/blog/lane-detection/" rel="noopener noreferrer">modular deep lane detection</a>. It is <strong><u>verified</u></strong> with <a href="https://www.thinkautonomous.ai/blog/robot-mapping/" rel="noopener noreferrer"><strong>HD Map</strong></a> Extraction &amp; Localization. If they agree, then we&apos;re good, but if they don&apos;t, they&apos;ll extract the lanes from a parallel <a href="https://www.thinkautonomous.ai/blog/tesla-end-to-end-deep-learning/" rel="noopener noreferrer"><strong>end-to-end deep learning</strong></a> algorithm that will act as the &quot;judge&quot; or guardian.</p><p><strong>Do you realize how many algorithms are running in parallel? </strong>They implemented these automatic protection functions in case of failure. They also implemented these safety requirements across the entire system, meaning the electronic systems, the software components, and so on...</p><p><strong>When doing something like this, it&apos;s very important that each function is run using a separate method</strong>, possibly with a separate computer, separate sensors, etc... so that there cannot be a single point of failure (for example, if everything uses the same camera, and this one fails, it&apos;s not functionally safe).</p>
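<p><em>As a rough illustration of this &quot;independent channels that cross-check each other&quot; pattern (a toy sketch with placeholder functions, not Mobileye&apos;s actual system), you could picture it like this:</em></p><pre><code class="language-python">
# Toy sketch of a primary / verification / guardian pattern (illustrative only).
# Each channel should rely on different sensors and algorithms so that there is
# no single point of failure; here they are simple placeholder functions.

def primary_lane_detection(camera_frame):
    return "lane_estimate_A"          # e.g. a modular deep lane detector

def map_based_verification(hd_map, localization):
    return "lane_estimate_A"          # e.g. lanes extracted from an HD map

def guardian_channel(camera_frame):
    return "lane_estimate_B"          # e.g. an end-to-end model acting as judge

def fused_lanes(camera_frame, hd_map, localization):
    primary = primary_lane_detection(camera_frame)
    check = map_based_verification(hd_map, localization)
    if primary == check:              # the two independent channels agree
        return primary
    return guardian_channel(camera_frame)   # disagreement: ask the guardian

print(fused_lanes("frame", "map", "pose"))   # lane_estimate_A
</code></pre>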
<h2 id="wait-does-everybody-really-do-all-of-this">Wait... Does everybody really do all of this?</h2><p>No.</p><p><strong>In fact, many startups don&apos;t have a functional safety team</strong>, <strong>or even have a safety system in place</strong>. In this case, they try to apply it to the safety critical systems, while waiting for the certification process. Some are also in a more favorable state/country that gives permits more easily (to encourage innovation and let startups work on the technology).</p><p><strong>It&apos;s important to understand that complying with ISO norms is NOT mandatory</strong>. In the European Union, you need to comply with the UNECE WP.29 Regulations (traffic laws), but I don&apos;t think the ISO norms are mandatory.</p><p><strong>In fact, Tesla doesn&apos;t comply with the norms, and they are approved to drive in the streets</strong>. They sell cars, and they even sell autonomous cars all across the world. But you&apos;ll note that some of their functions, like FSD (Full Self-Driving), are currently (early 2025) NOT authorized everywhere, like in Europe, because they don&apos;t comply with all the norms.</p><p>Okay, okay, I think we have ENOUGH! Let&apos;s do a summary...</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><ul><li><strong>Functional safety makes sure robots and algorithms operate safely</strong>, even when something goes wrong, by reducing risks to an acceptable level.</li><li><strong>Every engineer working in the field should be introduced to safety. </strong>This defines how you code, but also whether a startup gets authorizations to drive or not.</li><li><strong>Key functional safety norms include ISO 26262</strong> for electronics, <strong>ISO 21448</strong> for algorithms, <strong>ISO</strong> <strong>21434</strong> for cybersecurity, and <strong>UNECE WP.29 </strong>for EU compliance.</li><li><strong>The V-Model is a structured approach in functional safety management,</strong> covering concept, coding, and validation phases to achieve compliance. It has a V shape: Conception - Coding - Testing.</li><li><strong>Functional Safety follows a 7-step process that includes defining systems</strong>, hazard analysis, setting safety goals, and implementing safety mechanisms to ensure compliance.</li><li><strong>The ISO26262 norm defines risks as <em>Exposure</em> <em>* Severity </em>* <em>Controllability</em><em>.</em></strong> An ASIL table then defines, for each function, which grade it has.</li><li><strong>When something is risky (ASIL-C, ASIL-D),</strong> we introduce redundancy, diagnostics, and fail-safe systems to detect, isolate, or prevent failures, enhancing the overall safety integrity level.</li><li><strong>We want to test through simulations</strong>, field tests, and fault injection to ensure safety functions perform under all conditions, meeting the required safety standards.</li></ul><p>Alright, I think we are good!</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4F1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">If you want to learn more about self-driving cars in production</strong></b>... I am doing a full breakdown of Mobileye&apos;s True Redundancy System. Inside, I&apos;m showing you all the different algorithms they test, how their safety guardian fallback works, and discuss their End-To-End algorithm.<br><br><a href="https://www.thinkautonomous.ai/sdc-app/" rel="noreferrer">It&apos;s all in my App, along with 5+ hours of self-driving car content &#x2014; available when you join my daily emails.
Here is where you can learn more.</a></div></div>]]></content:encoded></item><item><title><![CDATA[Faster RCNN in 2025: How it works and why it's still the benchmark for Object Detection]]></title><description><![CDATA[A decade after its release. Faster RCNN is still the ruling king, used in every single paper as the benchmark for object detection. So how does it work? What is behind the Faster RCNN algorithm? Let's find out...]]></description><link>https://www.thinkautonomous.ai/blog/faster-rcnn/</link><guid isPermaLink="false">679757485b2944097abedac1</guid><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Mon, 27 Jan 2025 12:00:02 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/01/faster-rcnn.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/faster-rcnn.jpg" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection"><p><strong>A day in 2018, I was buying my first car, an Audi A1</strong>, when the sales representative pitched me about the &quot;sportback&quot; version;&#xA0;which was supposedly more powerful. &quot;Why?&quot; I asked. And it turns out, it had more <strong><em>horsepower</em></strong> than the classic version. Horsepower? The idea intrigued me, especially since it&apos;s been a century since people replaced horses with cars, and yet, we still use horsepower as the key metric to describe a car.</p><p><strong>This metric, while seeming outdated, is still used as the gold standard in automotive... and it reminds me very much of the Faster-RCNN algorithm in object detection.</strong></p><p>The Faster RCNN algorithm got introduced to the AI community in 2015, and even though it&apos;s been 10+ years now, you still see it listed as the benchmark in most new object detection papers. Somehow, Faster RCNN is still the reference researchers use when they create a new algorithm.</p><p>Take, for example, the paper CO-DETR, which is doing Object Detection with Hybrid Transformers, something super-advanced, released late 2023 (almost 10 years after Faster RCNN), and notice the papers it&apos;s being compared to: <u>Faster RCNN is part of the list</u>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-4.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1156" height="826" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-4.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-4.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-4.jpg 1156w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Faster RCNN algorithm is still the benchmark in most object detection papers, even 10 years after its released</span></figcaption></figure><p>Have you noticed? 
And it&apos;s the case for almost every paper!</p><p><strong>Why?</strong> Back when the algorithm was still fairly new, somewhere around 2018, I was looking for an object detection model to integrate in my autonomous shuttle, and it seemed that the entire market came to 3 conclusions:</p><ul><li>SSD (Single Shot Detector) is the fastest object detection network</li><li>Faster R-CNN is the best model for accuracy, especially with small objects</li><li>YOLO (You Only Look Once) is the best tradeoff between accuracy and speed</li></ul><p>Yet, the original YOLOv3 got replaced several times, and the Faster R-CNN model still continued to live. In this article, I&apos;d like to describe the model to you, explain its key components, and help you understand whether you should spend time on it or not.</p><p>Before we dive into this algorithm, a quick aside:</p><p><strong>When I first learned about object detection</strong>, it was through the Udacity Self-Driving Car Nanodegree where we learned a Machine Learning technique called &apos;HOG+SVM&apos;, which worked like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-7.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="531" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-7.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-7.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-7.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-7.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Traditional &quot;HOG+SVM&quot; object detection (more info on how it works </span><a href="https://www.thinkautonomous.ai/blog/computer-vision-self-driving-cars-introduction/" rel="noreferrer"><span style="white-space: pre-wrap;">here</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>The image was sent to an algorithm that ran a sliding window</strong>, and for each window, extracted Histogram of Oriented Gradient features that it classified using a Support Vector Machine (SVM). It was old, fully &quot;traditional&quot;, but it somehow worked. It wouldn&apos;t win an object detection Oscar, but it did work. The idea was what we call a two-stage object detector:</p><ol><li>We propose regions or bounding boxes (in this case, we defined window dimensions)</li><li>We classify each region</li></ol><p>And here was the output:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Udacity-VehicleDetectionandTracking-ezgif.com-optimize.gif" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="512" height="288"><figcaption><span style="white-space: pre-wrap;">Output of an ML based classifier</span></figcaption></figure><p>The Faster-RCNN algorithm is built exactly on the same &quot;Two Stage&quot; principle, except that it replaces every single one of these techniques with Neural Networks.</p><p>Let&apos;s see how:</p><h2 id="r-cnn-selective-search">R-CNN: Selective Search</h2><p>The first idea was to replace Feature Extraction, which was done using Histogram of Oriented Gradients, with a CNN (Convolutional Neural Network).
Here is how it worked:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-5.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1822" height="632" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-5.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-5.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-5.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-5.jpg 1822w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Added in R-CNN: CNNs instead of HOG features! (taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>The algorithm had a few steps:</p><ol><li><strong>Propose 2,000+ Regions</strong> using <a href="https://learnopencv.com/selective-search-for-object-detection-cpp-python/#:~:text=Selective%20Search%20is%20a%20region,texture%2C%20size%20and%20shape%20compatibility." rel="noopener noreferrer"><strong>Selective Search Algorithm</strong></a></li><li><strong>For each region, extract features</strong> with a CNN (Convolutional Neural Network)</li><li><strong>For each region, classify the features </strong>using SVM (Support Vector Machine).</li></ol><p>The idea was very similar to my project, except that the region proposal step was done using Selective Search, an old Computer Vision algorithm, and the feature extraction was done using CNNs.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-6--1-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1988" height="1332" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-6--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-6--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-6--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-6--1-.jpg 1988w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The old school Selective Search algorithm</span></figcaption></figure><p><strong>The algorithm had several problems</strong>: too many useless regions, too much extraction to do, and every region had to be resized/rewarped manually to match the CNN input layer.</p><p>This wasn&apos;t ideal...</p>
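<p><em>To see why this is so slow, here is a rough Python sketch of that per-region pipeline. The helper names are hypothetical placeholders for illustration, not the original R-CNN code:</em></p><pre><code class="language-python">
# Rough sketch of the R-CNN recipe: propose regions, then run a CNN + SVM on
# each region separately. Helper functions are placeholders for illustration.
import numpy as np

def selective_search(image):                 # stand-in: ~2,000 region proposals
    return [(0, 0, 64, 64), (10, 20, 80, 120)]

def crop_and_warp(image, box, size=(224, 224)):
    return np.zeros((*size, 3))              # every region must be warped to
                                             # the fixed CNN input size

def cnn_features(patch):                     # stand-in for the CNN backbone
    return patch.mean(axis=(0, 1))

def svm_classify(features):                  # stand-in for the per-class SVMs
    return "car", 0.87

def rcnn_detect(image):
    detections = []
    for box in selective_search(image):      # the expensive part: one full CNN
        patch = crop_and_warp(image, box)    # forward pass PER region
        label, score = svm_classify(cnn_features(patch))
        detections.append((box, label, score))
    return detections

print(rcnn_detect(np.zeros((480, 640, 3))))
</code></pre>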
<h2 id="spp-net-adding-a-spatial-pyramid-pooling-spp-block">SPP-Net: Adding a Spatial Pyramid Pooling (SPP) block</h2><p>SPP-Net is an evolution of this paper using a clever technique called Spatial Pyramid Pooling. The idea was as follows:</p><ol><li>Extract the Features using a CNN <u>first</u></li><li>Propose 2,000+ feature map Regions using Selective Search.</li><li><strong>Use Spatial Pyramid Pooling to avoid cropping/warping regions</strong></li><li>Send each feature map to FC layers and classify using SVM</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://courses.thinkautonomous.ai/obstacle-tracking"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image--1-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="532" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image--1-.jpg 2000w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Adding Spatial Pyramids to see at multiple scales (taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>And the idea worked! Working on the features instead of the regions helped remove some noise, and introducing Spatial Pyramid Pooling helped look at the image at multiple scales using multiple &apos;Max Pooling&apos; operations.</p><p><strong>Let me briefly take you back to the Max Pooling idea</strong>: it takes a window (say 2x2) and computes the maximum to reduce the size of the input, so that a 500x500 image becomes 250x250.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image--2-.png" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1570" height="348" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image--2-.png 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image--2-.png 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image--2-.png 1570w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How Max Pooling works</span></figcaption></figure><p>Spatial Pyramid Pooling is doing a similar thing, but at multiple different scales, like (1x1), (2x2), (3x3), etc:</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://courses.thinkautonomous.ai/obstacle-tracking"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image--3-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1804" height="530" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image--3-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image--3-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image--3-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image--3-.jpg 1804w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">(taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>The idea: collect information from different scales.</p>
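<p><em>In PyTorch terms, you can get the flavor of this with adaptive max pooling at several output sizes. This is a simplified sketch of the SPP idea (the grid sizes and channel count are arbitrary), not the paper&apos;s exact implementation:</em></p><pre><code class="language-python">
# Simplified SPP-style pooling: pool the same feature map to several fixed
# grid sizes, then flatten and concatenate into one fixed-length vector.
import torch
import torch.nn.functional as F

feature_map = torch.randn(1, 256, 32, 48)     # (batch, channels, H, W), any H/W

levels = [(1, 1), (2, 2), (4, 4)]             # the "pyramid" of pooling grids
pooled = [F.adaptive_max_pool2d(feature_map, size).flatten(1) for size in levels]
spp_vector = torch.cat(pooled, dim=1)

# 256*(1 + 4 + 16) = 5376 values, regardless of the input H and W:
print(spp_vector.shape)                       # torch.Size([1, 5376])
</code></pre>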
<p>So far, we replaced the Feature Extraction with a CNN, and we added an SPP. What now?</p><h2 id="fast-r-cnn-adding-neural-network-classification-roi-pooling">Fast R-CNN: Adding Neural Network Classification &amp; ROI Pooling</h2><p>Enter the Fast R-CNN detector! And the idea is almost the same, except that it replaces the Spatial Pyramid Pooling with ROI Pooling, and the final SVM with a Multi-Layer Perceptron classifier:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2022-03-16-at-13.03.59.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="572" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2022-03-16-at-13.03.59.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2022-03-16-at-13.03.59.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2022-03-16-at-13.03.59.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2022-03-16-at-13.03.59.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Adding FC Classifiers &amp; ROI Pooling</span></figcaption></figure><p>Notice the key steps:</p><ol><li>Extract the Features using a CNN</li><li>Propose 2,000+ Regions using Selective Search.</li><li><strong>Use ROI Pooling to avoid cropping/warping regions</strong></li><li><strong>Send this to FC layers and classify using a neural network</strong></li></ol><p>There are two key ideas in the Fast R-CNN architecture: ROI Pooling &amp; FC Classification.</p><h3 id="1-roi-pooling-spp-in-better">1. ROI Pooling: SPP in better</h3><p>ROI Pooling is a special case of Spatial Pyramid Pooling, with the added idea of focusing on a specific region.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image--4-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1684" height="402" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image--4-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image--4-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image--4-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image--4-.jpg 1684w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>This is particularly useful in two-stage object detection algorithms</strong> that first propose regions using algorithms such as Selective Search or, as in Faster RCNN, Region Proposal Networks.
<p>ROI Pooling is also much faster than SPP, which computes the pooling several times at different scales; here, the pooling is computed directly for all the regions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image--5-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1396" height="724" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image--5-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image--5-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image--5-.jpg 1396w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>One thing you can notice about this technique is that, by working on different scales, it allows the network to be more accurate, especially with objects of different sizes</strong>. Many object detection models use anchor box techniques to find bounding boxes. The problem is, you must manually define the anchor boxes every time. In this process, we find the regions first (rather than some boxes), and then extract information about these regions.</p><p>We can see this in the Faster R-CNN paper (which uses this same technique) as well:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-5-1.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="439" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-5-1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-5-1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-5-1.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-5-1.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="2-fully-connected-classifier-svm-in-better">2. Fully Connected Classifier: SVM, but better</h3><p>The second addition is to replace the traditional Machine Learning SVM (Support Vector Machine) with a Softmax layer. There isn&apos;t much to comment on here &#x2014; we&apos;re using softmax on the k region proposals to do object classification for each bounding box.</p>
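<p>To make this concrete, here is a minimal sketch of the idea, with made-up scores standing in for the FC layer outputs of a few region proposals:</p><pre><code class="language-python">import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

# Pretend the FC layers produced raw class scores for 3 region proposals
# over 4 classes (background, car, pedestrian, cyclist) -- made-up numbers.
raw_scores = np.array([[0.2, 4.1, 0.3, 0.1],
                       [2.5, 0.4, 0.3, 0.2],
                       [0.1, 0.2, 0.3, 3.0]])
probs = softmax(raw_scores)
print(probs.argmax(axis=1))   # predicted class index for each region
</code></pre>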
<p>And this is an idea used in Fast R-CNN, but also in Faster R-CNN.</p><p>Speaking of which, there is one last evolution to go from Fast R-CNN to Faster R-CNN:</p><h2 id="faster-r-cnn-replacing-selective-search-with-a-region-proposal-network-rpn">Faster R-CNN: Replacing Selective Search with a Region Proposal Network (RPN)</h2><p>When this family of algorithms started in 2013, almost everything was done by traditional techniques:</p><ul><li>We <strong>proposed</strong> <strong>regions</strong> using Selective Search (old school computer vision/segmentation)</li><li>We <strong>did</strong> <strong>feature</strong> <strong>extraction</strong> with CNNs (this was Deep Learning)</li><li>We <strong>classified</strong> the features using SVM (traditional machine learning classification)</li></ul><p>And progressively, we replaced SVM with a Fully-Connected Layer, and we improved the CNN extraction with pyramids. What is left to replace with Deep Learning? The Selective Search algorithm! And yes, it was really an old and slow Computer Vision technique:</p><p><strong>Faster R-CNN replaces Selective Search with a Deep Learning based Region Proposal Network (RPN)</strong>. This way, everything is trained end to end using a unified network. It&apos;s all a single network!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2022-03-16-at-13.04.10.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="548" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2022-03-16-at-13.04.10.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2022-03-16-at-13.04.10.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2022-03-16-at-13.04.10.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2022-03-16-at-13.04.10.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Faster RCNN in 4 steps (taken from my course </span><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noreferrer"><span style="white-space: pre-wrap;">MASTER OBSTACLE TRACKING</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>The full process goes like this:</p><ol><li>Extract the Features using a CNN</li><li><strong>Generate Region Proposals using a Region Proposal Network.</strong></li><li>Use ROI Pooling to avoid cropping/warping regions</li><li>Send this to FC layers and classify using a neural network</li></ol><p>And here we are!</p><p><strong>Now what is this RPN doing?</strong> It serves as the &quot;attention&quot; of the network. It&apos;s designed to generate high-quality region proposals and highlight where there might be <u>objects</u>. Taking an image of any size as input, it uses a fully convolutional network to output a set of rectangular bounding boxes, each with an objectness score. When you think about it, the full name of the Faster R-CNN paper is &quot;Towards Real-Time Object Detection with Region Proposal Networks&quot;.</p><p><strong>How does a Region Proposal Network make the model real-time?</strong> By removing the Selective Search algorithm and working on the feature maps directly, it kills the need for heavy computations and makes the region proposals nearly cost-free.</p>
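<p>As a rough illustration (not the exact code from the paper), a PyTorch-style sketch of such a proposal head could look like this: a small convolution slides over the shared feature map and, at each location, outputs an objectness score and 4 box deltas for each of k anchors. The channel counts and k=9 are assumptions for the example.</p><pre><code class="language-python">import torch
import torch.nn as nn

class TinyRPNHead(nn.Module):
    """Sliding 3x3 conv over the shared feature map, then two 1x1 heads:
    one objectness score and 4 box deltas for each of k anchors per location."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, k, kernel_size=1)     # object vs. not, per anchor
        self.deltas = nn.Conv2d(512, 4 * k, kernel_size=1)     # box refinements, per anchor

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.deltas(x)

rpn = TinyRPNHead()
features = torch.randn(1, 512, 38, 50)          # toy backbone feature map
scores, deltas = rpn(features)
print(scores.shape, deltas.shape)               # (1, 9, 38, 50) and (1, 36, 38, 50)
</code></pre>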
<p>The RPN operates on the <strong>same convolutional feature maps</strong> produced by the backbone CNN (e.g., ResNet or VGG) that are already computed for object detection.</p><p><strong>It then uses the concept of &apos;</strong><a href="https://www.thinkautonomous.ai/blog/anchor-boxes/" rel="noopener noreferrer"><strong>anchor boxes</strong></a><strong>&apos; to do the region proposal generation</strong>. If you&apos;re not familiar with the concept of anchor boxes, the idea is to define boxes of multiple aspect ratios and sizes, and try to have objects fit these boxes. For example, a small vertical anchor box could be a pedestrian seen from far away, while a big vertical anchor box could be a pedestrian close to the camera.</p><p>I highly recommend my article &quot;<a href="https://www.thinkautonomous.ai/blog/anchor-boxes/" rel="noopener noreferrer"><strong><em>Finally Understand Anchor Boxes in Object Detection</em></strong></a>&quot; to grasp the idea.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-8--1-.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="2000" height="816" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-8--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-8--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-8--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-8--1-.jpg 2234w" sizes="(min-width: 720px) 720px"></figure><p>Combining the Fast R-CNN head with this new RPN, the algorithm can simultaneously predict object bounds and class probabilities.</p><p>And we have it! The paper <strong><em>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks </em></strong>performs really well, because it combines all of these ideas. Being a 2-stage algorithm, it&apos;s powerful and can find small objects as well as big objects. It&apos;s a bit slower than one-stage detectors, but the Region Proposal Network is a massive improvement over Selective Search.</p><p>Let&apos;s now see an example:</p><h2 id="is-faster-rcnn-still-used-where-would-it-fit-best">Is Faster-RCNN still used? Where would it fit best?</h2><p>I don&apos;t think people today would use Faster R-CNN as the main choice for their algorithm. There are many powerful, better, and faster object detectors. Yet, if there was one place where I&apos;d use it, it&apos;d be on the task of Traffic Light Detection &amp; Classification.</p><p><strong>You see this all the time; traffic lights have a separate network designed just for them</strong>.</p>
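<p>If you wanted to prototype such a module, one possible starting point is an off-the-shelf pretrained Faster R-CNN, filtered to the traffic-light class. This is only a sketch: it assumes the pretrained COCO model from torchvision, that the traffic-light label index is 10 in its category list, and a made-up image path.</p><pre><code class="language-python">import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

TRAFFIC_LIGHT = 10   # assumed COCO label index for "traffic light" -- verify for your version

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("front_camera.jpg"))     # hypothetical image path
with torch.no_grad():
    output = model([image])[0]                        # dict with boxes, labels, scores

keep = [i for i, (lbl, s) in enumerate(zip(output["labels"], output["scores"]))
        if lbl.item() == TRAFFIC_LIGHT and s.item() > 0.5]
print(output["boxes"][keep])                          # boxes of detected traffic lights
</code></pre>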
<p>These networks are usually focused on finding smaller objects, detecting their states, and running on a long-range camera focused on the lights.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-27-at-12.24.11.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="1090" height="642" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-27-at-12.24.11.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-27-at-12.24.11.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-27-at-12.24.11.jpg 1090w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Autoware </span><a href="https://www.thinkautonomous.ai/blog/autonomous-vehicle-architecture/" rel="noreferrer"><b><strong style="white-space: pre-wrap;">Autonomous Vehicle Architecture</strong></b></a><span style="white-space: pre-wrap;"> &#x2014; Here, Faster-RCNN would fit on the Traffic Light Detection &amp; Classification module</span></figcaption></figure><p><strong>This is where you might still find Faster-RCNN used</strong>. Probably more than anywhere else, because this algorithm is accurate and works well with small objects.</p><h2 id="should-i-try-to-recode-it-on-my-own">Should I try to recode it on my own?</h2><p>If you have never implemented object detection, I would recommend starting with the traditional HOG+SVM techniques. Then, you could try running Faster R-CNN, and maybe implement some of its building blocks. Today, YOLO is more dominant, and understanding that algorithm would make a lot of sense too.</p><p>What you need to understand is that the real goal behind object detection is not necessarily to find objects, but rather to extend it to things like <a href="https://www.thinkautonomous.ai/blog/lidar-and-camera-sensor-fusion-in-self-driving-cars/" rel="noopener noreferrer"><strong>LiDAR/Camera Fusion</strong></a> or <a href="https://www.thinkautonomous.ai/blog/object-tracking/" rel="noopener noreferrer"><strong>object tracking</strong></a>. This is where the real value of object detectors is.</p><h2 id="summary-next-steps">Summary &amp; Next Steps</h2><p>You&apos;ve made it through the article! Congratulations! Let&apos;s do a quick summary of everything we learned. First, you probably now understand this image:</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-27-at-12.51.37.jpg" class="kg-image" alt="Faster RCNN in 2025: How it works and why it&apos;s still the benchmark for Object Detection" loading="lazy" width="878" height="620" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-27-at-12.51.37.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-27-at-12.51.37.jpg 878w" sizes="(min-width: 720px) 720px"></figure><p>Here is what it was about:</p><ul><li><strong>Faster R-CNN remains a benchmark for object detection, even a decade after its introduction</strong>. 
The algorithm is known for its accuracy, especially with small objects, and is often compared to newer models like YOLO and SSD.</li><li><strong>Faster R-CNN evolved from traditional techniques</strong>, replacing each of them with neural networks; from region proposals, to feature extraction and classification.</li><li><strong>HOG feature extraction was replaced with CNNs </strong>in Fast-RCN, allowing to build feature maps.</li><li><strong>Spatial Pyramid Pooling, and later ROI Pooling </strong>and anchor boxes got added for better extraction, proposal, and understanding.</li><li><strong>Region Proposal Networks (RPN) replaced Selective Search</strong>, allowing the model to generate high-quality region proposals in real-time.</li><li><strong>A fully-connected </strong>layer<strong> </strong>replaced SVM for classification.</li><li><strong>Despite being slower than some modern detectors,</strong> Faster R-CNN excels in tasks requiring high accuracy, such as traffic light detection.</li><li><strong>The model&apos;s two-stage process involves proposing regions and classifying them</strong>, making it powerful for detecting objects of various sizes.</li></ul><h3 id="next-steps">Next Steps</h3><p>Here are a few articles I&apos;d recommend you learn next to continue your journey:</p><ul><li><a href="https://www.thinkautonomous.ai/blog/anchor-boxes/" rel="noreferrer"><strong>Finally Understand Anchor Boxes in Object Detection (2D and 3D)</strong></a></li><li><a href="https://www.thinkautonomous.ai/blog/computer-vision-for-tracking/" rel="noreferrer"><strong>Computer Vision for Multi-Object Tracking: Live Example</strong></a></li><li><a href="https://www.thinkautonomous.ai/blog/instance-segmentation/" rel="noreferrer"><strong>Instance Segmentation: How adding Masks improves Object Detection</strong></a></li></ul><p>And of course, the most important recommendation of all:</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E5;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Signup for my daily emails</strong></b>, and get access to daily content like this one &#x2014; along with my App &quot;<i><em class="italic" style="white-space: pre-wrap;">Think Autonomous</em></i>&quot;, containing 5+ hours of content on Computer Vision &amp; Self-Driving cars. We cover what startups REALLY do, and help you become engineers in the autonomous tech industry.<br><a href="https://www.thinkautonomous.ai/private-emails?ref=thinkautonomous.ai">Subscribe here and join 10,000+ Engineers!</a></div></div>]]></content:encoded></item><item><title><![CDATA[A complete overview of Object Tracking Algorithms in Computer Vision & Self-Driving Cars]]></title><description><![CDATA[How does Object Tracking work? 
In this article, we'll go from intermediate to advanced, and dive into the different object tracking algorithms you have at your disposal and how they work for self-driving cars]]></description><link>https://www.thinkautonomous.ai/blog/object-tracking/</link><guid isPermaLink="false">678ebb7c5b2944097abeda29</guid><category><![CDATA[computer vision]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Tue, 21 Jan 2025 14:47:10 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/01/object-tracking.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/object-tracking.webp" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars"><p><strong>Have you seen the movie &apos;Limitless&apos;?</strong> I really loved this movie when growing up. It&apos;s the story of Eddie Morra, a broke writer who suddenly gets access to the NZT drug, unlocking immense cognitive capacities. Using it, he&apos;s able to learn quicker, benefit from enhanced memory, focus better, build charisma, and instantly gets an amazing life.</p><p><strong>I love this movie, and in particular the idea of being &quot;limitless&quot;</strong>, which is similar to the movie Lucy, where you start using your brain at 100% capacity, and thus fully exploit it. Tons of Computer Vision students today have the NZT pill in front of them, but refuse to take it. And by this, I mean they have great skills in image processing, and can detect objects and find 2D boxes, but are unable to fully exploit this skill to 100%.</p><p><strong>In order to fully exploit your Computer Vision skills, you&apos;d need object tracking</strong>, and this is exactly what we&apos;ll talk about in this article in 3 points:</p><ul><li><strong>Object Detection vs Object Tracking:</strong> Why you shouldn&apos;t be an Object Detection Engineer, and why Object Tracking Engineers bring more benefits</li><li><strong>Object Tracking Elements: </strong>How to track objects from frame to frame</li><li><strong>Advanced Object Tracking</strong>: 3D Tracking, Deep Tracking, and more...</li></ul><h2 id="object-detection-vs-object-tracking-why-you-shouldnt-be-an-object-detection-engineer-and-why-object-tracking-engineers-are-better"><strong>Object Detection vs Object Tracking: </strong>Why you shouldn&apos;t be an Object Detection Engineer, and why Object Tracking Engineers are better</h2><p>Back when I started in Computer Vision, I often stumbled across courses where the curriculum went like this:</p><ol><li><strong>CV Fundamentals</strong>: Image Processing, Filtering, Resizing, Colorization, OpenCV, computer vision algorithms, etc...</li><li><strong>Neural Network &amp; CNNs</strong>: Neural Nets, MLP Classification, Backpropagation, CNNs Image Classification</li><li><strong>Advanced Computer Vision: </strong>2D Object Detection, CV Projects (car counting, ...), ...</li></ol><p>Somehow, no matter the course I picked &#x2014; and even at university &#x2014; <u>object detection was presented as the utmost cutting-edge project you could work on</u>. And somehow, almost 8 years later, it&apos;s still the case.</p><p><strong>When I worked in the self-driving car field, I realized object detection was merely a handy tool, and NEVER the end goal</strong>. 
No, knowing that a detected object is in a bounding box centered at pixel (200,300) isn&apos;t directly useful, and this because we&apos;re in the pixel space, but also because we have no information about these objects.</p><p><strong>This is something I explain in the presentation page of my </strong><a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noopener noreferrer"><strong>obstacle tracking course</strong></a>, but also in general every time I talk about object detection: <u>a 2D bounding box is NOT the end goal</u>. However, if you have information about that bounding box, it could be useful... For example:</p><ul><li>If you know how long each specific object has been in the frame</li><li>Or how fast they&apos;re going</li><li>Or where they&apos;re heading (even what action they&apos;ll take)</li></ul><p>Then it becomes useful, and can be the origin of many real-world AI applications in retail, self-driving cars, robots, traffic monitoring, sports analytics, visual object tracking, and more..</p><blockquote class="kg-blockquote-alt"><strong>Computer Vision might be represented by the YOLO algorithm and object detection, but the reality is, skills like Object Tracking are far more <u>useful</u> to companies, and often directly follow Object Detection</strong>.</blockquote><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://courses.thinkautonomous.ai/obstacle-tracking"><img src="https://coachtestprep.s3.amazonaws.com/direct-uploads/user-30623/e3505045-564e-4234-ac8d-0aee93502c88/Screenshot%202022-04-04%20at%2015.56.56.png" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1994" height="1070"></a><figcaption><span style="white-space: pre-wrap;">Object Detection vs Object Tracking (source)</span></figcaption></figure><p>To put things very clearly:</p><ul><li><strong>The output of an object detector</strong> is a list of 2D or 3D Bounding Box coordinates</li><li><strong>The output of an object tracking algorithm</strong> is the bounding box PLUS the ID of each, and yes, we can add the speed, class, action, age in the frame, whether it&apos;s occluded or not, and so on...</li></ul><p>But there are many types of object tracking algorithms, and many different algorithms to consider, so let&apos;s immediately jump to the second point:</p><h2 id="the-main-elements-of-object-tracking-how-to-track-objects-from-frame-to-frame">The Main Elements of Object Tracking: How to track objects from frame to frame</h2><p>The initial goal of object tracking is, taking two frames, to associate their bounding boxes. 
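</p><p>In practice, you can picture the per-object output of a tracker as a small data structure; here is a minimal sketch (the fields are illustrative, not a standard format):</p><pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class Track:
    """One tracked object: a detection enriched with identity and history."""
    track_id: int
    box: tuple                       # (x1, y1, x2, y2) in pixels
    label: str                       # e.g. "car", "pedestrian"
    age: int = 0                     # number of frames this track has existed
    velocity: tuple = (0.0, 0.0)     # pixels/frame, estimated by the filter
    history: list = field(default_factory=list)   # past boxes, for motion cues

track = Track(track_id=1, box=(120, 80, 220, 160), label="car")
track.age += 1
print(track)
</code></pre>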
<p>To explain the idea clearly, let me take the example of this scene:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/seq_17-ezgif.com-optimize.gif" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="560" height="375"><figcaption><span style="white-space: pre-wrap;">The scene we&apos;ll run tracking on</span></figcaption></figure><p><strong>Say this is the view from our self-driving car, and we want to do vehicle tracking.</strong> First, let&apos;s remove the background and focus only on the moving objects:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/decomposition-ezgif.com-optimize.gif" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="568" height="429"><figcaption><span style="white-space: pre-wrap;">The same scene with moving objects only</span></figcaption></figure><p><strong>Ohhh this is so cool</strong>. What I just did is called <strong>Gaussian Decomposition</strong>, and it was done solely to impress you with 3D Gaussian Splatting; but whatever, let&apos;s imagine we&apos;re the self-driving car, stopped and looking at these 3 objects.</p><p><strong>How would we assign the IDs?</strong> First, we would need to extract video frames from this scene, so let&apos;s put on our video object tracking lenses and look at it this way:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.35.46.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1830" height="344" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-22.35.46.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-22.35.46.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-20-at-22.35.46.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.35.46.jpg 1830w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">When working on tracking applications, a video is often considered</span></figcaption></figure><p><strong>Now, let&apos;s try to do the tracking for Frame 3 and Frame 4.</strong></p><p>There are 3 main steps you need to understand:</p><ol><li><strong>Object Detection</strong>: YOLO &#x2014; because you only live once</li><li><strong>Data Association</strong>: Bipartite Matching, the Hungarian Algorithm</li><li><strong>Object Tracking</strong>: Kalman Filters</li></ol><h3 id="1-object-detection">1) Object Detection</h3><p>The first step is simple: we detect objects &#x2014;&#xA0;this is where everybody rushes and stops.</p>
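<p>If you want to reproduce this detection step yourself, a few lines are enough. The sketch below assumes the ultralytics package, a small pretrained model, and a made-up frame path:</p><pre><code class="language-python">from ultralytics import YOLO   # assumption: the ultralytics package is installed

model = YOLO("yolov8n.pt")            # small pretrained model (placeholder weights file)
results = model("frame_03.jpg")       # hypothetical path to one video frame

boxes = results[0].boxes
for xyxy, cls, conf in zip(boxes.xyxy, boxes.cls, boxes.conf):
    print(xyxy.tolist(), int(cls), float(conf))   # box corners, class id, confidence
</code></pre><p>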
Since we said we&apos;d focus on frame 3 and 4 only, this would be the output of YOLO:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.15.16.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1360" height="510" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-23.15.16.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-23.15.16.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.15.16.jpg 1360w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">YOLO is cool, but kinda useless when you think about it</span></figcaption></figure><p>The problem here is, if every car is red, there is no tracking, <u>we need IDs</u>, and we need good tracking. Comes step 2.</p><h3 id="2-data-association-bipartite-matching-the-hungarian-algorithm">2) Data Association: Bipartite Matching &amp; The Hungarian Algorithm</h3><p>The second step is to attribute IDs to every box, and this for the entire scene. In this article, I&apos;m going to consider that we&apos;re tracking multiple objects, and thus doing multi object tracking (MOT). If we keep our same example, it should look like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.18.09.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1366" height="514" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-23.18.09.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-23.18.09.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.18.09.jpg 1366w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">What we want: IDs following eachother</span></figcaption></figure><p>And to be clear, it should <u>NOT</u> look like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.19.20.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1364" height="508" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-23.19.20.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-23.19.20.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.19.20.jpg 1364w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">What we don&apos;t want: ID Switch</span></figcaption></figure><p><strong>So how do we do that? 
</strong>You could <u>assign</u> an ID to every single object in frame 3, but these objects have to correspond.</p><p>So here is how we are going to make a good matching algorithm:</p><ol><li>We attribute an ID to each box for the first frame (t-1).</li><li>We match the box from (t-1) to (t)</li><li>We assign the ID to each box for the second frame (t)</li></ol><p>So:</p><ol><li>We attribute an ID to each box for the first frame (t-1).</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.38.54.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="708" height="538" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-22.38.54.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.38.54.jpg 708w"><figcaption><span style="white-space: pre-wrap;">Attributing an ID to each object can be done totally randomly</span></figcaption></figure><p>Then:</p><ol start="2"><li>We match the box from (t-1) to (t)</li></ol><p>But how?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.40.50.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1372" height="520" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-22.40.50.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-22.40.50.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.40.50.jpg 1372w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The next step is about giving a &quot;color&quot; to each of the boxes</span></figcaption></figure><p><strong>This is where we run the data association. We are going to, <u>for each box</u> (and this is important), try to find the closest match with the other box. For this, we&apos;ll consider matching criteria, such as:</strong></p><ul><li>The euclidean distance of the center of the boxes</li><li>The size &amp; shape change of the box</li><li>The class (don&apos;t match a car with a cyclist)</li><li>The IOU (Intersection Over Union) of how these 2 boxes overlap</li><li>The deep association metric each box has with the other</li><li>And more... 
(in my <a href="https://courses.thinkautonomous.ai/obstacle-tracking" rel="noopener noreferrer"><strong>obstacle tracking course</strong></a>, we implement object tracking using a cost function of tons of these costs and more...)</li></ul><p><strong>Let me show you an example for the first car, if we pick the <u>euclidean distance</u> as a metric:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.45.40.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1418" height="544" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-22.45.40.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-22.45.40.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.45.40.jpg 1418w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The euclidean distance is not the best 2D matching solution (more relevant in 3D), but it can pass for this example</span></figcaption></figure><p><strong>As you can see, if we compute the euclidean distance between the red dot and the left car,</strong> <strong>we have a very small distance</strong>. On the other hand, we have a big distance with the right car. And this is how we can do the matching. You are then going to hold what we call a <strong>Bipartite</strong> <strong>Graph</strong> storing all the distances.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.54.04.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1144" height="520" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-22.54.04.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-22.54.04.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-22.54.04.jpg 1144w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A bipartite graph stores all the distances and assigns the lowest to each free element</span></figcaption></figure><p>Finally:</p><ol start="3"><li>We assign the ID to each box for the second frame frame (t)</li></ol><p>The lowest distance is picked and every object is assigned.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.01.17.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1110" height="426" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-23.01.17.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-20-at-23.01.17.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.01.17.jpg 1110w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The matching in action</span></figcaption></figure><p><strong>And this is how we do it!</strong></p><p>Now, I showed you a very simple version, and 
before I get flooded with &quot;But what if...?&quot; questions &#x2014; there is one final idea to talk about:</p><h3 id="3-the-kalman-filter">3) The Kalman Filter</h3><p>When combined with a Kalman Filter, this object tracking method is called <strong>SORT</strong> (Simple Online Realtime Tracking). What does a Kalman Filter do? Over time, you&apos;re going to have a real motion of objects, and thus, the Kalman Filter is going to predict the next position of the box &#x2014;&#xA0;so that the distance gets smaller.</p><p>If I were to illustrate it, it would be:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.05.36.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="688" height="522" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-20-at-23.05.36.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-20-at-23.05.36.jpg 688w"><figcaption><span style="white-space: pre-wrap;">A Kalman Filter helps with objects moving too fast, or videos with low frame rate</span></figcaption></figure><p><strong>The orange box is the predicted box, and the red box is the (t-1) box.</strong> When using a Kalman Filter, we are going to use the orange box for matching instead. So rather than matching the box at time (t-1) with the box at time (t), we match it with its forward prediction at time (t). This has a big impact, because if a car suddenly accelerates, or is about to leave the frame, you can predict it and thus improve your tracking results.</p><h4 id="how-a-kalman-filter-works">How a Kalman Filter works</h4><p>To clarify a bit, we&apos;ll initialize our Kalman Filter with a motion model, say constant velocity, and predict the next position of the box. We may predict that the car is at the same position as it is now. Then, we&apos;ll get the &quot;real&quot; data from YOLO at t+1, and thus we know we were a bit &quot;off&quot; &#x2014;&#xA0;so we calculate that motion, and refine our next prediction... <u>again and again until we&apos;re perfectly able to predict the next position</u>.</p><p>I have a complete article doing a Live Example of Multi Object Tracking with the actual KF values <a href="https://www.thinkautonomous.ai/blog/computer-vision-for-tracking/" rel="noopener noreferrer"><strong>here</strong></a><strong>.</strong></p><p>Awesome, so this is our second point. Before moving on to the advanced algorithms in our third point, let me take a few questions...</p><h3 id="questions-about-the-simple-tracking-methods">Questions about the simple tracking methods</h3><p>A few questions you may have about this object tracking approach:</p><h4 id="is-the-euclidean-distance-standard-what-do-people-use-in-the-field"><strong>&quot;Is the euclidean distance standard? What do people use in the field?&quot;</strong></h4><p>No, the euclidean distance is actually one of the worst metrics you can use. Why? Because imagine we have many objects, or an object comes in front of another. Suddenly, the shortest distance could be attributed to another object, and thus, we could fail the tracking. People like to use the IOU (Intersection Over Union) instead, or the IOU + the distance between the deep convolutional features (we look at the CNN feature map inside the bounding box).</p>
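<p>Here is a minimal sketch of that matching step: an IoU-based cost matrix between the boxes of frame (t-1) and frame (t), solved with the optimal assignment solver from SciPy. The box coordinates are made up, and a real tracker would mix several costs, as discussed above.</p><pre><code class="language-python">import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

prev_boxes = [(100, 100, 200, 180), (400, 120, 520, 200)]   # frame t-1 (made up)
curr_boxes = [(410, 125, 530, 205), (110, 105, 210, 185)]   # frame t   (made up)

cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
rows, cols = linear_sum_assignment(cost)     # optimal matching, Hungarian-style
for r, c in zip(rows, cols):
    print(f"track {r} matched to detection {c} with IoU {1.0 - cost[r, c]:.2f}")
</code></pre><p>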
<strong><u>Often, a combined cost is the best solution.</u></strong></p><h4 id="what-if-2-object-have-a-minimal-distance-to-the-same-object">What if 2 object have a minimal distance to the same object?</h4><p>Haha! That is interesting. Imagine this case:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-2.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1640" height="706" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-2.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-2.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-2.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-2.jpg 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">In this scene, an object is leaving the scene, and another one is entering</span></figcaption></figure><p>As you can see, in this example, object 1 is moving, and a third object is entering the scene. It&apos;s possible that we don&apos;t know exactly how to do the match; maybe the first object moved too much, and thus we have a situation like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image-3.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1866" height="1132" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image-3.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/image-3.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/image-3.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image-3.jpg 1866w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A bipartite graph where the same object is matched with 2 elements</span></figcaption></figure><p>So what if everybody has to be assigned to the middle object? This can happen extremely often, especially when the two frames don&apos;t have the same number of vehicles. This question, and ALL the others like &quot;What if we have occlusions?&quot; or &quot;What if 2 objects are in frame 3 but 5 objects are in frame 4&quot; or &quot;What if an object is mis-detected?&quot;... can be answered with one word: <strong>The Hungarian Algorithm.</strong></p><p>This is the algorithm behind all matching, even the advanced ones with Deep Learning, and if you&apos;d like to understand how it works, I have a complete article about it <a href="https://www.thinkautonomous.ai/blog/hungarian-algorithm/" rel="noopener noreferrer"><strong>here</strong></a><strong>.</strong></p><p>You now have a good idea of how to do multiple object tracking, let&apos;s move to the advanced algorithms.</p><h2 id="advanced-multi-object-tracking-models">Advanced Multi Object Tracking Models</h2><p>In this section, I will still focus on the Multiple Object Tracking problem. Not the single object tracking (image tracking), which is often a simpler Computer Vision problem answered with other techniques.</p><p>In the example I showed you, we were (1) detecting cars and (2) tracking them. We call this tracking-by-detection. In reality, we don&apos;t <strong><em>have</em></strong> to do this. 
Modern approaches today track objects directly: we call this joint detection and tracking. And thus, we have two families:</p><ul><li>Tracking by Detection algorithms</li><li>Joint Detection and Tracking algorithms</li></ul><p>Let&apos;s briefly take a look at each:</p><h3 id="tracking-by-detection">Tracking By Detection</h3><p>You already saw a simple object tracking approach called SORT, which belongs to the first family; and when you add deep CNN metrics, you have Deep SORT! The idea is, rather than simply looking at the euclidean distances, we also add the Deep CNN feature distances as a metric. For example:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/321b5d8f-6119-4870-b4f8-5dbe973a1ca6-1.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="2000" height="1153" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/321b5d8f-6119-4870-b4f8-5dbe973a1ca6-1.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/321b5d8f-6119-4870-b4f8-5dbe973a1ca6-1.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/321b5d8f-6119-4870-b4f8-5dbe973a1ca6-1.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/321b5d8f-6119-4870-b4f8-5dbe973a1ca6-1.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">DeepSORT is about implementing feature distances as a cost</span></figcaption></figure><p>The matching here is done using the IOU, distance, shape costs, etc., but also this new cost, which is going to look &quot;inside&quot; the bounding box.</p><p><strong>What are other approaches from this family?</strong></p><p>For example, <a href="https://arxiv.org/abs/2110.06864" rel="noopener noreferrer"><strong>ByteTrack</strong></a> is pretty well-known:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/teasing--1-.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1546" height="1566" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/teasing--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/teasing--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/teasing--1-.jpg 1546w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">ByteTrack principle</span></figcaption></figure><p><strong>The goal of this algorithm is to match all the detection boxes</strong>, even the low confidence ones, which are associated with the remaining unmatched tracks (when it makes sense). ByteTrack&apos;s &quot;two-stage association&quot; approach (of associating both high-confidence and low-confidence detections) makes it highly beneficial in situations where confidence is inconsistent or where external factors make detection challenging.</p><p>Now let&apos;s see the second family:</p><h3 id="joint-detection-tracking">Joint Detection &amp; Tracking</h3><p>Here are a few state-of-the-art algorithms to consider...</p><h4 id="fair-mot">FAIR-MOT</h4><p><a href="https://arxiv.org/abs/2004.01888" rel="noopener noreferrer"><strong>FairMOT</strong></a><strong> is a multi-object tracking (MOT) framework that is focused on the re-identification problem. 
</strong>The training loss is optimized for both object detection and re-identification (ReID). It has two branches: one for anchor-free detection and another for ReID feature embedding. This is really effective when you have occlusions, dense crowds, and offers real-time performance without compromising tracking accuracy.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-21-at-13.52.43.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1992" height="810" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-21-at-13.52.43.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-21-at-13.52.43.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-21-at-13.52.43.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-21-at-13.52.43.jpg 1992w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">FAIR MOT is built on Re-Identification</span></figcaption></figure><p><strong>Now is this better then the other one? Which could be better?</strong> As you saw, some algorithms are optimized for occlusions, others for using multiple sensors (or from different camera angles), others for sudden velocity changes... etc...</p><p>Now let&apos;s see some examples:</p><h2 id="example-1-waymos-stateful-track-transformers">Example 1: Waymo&apos;s Stateful Track Transformers</h2><p>In 2024, Waymo released an object tracking algorithm called <a href="https://waymo.com/research/stt-stateful-tracking-with-transformers-for-autonomous-driving/" rel="noopener noreferrer"><strong>STT: Stateful Track Transformer</strong></a>. Here is the diagram:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-21-at-13.59.30.jpg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1808" height="644" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-21-at-13.59.30.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-21-at-13.59.30.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-21-at-13.59.30.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-21-at-13.59.30.jpg 1808w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Waymo&apos;s STT</span></figcaption></figure><p>As you can see, this algorithm is based on <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noopener noreferrer">LiDAR</a> (as for almost everything at Waymo), and it begins with the Detection Encoder thad encode all of the 3D detections and extract temporal features for each track. The temporal features are fed into the Track-Detection Interaction module to aggregate information from surrounding detections and produce association scores and predicted states for each track. 
The Track State Decoder also takes the temporal features to produce track states in the previous frame t &#x2212; 1.</p><p>I have a complete and full explanation of this advanced algorithm<strong> in my free app</strong>.</p><p>Next, let&apos;s see another example:</p><h2 id="example-2-4d-perception-with-3d-kalman-filters">Example 2: 4D Perception with 3D Kalman Filters</h2><p>In my article on <a href="https://www.thinkautonomous.ai/blog/3d-object-tracking/" rel="noopener noreferrer"><strong>3D Object Tracking</strong></a>, I talk about object tracking systems that work with 3D Bounding Boxes. In this case, there are 2 different situations:</p><ul><li><strong>We&apos;re still using monocular camera,</strong> and this time tracking 3D Bounding Boxes</li><li><strong>We&apos;re using LiDARs, RADARs, or stereo cameras</strong>, and thus track 3D Bounding Boxes, but have access to a different type of data (<a href="https://www.thinkautonomous.ai/blog/point-cloud-registration/" rel="noopener noreferrer">point clouds</a>, RADAR maps, ...)</li></ul><p>In my course MASTER <a href="https://courses.thinkautonomous.ai/tracking-pack-iii" rel="noopener noreferrer">4D PERCEPTION</a>, I do this with LiDARs, and here is the output:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/4DOutput-ezgif.com-optimize.gif" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="480" height="290"><figcaption><span style="white-space: pre-wrap;">a 4D Perception Project (from my 4D Perception course)</span></figcaption></figure><p>Now that is stunning!</p><p>The key elements of this project are:</p><ul><li><strong>We&apos;re using a </strong><a href="https://www.thinkautonomous.ai/blog/how-lidar-detection-works/" rel="noopener noreferrer"><strong>LiDAR detector</strong></a><strong> like Point-RCNN</strong>, PV-RCNN, or Cas-A to find 3D Bounding box coordinates</li><li><strong>We then have several possible association criteria</strong> such as the 3D IOU, but also the point cloud shapes, the 3D euclidean distance, and more...</li><li><strong>We&apos;re using a second order 3D Kalman Filter</strong> tracking XYZ in a constant velocity motion model.</li></ul><p>You can learn more about this in my course <a href="https://courses.thinkautonomous.ai/4d-perception" rel="noopener noreferrer"><strong>MASTER 4D PERCEPTION</strong></a> (but warning, it&apos;s advanced and I highly recommend you check my introduction course on tracking before going to this one).</p><p>You&apos;ve now been through the end of this article!</p><p>Congratulations! 
Let&apos;s do a quick summary of everything you learned:</p><h2 id="summary">Summary</h2><p>First, there is this image that I built, showing the different stages of tracking:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/tracking.001.jpeg" class="kg-image" alt="A complete overview of Object Tracking Algorithms in Computer Vision &amp; Self-Driving Cars" loading="lazy" width="1920" height="1080" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/tracking.001.jpeg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/tracking.001.jpeg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/tracking.001.jpeg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/tracking.001.jpeg 1920w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The summary of what you can do after building detection skills</span></figcaption></figure><p>As you can see, most engineers who only learn to use object detectors miss the benefits of association &amp; tracking: ID, Age, Uncertainty, Motion, and even (if you predict) future positions, actions, trajectory, etc...</p><p>In a self-driving car, using tracking instead of simple detection is much more useful, and there are many real-time object tracking applications that are creating fantastic startups.</p><p>A few other summaries:</p><ul><li><strong>The main elements of object tracking are (1) detection, (2) association, and (3) prediction</strong> via Kalman Filter.</li><li><strong>In association, we build a bipartite graph holding the objects of (t-1) and those of (t)</strong> and do a matching based on criteria such as euclidean distance, IOU, deep features, etc... While the majority of CV enthusiasts use the IOU, I highly recommend a combined cost function instead.</li><li><strong>The Hungarian algorithm is the algorithm responsible for the association,</strong> and handles all the edge cases such as new objects, old objects, mis-detections, occlusions, etc...</li><li><strong>Kalman Filters are responsible for the prediction, allowing for smoother tracking</strong>. In the case of 3D Tracking, we would use 3D Kalman Filters.</li><li><strong>There are two families of trackers</strong>: tracking-by-detection, and joint-detection-and-tracking. 
The second family promises to do it all in a single network, and can be seen as more advanced; yet, today, both are heavily used (and the first category probably more).</li><li><strong>Many companies from the field use tracking</strong>, such as Waymo with their Stateful Track Transformers &#x2014; and you can learn it all on my free app.</li></ul><h2 id="next-steps">Next Steps</h2><p>If you enjoyed this article on tracking, there are chances you might enjoy these other ones:</p><ul><li><a href="https://www.thinkautonomous.ai/blog/computer-vision-for-tracking/" rel="noreferrer"><strong>Computer Vision for Multi-Object Tracking: Live Example</strong></a></li><li><a href="https://www.thinkautonomous.ai/blog/3d-object-tracking/" rel="noreferrer"><strong>An Introduction to 3D Object Tracking (Advanced)</strong></a></li><li><a href="https://www.thinkautonomous.ai/blog/hungarian-algorithm/" rel="noreferrer"><strong>Exactly how the Hungarian Algorithm works</strong></a></li></ul><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4E5;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Signup for my daily emails</strong></b>, and get access to my App &quot;Think Autonomous&quot;, containing my video breaking down Waymo&apos;s Stateful Track Transformers, and teaching tons of other advanced algorithms.<br><a href="https://www.thinkautonomous.ai/private-emails?ref=thinkautonomous.ai">Subscribe here and join 10,000+ Engineers!</a></div></div><p>You can also see relevant tracking courses <a href="https://www.thinkautonomous.ai/tracking-journey/" rel="noreferrer">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[The main types of sensors in Robotics & Self-Driving Cars (and how much you should know about each)]]></title><description><![CDATA[We often discuss AI & Robotics, but what about the sensors that collect all the data? In this article, we'll see an overview of all the types of sensors used in this field, and how much you should learn about each.]]></description><link>https://www.thinkautonomous.ai/blog/types-of-sensors/</link><guid isPermaLink="false">677ff7095b2944097abed9a4</guid><category><![CDATA[self-driving cars]]></category><category><![CDATA[robotics]]></category><dc:creator><![CDATA[Jeremy Cohen]]></dc:creator><pubDate>Mon, 13 Jan 2025 16:12:47 GMT</pubDate><media:content url="https://www.thinkautonomous.ai/blog/content/images/2025/01/types-of-sensors-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/types-of-sensors-1.webp" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)"><p><strong>&quot;I made it&quot;. </strong>These were my first thoughts back in 2017, after I graduated from my diploma in Internet of Things and got my first internship. After years learning about electrical signals, bluetooth, connectivity, networks, prototyping, and time spent building projects using multiple types of sensors... I was going to be an IoT Engineer &#x2014;&#xA0;and make millions!</p><p><strong>Or so I thought... Because my first project ended up being an epic failure</strong>. The market wasn&apos;t really interested in smart alarms like we all thought it would &#x2014;&#xA0;there was no interest in smart fridges or other smart objects either... 
IoT was flopping &#x2014;&#xA0;and thus, I got sent to another consulting project on AI!</p><p>And AI gave me this pair of glasses and told me &quot;See? You learned about all these sensors, but what really matters is the data they collect&quot;. That was really an eye-opener for me, to the point where I spent most of my energy learning about AI &amp; Robots... and writing about it.</p><p><strong>But sensors matter too </strong>&#x2014;&#xA0;and without them &#x2014; there wouldn&apos;t be any self-driving car in development. This is why I&apos;d like this article to focus on sensors. Not on the AI processing, but really the sensors themselves&#xA0;&#x2014; explaining the main types, and how much you should know about them.</p><p>This article will split sensors into 2 categories:</p><ol><li><strong>Exteroceptive</strong> sensors &#x2014; the external sensors</li><li><strong>Proprioceptive</strong> sensors &#x2014; the internal sensors</li></ol><p>Yes, the wording is scary &#x2014; but you&apos;ll see they don&apos;t bite. Let&apos;s begin.</p><h2 id="exteroceptive-sensors-looking-at-the-outside-world"><strong>Exteroceptive Sensors: Looking at the outside world</strong></h2><p><strong>In robotics or self-driving cars, you need to see the world so you can navigate in it</strong>. Every sensor that is <strong>external</strong> will be part of the exteroceptive sensor category. For example: cameras or stereo cameras, LiDARs (light detection and ranging), RADARs (radio detection and ranging), GPS, ultrasonic sensors &amp; proximity sensors, thermal cameras, infrared sensors, and so on...</p><p>There are many ways we could categorize this first family, but I like to do it by task, for example:</p><ul><li>Perception Sensors</li><li>Localization Sensors</li><li>Environmental Condition Sensors</li></ul><h3 id="perception-sensors">Perception Sensors</h3><p>I assume in this article you have a good idea of what the Perception sensors are. Yet you may not realize how much you need to know about each. For a few sensors in each category, let&apos;s see what you should know:</p><h4 id="cameras">Cameras</h4><p><strong>So you know what a camera is, right? </strong>Well, do you know how to set parameters like ISO? Shutter? Gain? Whether to use grayscale or RGB? Yes? So, do you know how to find the intrinsic and extrinsic parameters of the camera? Or how to use Charuco calibration? What about Stereo Calibration? Do you know how to calibrate using multiple checkerboards?</p><p><strong>Knowing cameras isn&apos;t just knowing how to load images; it&apos;s mostly about knowing the camera parameters very well</strong>.</p>
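<p>To give you an idea of what that looks like in practice, here is a minimal OpenCV checkerboard calibration sketch (the folder of images and the board size are placeholders):</p><pre><code class="language-python">import glob
import numpy as np
import cv2

board = (9, 6)   # inner corners of the checkerboard (placeholder size)
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.jpg"):          # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix, dist holds the distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(K)
</code></pre><p>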
When you truly understand cameras, you can do really wonderful applications, like this <a href="https://www.thinkautonomous.ai/blog/3d-computer-vision/" rel="noopener noreferrer">3D Reconstruction</a> we do in my course<strong> </strong><a href="https://courses.thinkautonomous.ai/stereo-vision" rel="noopener noreferrer"><strong>MASTER STEREO VISION</strong></a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/crestereo-point.gif" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="960" height="532" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/crestereo-point.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/crestereo-point.gif 960w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of 3D Reconstruction done by 2 cameras (</span><a href="https://courses.thinkautonomous.ai/stereo-vision" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>As you can see, we build a complete 3D environment from just cameras! Cameras may give flat 2D images, but when used in stereo mode, you can leverage their 3D properties.</p><p><strong>What&apos;s interesting:</strong></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"> In most self-driving car courses out there, you see an object detection algorithm applied on just one front camera. Yet, I would bet 90% of self-driving car startups NEVER use just one front camera. It&apos;s always minimum 3 front cameras, some surround cameras for a 360&#xB0; view, and thus, you not only need to know how to use one camera, but also 6 or 8 cameras.</div></div><h4 id="lidars">LiDARs</h4><p><strong>Cameras are good with distances, but not as good as LiDARs</strong>. LiDARs/lasers are light sensors that can send a light wave and measure the exact distance of an object. It means they work in the dark, and they can construct a very accurate <a href="https://www.thinkautonomous.ai/blog/point-cloud-registration/" rel="noopener noreferrer">point cloud</a>. 
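</p><p>As a taste of what LiDAR processing looks like, here is a minimal ground-segmentation sketch in the spirit of the GIF below. It assumes the open-source Open3D library and a hypothetical <code>scan.pcd</code> file; the thresholds are made up.</p>
<pre><code class="language-python">import open3d as o3d

# Load a LiDAR scan (hypothetical file) as a point cloud
pcd = o3d.io.read_point_cloud("scan.pcd")

# RANSAC plane fitting: in a driving scene, the dominant plane is usually the road
plane_model, inliers = pcd.segment_plane(
    distance_threshold=0.2,   # points within 20 cm of the plane count as ground
    ransac_n=3,
    num_iterations=1000)

ground = pcd.select_by_index(inliers)
obstacles = pcd.select_by_index(inliers, invert=True)
print("Ground points:", len(ground.points), "- obstacle points:", len(obstacles.points))
</code></pre><p>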
The thing is, there are many <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noopener noreferrer">types of LiDARs</a> (like 2D LiDARs, 3D, or even <a href="https://www.thinkautonomous.ai/blog/fmcw-lidars-vs-imaging-radars/" rel="noopener noreferrer">4D LiDARs</a> &#x2014;&#xA0;but also Time of Flight vs others...), and the output may be intuitive, but the inner technology is quite complex.</p><p>Below is an example of LiDAR processing we do in my course <strong>POINT CLOUDS CONQUEROR</strong>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/ransac3d-ezgif.com-optimize.gif" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="309" height="203"><figcaption><span style="white-space: pre-wrap;">3D Point Cloud Segmentation (</span><a href="https://courses.thinkautonomous.ai/point-clouds" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>LiDARs are excellent sensors, but extremely <u>costly</u></strong> (this is why startups like Tesla don&apos;t use them), easily <strong><u>affected by weather</u></strong> like fog, rain, snow, or even dust, and they <strong><u>can&apos;t measure velocities</u></strong> (when you need to measure the velocity of an object, you have to compute the distance between two frames). This is why most automotive companies use... RADARs!</p><h4 id="3d-and-4d-radars-radio-detection-and-ranging">3D and 4D RADARs (Radio Detection And Ranging)</h4><p><strong>RADARs are very mature (100 years old or so)</strong>, used in TONS of industries, but they&apos;re super un-intuitive. The first time I saw a RADAR output, it just took 30 minutes to get what was on the screen, and I wasn&apos;t even sure. Most <a href="https://www.thinkautonomous.ai/blog/perception-engineer/" rel="noopener noreferrer">Perception Engineers</a> in the AI/Self-Driving Car space avoid them like a disease because of that.</p><p><strong>Because they use radio waves, they aren&apos;t affected by weather conditions</strong>, and because they use the Doppler Effect, they can measure the relative velocity of objects. In the self-driving car world, there are almost 0 courses on RADARs, but I did create a very interesting piece of content for my membership, in which we learn to visualize RADARs, create point clouds, and intuitively understand the output. Here is a look at it:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/range_azimuth_opencv-ezgif.com-optimize.gif" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="640" height="256" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/range_azimuth_opencv-ezgif.com-optimize.gif 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/range_azimuth_opencv-ezgif.com-optimize.gif 640w"><figcaption><span style="white-space: pre-wrap;">Imaging RADAR output</span></figcaption></figure><p><strong>This output has been done using an Imaging RADAR (4D RADAR) from a startup named </strong><a href="https://www.bitsensing.com" rel="noopener noreferrer"><strong>bitsensing</strong></a>. I highly recommend you go check out their product, especially because it&apos;s 4D technology. 
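</p><p>To make those &quot;dimensions&quot; concrete, here is a minimal sketch (all numbers made up) of how a single RADAR detection, given as range, azimuth, Doppler and, on a 4D RADAR, an elevation angle, turns into a position and a radial velocity:</p>
<pre><code class="language-python">import numpy as np

# One hypothetical detection
r = 42.0                      # range in meters
azimuth = np.deg2rad(12.0)    # horizontal angle
elevation = np.deg2rad(3.0)   # only available on a 4D / imaging RADAR
doppler = -6.5                # radial velocity in m/s (negative = closing in)

# Spherical to Cartesian: a classic RADAR stops at x, y (2D + Speed),
# an imaging RADAR adds the Z dimension (3D + Speed)
x = r * np.cos(elevation) * np.cos(azimuth)
y = r * np.cos(elevation) * np.sin(azimuth)
z = r * np.sin(elevation)

print("Position (m):", round(x, 1), round(y, 1), round(z, 1), "- radial velocity (m/s):", doppler)
</code></pre><p>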
Unlike most RADARs that find 2D + Speed, Imaging RADARs find 3D + Speed (they see the Z dimension). Similarly, <a href="https://www.thinkautonomous.ai/blog/fmcw-lidar/" rel="noopener noreferrer">FMCW LiDARs</a> steal the Doppler effect from RADARs to measure speed directly, and become 4D LiDARs as well.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interesting Fact</strong></b>: RADAR Engineers face extremely low competition, and demand from companies for Imaging RADAR Engineers is rising. As an engineer, building RADAR skills can mean facing weak competition and high demand.</div></div><h4 id="other-sensors-infrared-ultrasonic-thermal">Other Sensors (Infrared, Ultrasonic, Thermal ...)</h4><p>Let&#x2019;s continue looking at the types of sensors in the Perception world. When you park your car and hear &#x201C;BIP BIP&#x201D;, you are working with an ultrasonic sensor. These are great sensors for short-range, static objects. On the other hand, RADARs are longer range, and work better with moving objects.</p><p>Infrared or thermal sensors can be a great complement to cameras. For example, <a href="https://www.luxonis.com" rel="noopener noreferrer">Luxonis</a> recently announced it was now selling thermal cameras and stereo cameras with Active Stereo. This means using IR sensors with dot projectors to help the camera find depth, which can be very useful, for example, when an object is in front of a plain wall (and thus, no texture is visible).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/image.jpg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="688" height="430" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/image.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/image.jpg 688w"><figcaption><span style="white-space: pre-wrap;">Active vs Passive Stereo (</span><a href="https://docs-old.luxonis.com/en/latest/pages/depth/#:~:text=Stereo%20depth%20depends%20on%20feature,both%20texture%20and%20lighting%20requirements." rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Now, let&#x2019;s move on to category 2:</p><h3 id="localization-sensors">Localization Sensors</h3><p><strong>Even more misunderstood, GPS receivers rely on satellites in space to triangulate your position</strong>. They&apos;re widely used in self-driving cars, especially when relying on classic localization and mapping techniques. Beyond plain GPS, RTK (Real-Time Kinematic) GPS is now the standard, because while GPS is accurate to ~1 meter, RTK GPS gives centimeter-level accuracy.</p><p>How?</p><p><strong>RTK GPS communicates with a fixed antenna</strong> <strong>(with a known position) which can measure the localization error</strong>. 
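</p><p>Here is a minimal sketch of that correction idea, with made-up coordinates. Real RTK relies on carrier-phase measurements and is far more involved, but the intuition is a shared error measured at a known point and subtracted at the vehicle:</p>
<pre><code class="language-python">import numpy as np

# The base station's antenna position is known precisely from surveying...
base_true = np.array([0.0, 0.0, 0.0])
# ...but its own GPS receiver reports something slightly off
base_gps = np.array([0.5, 0.5, 0.5])

# The difference is mostly shared error (atmosphere, satellite clocks)
correction = base_gps - base_true

# A nearby car receives that correction over radio and applies it to its own fix
car_gps = np.array([120.3, -45.2, 1.1])
car_corrected = car_gps - correction
print(car_corrected)  # much closer to the car's true position
</code></pre><p>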
If the antenna sits at position (0,0,0) but its GPS receiver reports (0.5, 0.5, 0.5), you know you have a 0.5 meter error on each axis, and you can send this correction back to the cars using RTK GPS.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/rtk-ezgif.com-optimize.gif" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="600" height="338" srcset="https://www.thinkautonomous.ai/blog/content/images/2025/01/rtk-ezgif.com-optimize.gif 600w"><figcaption><span style="white-space: pre-wrap;">RTK GPS Principle</span></figcaption></figure><p><strong>In my own personal experience</strong>, I worked on autonomous shuttles that drove through the Polytechnique Campus in France (the country&apos;s most prestigious engineering school), and I remember TONS of problems with GPS, such as:</p><ul><li>Clouds and weather affecting the signal strength</li><li>Tunnels turning our GPS receivers off</li><li>Trees confusing GPS positions</li><li>Or students&apos; weird experiments in some dorm rooms confusing our GPS signals</li></ul><p><strong>For some of these cases, we relied on Ultrawide band technology</strong>, which consisted of setting up a network of fixed reference nodes throughout the environment. These nodes communicated with mobile tags on the robots, providing precise distance measurements through time-of-flight calculations.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">What&apos;s interesting</strong></b>: Most engineers don&apos;t really work with GPS, but when they do, they get surprised by the amount of information received. Just by printing the data, you can tell if a GPS is European or Russian, how many satellites are found, or even how much you can trust the numbers you are seeing. In my experience, weather highly affects GPS, and you may end up relying on vision-based solutions.</div></div><p>Finally:</p><h3 id="environmental-contact-sensors">Environmental &amp; Contact Sensors</h3><p><strong>In late 2024, Waymo announced their 6th gen vehicle was using an array of </strong><a href="https://waymo.com/blog/2024/08/meet-the-6th-generation-waymo-driver" rel="noopener noreferrer"><strong>audio sensors</strong></a><strong> to recognize honks, ambient noise, or sirens</strong>. When you think about it, it makes a lot of sense for a car to use sound. We rely on sound a lot when driving. And the opportunity is even bigger outside of the car world. Using this logic, we could think of humidity sensors, gas sensors, radioactivity sensors, ...</p><p>Now something related:</p><p><strong>If we shift the focus from &quot;cars&quot; to things like wheeled robots</strong>, humanoid robots, or specialized robots like surgery robots, they often do something cars don&apos;t: <u>physical contact</u>. And thus part of their environment is the other objects they touch.</p><p><strong>For example, capacitive sensors are the types of sensors that detect tactile feedback</strong>, proximity, or materials. You can see them as a &quot;skin&quot; for robots. 
Some can also sense humidity, characterize a surface, detect moisture, and more...</p><p>Here is an example of a robot arm equipped with a torque sensor and a grip sensor:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://reachrobotics.com/media/Diagram-03.jpg.webp" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="1837" height="1217"><figcaption><span style="white-space: pre-wrap;">Robotic Arms make contact, and thus are equipped with &quot;contact&quot; sensors</span></figcaption></figure><p>Alright! We have covered the first category of sensors. If we take a look back, we have something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.31.08.jpg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="2000" height="1114" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-13-at-15.31.08.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-13-at-15.31.08.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-13-at-15.31.08.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.31.08.jpg 2054w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The first half of the mindmap: exteroceptive sensors</span></figcaption></figure><p>There are a few we didn&apos;t cover, like flow sensors in water, light sensors, photodiodes, etc... but they aren&apos;t used that much in robots.</p><p>What now? Let&apos;s look into the second category:</p><h2 id="proprioceptive-sensors">Proprioceptive Sensors</h2><p><strong>I was able to show off with my cool GIFs in the first part, wasn&apos;t I?</strong> Well, who&apos;s laughing now? Because I have to write about <u>proprioceptive</u> sensors, and truth is, I never worked that much with these types of sensors before.</p><p>Why?</p><p><strong>Because a proprioceptive sensor is an <u>internal</u> sensor</strong>. It&apos;s something that measures the INSIDE of your robot or vehicle. For example, an odometer measures the rotations of your wheels to estimate how much you moved. An IMU measures your orientation and how YOU are moving. While exteroceptive sensors focus on the others, proprioceptive sensors focus on <u>you</u> (I mean, your robot).</p><p>I would list 3:</p><ul><li>Position sensors</li><li>Motion sensors</li><li>Automotive sensors</li></ul><h3 id="position-sensors">Position sensors</h3><p><strong>Imagine you wake up in the middle of the night, heading for the toilets (oh this is going to be lame).</strong> You can&apos;t see, but you know you should walk 10 steps to get to the throne room. So you walk one step, two steps, three... until you reach 10 steps, and... no toilets? &#x1F6BD; You wave your arm around, but nothing seems to be in the way. Where is the door? So you walk one more step, then two, then thr&#x2014; <strong>BAM</strong>! Found the door. What happened here? You probably estimated your number of steps a bit inaccurately. 
Had you been drinking last night?</p><p><strong>A wheel encoder is a bit similar:</strong> it measures how many rotations your wheels make, and thus can tell you how many meters your car or robot has travelled. It&apos;s extremely useful in localization, SLAM, or similar tasks.</p><p>If you look at an Apple Maps vehicle, you will see the wheel encoders plugged at the bottom:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ci3.googleusercontent.com/meips/ADKq_NaE47mUrFpVxvLCDwkMv0RqEw2q9SiTlnsEhD2vnXCAc_9moI-jcNzM4ZGr28Tc4R8Pzj6PyijRAlwzJknGSxb-tXmGyb8zhyFllBrLYQMzoJg7b3qYgAmUUjf7lZQjXQmxWySzD82hgHaoboemTN8BRF5iFBkRxYVXwrqx3qsryF9_uxasv4GU=s0-d-e1-ft#https://www.dripuploads.com/uploads/image_upload/image/3382756/embeddable_93eb2940-0354-4ea8-9ecd-1290fb8efb40.jpeg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="1800" height="1013"><figcaption><span style="white-space: pre-wrap;">Apple Maps vehicles are equipped with wheel encoders for accurate positioning</span></figcaption></figure><h3 id="motion-sensors">Motion Sensors</h3><p><strong>There&apos;s position, but there&apos;s also motion.</strong> I told you I wasn&apos;t super at ease with proprioceptive sensors, but I did work on odometry, <a href="https://www.thinkautonomous.ai/blog/visual-inertial-odometry/" rel="noopener noreferrer"><strong>Visual Inertial Odometry</strong></a>, LiDAR-Inertial SLAM, Localization, and all of these words. What do they have in common? They use sensors that calculate their motion. See in my course <a href="https://courses.thinkautonomous.ai/slam" rel="noopener noreferrer"><strong>SLAM UNLEASHED</strong></a> how I present the IMU outputs:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-10-at-16.16.23.jpg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="1532" height="866" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-10-at-16.16.23.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-10-at-16.16.23.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-10-at-16.16.23.jpg 1532w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of an IMU output (</span><a href="https://courses.thinkautonomous.ai/ros2" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>An IMU is an Inertial Measurement Unit that outputs orientation</strong> (in quaternion coordinates), an <strong>angular</strong> velocity, and a <strong>linear</strong> <strong>acceleration</strong>. Under the hood, it bundles accelerometers and gyroscopes that tell you how much you accelerated, how much you rotated to the left, etc... Many also report the temperature and even measure the magnetic field to get an absolute heading, and thus know exactly how you are moving.</p><p><strong>Yes, if you used an IMU to record data in 2011,</strong> the readings would likely differ from today&apos;s, because the Earth&apos;s magnetic field has shifted, and your heading reference with it. 
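</p><p>To tie the wheel encoder and the IMU together, here is a minimal dead-reckoning sketch. The wheel radius, tick counts and yaw rate are all made-up numbers, not values from a real robot:</p>
<pre><code class="language-python">import numpy as np

# Hypothetical robot: 10 cm wheel radius, 1024 encoder ticks per revolution, 50 Hz updates
WHEEL_RADIUS = 0.10
TICKS_PER_REV = 1024
DT = 0.02

def dead_reckoning_step(x, y, yaw, ticks, yaw_rate):
    """One update: integrate the gyro for heading, the encoder for distance."""
    distance = 2 * np.pi * WHEEL_RADIUS * ticks / TICKS_PER_REV
    yaw += yaw_rate * DT              # gyro: how much we rotated
    x += distance * np.cos(yaw)       # encoder: how far we moved along the heading
    y += distance * np.sin(yaw)
    return x, y, yaw

# Fake measurements: constant speed, slight left turn
x, y, yaw = 0.0, 0.0, 0.0
for _ in range(500):
    x, y, yaw = dead_reckoning_step(x, y, yaw, ticks=20, yaw_rate=0.05)
print("Estimated pose:", round(x, 1), round(y, 1), "yaw (deg):", round(np.degrees(yaw), 1))
</code></pre><p>Just like the night-time walk to the toilets, the small errors of every step accumulate over time, which is why these sensors are usually fused with exteroceptive ones like GPS or cameras.</p><p>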
We&apos;re getting really precise here, and I do cover this in my Robotic Architect course already, but I wanted to point it out because it&apos;s an important part.</p><p>Other types of motion sensors include accelerometers, like the one in your phone that counts your 10,000 daily steps, or gyroscopes that measure your angular velocity (how fast you rotate).</p><p>Finally:</p><h3 id="regular-automotive-sensors">Regular Automotive Sensors</h3><p>I really didn&#x2019;t get inspired for this one, but when you think about it, a self-driving car must use all the regular car sensors as well.</p><p>For instance, a <strong>pressure sensor</strong> in the tires monitors the air pressure and converts it into an <strong>electrical signal</strong>, ensuring optimal performance and safety while driving. I remember one day, I was driving on the highway, and I noticed that I had to apply some steering to the left to keep my car straight. Somehow, it drifted to the side&#x2026; After a quick check, I realized I had a flat tire. Self-driving cars must do the same.</p><p>In the same spirit, <strong>temperature sensors</strong> keep tabs on critical components like the engine, brakes, and battery, preventing overheating or failure. <strong>Oil level sensors</strong> and <strong>coolant sensors</strong> ensure that the engine runs smoothly, while <strong>fuel level sensors</strong> provide crucial data for range estimation. They&#x2019;re kinda chemical sensors. Even the <strong>wheel speed sensors</strong>, which are integral to anti-lock braking systems (ABS) and traction control, contribute to the decision-making processes of a self-driving car by providing real-time feedback on vehicle dynamics.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Interesting Idea</strong></b>: When learning about self-driving cars, it&apos;s mostly the &quot;autonomous&quot;-related sensors that are taught. 
Yet, <b><strong style="white-space: pre-wrap;">regular automotive sensors</strong></b> remain vital, and while you don&apos;t have to learn them, some jobs like Control Engineer require this.</div></div><p>Okay, so before seeing some examples, let&#x2019;s do a brief recap of that part 2:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.39.55.jpg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="1852" height="1056" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-13-at-15.39.55.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-13-at-15.39.55.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-13-at-15.39.55.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.39.55.jpg 1852w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The second half of the mindmap: proprioceptive sensors</span></figcaption></figure><h2 id="example-how-to-slam-a-nuclear-facility">Example: How to SLAM a Nuclear Facility</h2><p><strong>A few months ago, I was chatting with a client from my 4D Perception course</strong>, who has his own 4D Perception startup &#x2014;&#xA0;and he told me about one of his past experiences working with SLAM for a client from the nuclear field. &quot;Wait... Nuclear &#x2622;&#xFE0F;?&quot; I asked like a Looney Tunes. &quot;Yes! Many SLAM applications are incredible when you have radioactive environments!&quot; he answered before sharing a few details about it.</p><p>I got so intrigued by the idea that I started searching online, and of course... I found something! RNENuclear robots has written a <a href="https://www.mdpi.com/2218-6581/10/2/78" rel="noreferrer">paper</a> and made this video about it (if there is no image &#x2014;&#xA0;the video still works):</p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" loading="lazy" src="https://www.youtube.com/embed/Hj7xt7isOWc?autoplay=1" srcdoc="&lt;html&gt;&lt;head&gt;&lt;style&gt;*{padding:0;margin:0;overflow:hidden}html,body{height:100%}img,span{position:absolute;width:100%;top:0;bottom:0;margin:auto}span{height:1.5em;text-align:center;font:48px/1.5 sans-serif;color:white;text-shadow:0 0 0.5em black}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;a href=&apos;https://www.youtube.com/embed/Hj7xt7isOWc?autoplay=1&amp;mute=1&amp;rel=0&apos;&gt;&lt;img src=&apos;https://i.ytimg.com/vi/Hj7xt7isOWc/maxresdefault.jpg&apos; alt=&apos;Video&apos;&gt;&lt;span&gt;&lt;img src=&apos;https://www.thinkautonomous.ai/blog/content/images/2022/10/playbtn.png&apos; style=&apos;width: 60px; height: 60px;&apos; /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<!--kg-card-end: html-->
<p>When you think about it, there are tons of places we humans can&apos;t go, and where robots equipped with the right types of sensors can work wonders. In this example, the robot works with <strong>gamma dosimeters, which are radioactivity measurement sensors. </strong>A common one is the&#xA0;ThermoFisher RadEye G10 unit, which can cost several thousand dollars.</p><figure class="kg-card kg-image-card"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.53.30--1-.jpg" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="1668" height="1096" srcset="https://www.thinkautonomous.ai/blog/content/images/size/w600/2025/01/Screenshot-2025-01-13-at-15.53.30--1-.jpg 600w, https://www.thinkautonomous.ai/blog/content/images/size/w1000/2025/01/Screenshot-2025-01-13-at-15.53.30--1-.jpg 1000w, https://www.thinkautonomous.ai/blog/content/images/size/w1600/2025/01/Screenshot-2025-01-13-at-15.53.30--1-.jpg 1600w, https://www.thinkautonomous.ai/blog/content/images/2025/01/Screenshot-2025-01-13-at-15.53.30--1-.jpg 1668w" sizes="(min-width: 720px) 720px"></figure><h2 id="example-2-sensor-fusion-of-multiple-types-of-sensors">Example #2: Sensor Fusion of Multiple types of Sensors</h2><p>I talked about SLAM here, but most robots don&apos;t localize using SLAM; they use localization algorithms instead. This is often a fusion of GPS and sensors like IMUs &amp; odometers. It&apos;s all a fusion of sensors! It&apos;s often done using an Extended Kalman Filter. In my course ROBOTIC ARCHITECT, I have a complete lesson showing it, but for this example, let me just show you the gist:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.thinkautonomous.ai/blog/content/images/2025/01/unnamed.gif" class="kg-image" alt="The main types of sensors in Robotics &amp; Self-Driving Cars (and how much you should know about each)" loading="lazy" width="480" height="270"><figcaption><span style="white-space: pre-wrap;">EKF Fusion of GPS &amp; IMU</span></figcaption></figure><p>The Extended Kalman Filter takes:</p><ul><li>GPS</li><li>IMU</li><li>Odometer</li></ul><p>And it outputs the localization you can see on the Google Map above.</p><p>Alright, we covered a lot. To finish this article, let me answer a critical question:</p><h2 id="how-much-should-you-know-about-these-sensors-to-become-a-roboticsself-driving-car-engineer">How much should you know about these sensors to become a robotics/self-driving car engineer?</h2><p><strong>At the bare minimum, you should have a good idea of what these sensors are</strong>, how much they&apos;re involved, and what they do (what we&apos;re seeing in this article). There are other ideas you could explore, such as active sensors, which emit their own waves (LiDARs, RADARs, ...), versus passive sensors, which don&apos;t send anything into the environment.</p><p><strong>For beginner positions, it&apos;s probably a good idea to learn ONE of these sensors well.</strong> For example, the camera&#xA0;&#x2014;&#xA0;or the LiDAR&#xA0;&#x2014;&#xA0;or the GPS. When you specialize in one of these, you are also learning the algorithms, and starting to tell the difference between each type, and thus you start building expertise.</p><p><strong>For intermediate positions, I recommend understanding 2 or more sensors</strong>. If you understand the LiDAR AND the Camera, you can start doing some fusion of these for object detection projects. 
If you understand the GPS AND the IMU, same idea&#xA0;&#x2014; you dive into Kalman Filters, and thus have some good value to add.</p><p><strong>For advanced positions, the more the better</strong>. Experts have a good idea of the differences between sensor types. We discussed active and passive sensors, but there are also digital sensors vs analog sensors (one outputs discrete, binary data for computers, while the other outputs a continuous signal), or you could also dive into one sensor and master it really well.</p><h2 id="summary">Summary</h2><ul><li><strong>Sensors are categorized into two main types</strong>: exteroceptive sensors, which perceive the external environment, and proprioceptive sensors, which measure internal dynamics.</li><li><strong>Exteroceptive sensors are split into 3 categories:</strong> Perception, Localization, and Contact/Environmental sensors.</li><li><strong>Perception sensors are cameras, LiDARs, RADARs,</strong> thermal cameras, ultrasonics, infrareds, ... They work really well alone, but even better fused together to provide an accurate understanding of the surrounding objects.</li><li><strong>Localization sensors are sensors like GPS</strong>, RTK-GPS, or even workaround solutions like Ultrawide band.</li><li><strong>Environmental and contact sensors, such as audio sensors</strong>, humidity sensors, and gas sensors, help monitor the surroundings of autonomous vehicles and robots. In robotics, we also have all the contact sensors used when a robot makes physical contact.</li><li><strong>Proprioceptive sensors</strong>, such as position sensors and motion sensors, monitor how the car moves over time. They can be used together with external sensors like GPS.</li><li><strong>We also use all the automotive sensors</strong>, like tire pressure, oil, etc...</li></ul><p>Understanding and mastering the different types of sensors is important; refer to the part above to get the gist. There are a few types of sensors we didn&apos;t cover, because I wanted this article to focus on robotics &amp; self-driving cars, but these are the main ones.</p><h2 id="next-steps">Next Steps</h2><p>If you enjoyed this article on sensors &#x2014; you might love this one on the <a href="https://www.thinkautonomous.ai/blog/types-of-lidar/" rel="noreferrer">different types of LiDARs</a>. You may also be interested in this article on <a href="https://www.thinkautonomous.ai/blog/9-types-of-sensor-fusion-algorithms/" rel="noreferrer">Sensor Fusion</a>. </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">If you want to learn more about sensors, and especially autonomous tech sensors like LiDARs, RADARs, or 3D Cameras &#x2014;</strong></b>&#xA0;I&apos;m talking about all of this through my private daily emails. These are emails I send every day to an audience of 10,000+ Engineers, and they help engineers learn advanced technical content, through stories, mindmaps, and frequent tips.<br><br><a href="https://www.thinkautonomous.ai/private-emails?ref=thinkautonomous.ai">Subscribe here and join 10,000+ Engineers!</a></div></div><p></p>]]></content:encoded></item></channel></rss>