Stereo Vision for Self-Driving Cars is becoming more and more popular these days. The field of Computer Vision has grown intensely over the last decade, especially in obstacle detection and Deep Learning.
Obstacle detection algorithms such as YOLO or RetinaNet provide 2D Bounding Boxes that give the obstacles’ position in an image. Today, most object detection algorithms are based on monocular RGB cameras and cannot return the distance of each obstacle.
To return the distance of each obstacle, engineers fuse the camera with LiDAR (Light Detection And Ranging) sensors, which use lasers to return depth information. The outputs from Computer Vision and LiDAR are then combined using Sensor Fusion.
The problem with this approach is the use of LiDAR, which is expensive. One useful trick engineers use is to align two cameras and use geometry to compute the distance of each obstacle. We call that new setup a Pseudo-LiDAR.
Pseudo-LiDAR leverages geometry to build a depth map and combines it with object detection to get the distance in 3D.
Depth estimation in 5 Steps
From 2 cameras, we can retrieve the distance of an object. This is the principle of triangulation, and this is the core geometry behind Stereo Vision. Here's how it works:
- Stereo Calibration - Retrieve the key parameters from the camera
- Epipolar Geometry - Define the 3D Geometry of our Setup
- Disparity Mapping - Compute the Disparity Map
- Depth Mapping - Compute a Depth Map
- Obstacle Distance Estimation - Find objects in 3D, and Match with the Depth Map
In this article, we'll learn how to do these 5 steps to build a 3D Object Detection Algorithm. The goal is that, for each object, we can estimate the X, Y, Z position.
1. Stereo Calibration - Intrinsic and Extrinsic Calibrations
When looking at any image on the internet, it's likely that the camera has been calibrated. Every camera needs calibration. Calibration means finding the parameters that convert a 3D point (in the world) with [X,Y,Z] coordinates to a 2D Pixel with [X,Y] coordinates.
The output of this step is simple: we need the camera intrinsic and extrinsic parameters. These will be used later to retrieve the distance.
How are images created?
Cameras today are described with the Pinhole Camera model. The idea is to use a pinhole to let only a small number of light rays into the camera and thus get a clear image.
Today, cameras use a lens to zoom and get better clarity. The lens is located at a distance f from the sensor. This distance f is called the focal length.
A few lines ago, I mentioned that the goal of camera calibration is to find the intrinsic and extrinsic parameters. I also said that the goal of calibration is to help us take a 3D Point, and convert it into a Pixel, thus creating an image.
So, here's how camera calibration works in one image:
Extrinsic Calibration is the conversion from World Coordinates to Camera Coordinates. We're basically saying "Here's a point in 3D in a specific coordinate frame. What would be the coordinates of this point if we looked at it from the camera frame?". A point in the world is rotated to the camera frame, and then translated to the camera position. The extrinsic parameters are called R (rotation matrix) and T (translation vector).
Here is the formula: X_camera = R · X_world + T, where X_world is the 3D point in world coordinates and X_camera is the same point in camera coordinates.
Intrinsic Calibration is the conversion from Camera Coordinates to Pixel Coordinates. Once we have the 3D point in the camera frame, we use the intrinsic parameters to convert it into a Pixel. This is done using the focal length. The intrinsic parameters are stored in a matrix we call K.
Here is the formula for world to image conversion:
K is the intrinsic matrix. It comprises f, the focal length, and (u₀,v₀), the optical center: these are the intrinsic parameters.
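Written out under the assumptions the text makes (a single focal length f expressed in pixels, no skew, optical center (u₀, v₀)), the intrinsic matrix and the camera-to-pixel mapping can be sketched as follows. The symbols X_c, Y_c, Z_c for the point in the camera frame are my own notation, not something fixed by the article:

```latex
K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix},
\qquad
u = f \, \frac{X_c}{Z_c} + u_0,
\qquad
v = f \, \frac{Y_c}{Z_c} + v_0
```

The division by Z_c is what makes far away objects look smaller in the image.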
So, we now understand that given a point in the world, we can convert that to the camera frame using the extrinsic calibration, and then to a pixel using the intrinsic calibration.
Here is the final formula we use:
You may notice that the extrinsic matrix has been modified. This is because matrix multiplication requires the matrix shapes to match, which wasn't the case. We thus moved to Homogeneous Coordinates. You can learn more about the full formula in my course on Stereo Vision.
Next, let's see how it works with OpenCV.
Camera Calibration: Stereo Vision & OpenCV
Generally, we use a checkerboard and automatic algorithms to perform calibration. When we do it, we tell the algorithm that a point on the checkerboard (e.g. 0,0,0) corresponds to a pixel in the image (e.g. 545, 343).
For that, we must take images of the checkerboard with the camera. After enough images and points, a calibration algorithm determines the calibration matrix of the camera by minimizing a least-squares loss.
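To give a feel for how "3D point ↔ pixel" pairs plus least squares can yield camera parameters, here is a sketch of the Direct Linear Transform (DLT). This is a simpler, swapped-in technique and not the checkerboard pipeline itself: it needs at least 6 non-coplanar points, and it estimates a full 3x4 projection matrix up to scale. The data is synthetic and every name and value is hypothetical:

```python
import numpy as np

def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) from 3D-2D correspondences."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        # Each correspondence gives two linear equations in the 12 entries of P
        rows.append([*Xh, 0.0, 0.0, 0.0, 0.0, *[-u * c for c in Xh]])
        rows.append([0.0, 0.0, 0.0, 0.0, *Xh, *[-v * c for c in Xh]])
    A = np.array(rows, dtype=float)
    # Least-squares solution of A p = 0 with ||p|| = 1:
    # the right singular vector with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

def project(P, points_3d):
    """Project Nx3 points with a 3x4 projection matrix, returning Nx2 pixels."""
    homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homogeneous @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

# Synthetic ground truth (hypothetical values)
rng = np.random.default_rng(0)
points_3d = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(12, 3))
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.0]])])
points_2d = project(P_true, points_3d)

P_est = estimate_projection_dlt(points_3d, points_2d)
reprojection_error = np.abs(project(P_est, points_3d) - points_2d).max()
print(reprojection_error)
```

The reprojection error (how far the re-projected points land from the observed pixels) is the usual way to judge whether a calibration is good.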
Calibration is also necessary to remove image distortion. Real lenses introduce distortion that the ideal pinhole model does not account for, such as the “GoPro Effect”. This distortion can be radial or tangential. Calibration estimates it, which lets us undistort an image.
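As an illustration of what "radial" and "tangential" mean, here is a small sketch of the polynomial distortion model commonly used for this (the same form OpenCV documents, truncated here to two radial coefficients). It works on normalized image coordinates, meaning coordinates before K is applied. The function name and the coefficient values are hypothetical:

```python
def distort(x, y, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to normalized coordinates."""
    r2 = x * x + y * y                         # squared distance to the optical axis
    radial = 1.0 + k1 * r2 + k2 * r2 * r2      # radial scaling factor
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# A negative k1 pulls points toward the center (barrel distortion)
x_d, y_d = distort(0.5, 0.0, k1=-0.2)
print(x_d, y_d)
```

Calibration estimates these coefficients, and undistortion then applies the inverse of this mapping to every pixel.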
In my course on Stereo Vision, I go through the fundamentals of calibration in mono and stereo mode. We also see how to undistort images, and how to compute additional matrices such as the Essential and Fundamental matrices, used for applications such as 3D Reconstruction. Here's the link to learn more.
For now, to learn more about calibration, follow this link.
👉 At the end of the calibration process, you have two rectified images, with the parameters K, R, and T:
2. Epipolar Geometry - Stereo Vision
Stereo Vision is about finding depth based on two images. Our eyes are similar to two cameras. Since they look at a scene from different angles, they can compute the difference between the two points of view and establish a distance estimation.
In a stereo setup, we have two cameras, generally aligned at the same height. So, how can we use such a setup to compute depth geometrically?
How can stereo cameras estimate depth?
Imagine you have two cameras, a left and a right one. These two cameras share the same Y and Z coordinates. Basically, the only difference is their X value.
Now, have a look at the following stereo setup, with two cameras CL (camera left) and CR (camera right) looking at an obstacle O. With geometry, we’ll find its distance.
Our goal is to estimate the Z value, the distance, for the O point (representing any pixel in an image).
- X is the alignment axis
- Y is the height
- Z is the depth
- xL corresponds to the point in the left camera image. xR is the same for the right image.
- b is the baseline: the distance between the two cameras.
Taking the left and the right camera in turn and applying the Similar Triangles theorem, we arrive at two equations (with X measured from the left camera and f the focal length):
- For the left camera: Z = f · X / xL
- For the right camera: Z = f · (X - b) / xR
When we do the maths, we can quickly arrive at Z, and can even derive X and Y.
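Here is a short sketch of that derivation in code. Similar triangles give xL = f·X/Z for the left camera and xR = f·(X - b)/Z for the right one. Subtracting the two gives xL - xR = f·b/Z, hence Z = f·b/(xL - xR), and X and Y follow from the left camera alone. I assume the left camera sits at the origin, the images are rectified, and pixel coordinates are measured from the optical center. The function name and the numbers are hypothetical:

```python
def triangulate(x_left, x_right, y_left, focal_px, baseline_m):
    """Recover (X, Y, Z) in the left camera frame from a rectified stereo match."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    Z = focal_px * baseline_m / disparity   # depth from similar triangles
    X = x_left * Z / focal_px               # back-project through the left camera
    Y = y_left * Z / focal_px
    return X, Y, Z

X, Y, Z = triangulate(x_left=50.0, x_right=15.0, y_left=-20.0,
                      focal_px=700.0, baseline_m=0.5)
print(X, Y, Z)
```

Plugging the result back into the right camera equation, f·(X - b)/Z, should give back x_right, which is a handy sanity check.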
3. Stereo Disparity & Mapping
What is disparity?
Disparity is the difference in image location of the same 3D point from 2 different camera angles.
Concretely, if I take the side mirror in the left image at pixel (300, 175), I can find it in the right image at pixel (250, 175).
In this example, xL = 300 and xR = 250. The disparity is defined as xL - xR, here 50 pixels. In practice, it is estimated by passing the two images to a stereo matching function.
👉 Thanks to stereo vision, we can estimate the depth of any object, assuming we do the correct matrix calibration.
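The formula is as follows: d = xL - xR for the disparity, and Z = f · b / d for the depth, where f is the focal length and b the baseline.
Compute the disparity for every pixel, and you get a disparity map! As you can see, close objects appear lighter, while far away objects are represented in darker colors. We already have a sense of depth!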
Why “epipolar geometry”?
To compute the disparity, we must match every pixel of the left image to its corresponding pixel in the right image. This is called Stereo Matching.
To solve this problem:
- Take a pixel in the left image
- Now, to find this pixel in the right image, simply search for it along the epipolar line. There is no need for a 2D search: the point should be located on this line, so the search is narrowed to 1D.
As in the example above, the mirror is at the same height in both images because of the stereo calibration and rectification. We have a 1-dimensional search only. This is because the cameras are aligned along the same axis.
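The 1D search can be sketched with a very basic local approach: take a small patch around the left pixel, slide along the same row in the right image, and keep the shift with the lowest sum of absolute differences (SAD). This is an illustrative toy on synthetic data, not a production matcher. The function name, window size, and maximum disparity are assumptions of mine:

```python
import numpy as np

def match_along_row(left, right, y, x, half_window=3, max_disparity=16):
    """Return the disparity minimizing SAD between a left patch and right patches on row y."""
    patch_left = left[y - half_window:y + half_window + 1,
                      x - half_window:x + half_window + 1].astype(np.float64)
    best_disparity, best_cost = 0, np.inf
    for d in range(max_disparity + 1):
        x_right = x - d                    # the match sits further left in the right image
        if x_right - half_window < 0:      # stop when the patch would leave the image
            break
        patch_right = right[y - half_window:y + half_window + 1,
                            x_right - half_window:x_right + half_window + 1].astype(np.float64)
        cost = np.abs(patch_left - patch_right).sum()   # sum of absolute differences
        if cost < best_cost:
            best_disparity, best_cost = d, cost
    return best_disparity

# Synthetic pair: the right image is the left one shifted 5 pixels to the left
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(32, 64), dtype=np.uint8)
right = np.roll(left, -5, axis=1)

found = match_along_row(left, right, y=16, x=40)
print(found)
```

Because the images are rectified, the loop only moves along x. Without rectification, the same search would have to cover a 2D neighborhood.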
The Epipolar Line
How does Stereo Matching work?
The correspondence problem, also known as epipolar search, can be solved in many different ways:
- Using Local Approaches
- Using Global Approaches
- Using Semi-Global Approaches
- Using Deep Learning
OpenCV's basic functions can solve the problem, but will be less precise than current Deep Learning approaches. In my course MASTER STEREO VISION, we learn how to tackle this problem, and we read tons of research papers to understand how to do stereo matching properly.
Now, given our two initial images, here's what we've got:
4. Stereo Vision - From Disparity to Depth Maps
👉 We have two disparity maps, which basically tell us the shift in pixels between the two images. We also have, for each camera, a projection matrix: P_left and P_right.
In order to estimate the depth, we’ll need to recover K, R, and t from these matrices.
They are related by the formula: P = K [R | t]
An OpenCV function called cv2.decomposeProjectionMatrix() can do that and get K, R, and t from P, for each camera.
It is now time to generate the depth map.
The depth map will tell us the distance of each pixel in an image, using the other image and the disparity map.
The process is the following:
- Get the focal length 𝑓 from the 𝐾 matrix
- Compute the baseline 𝑏 using corresponding values from the translation vectors 𝑡
- Compute the depth map of the image from the calculated disparity map d, using our formula from before: Z = f · b / d
We do that computation for each pixel, and we get a depth map!
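The three steps above can be sketched in NumPy as follows. I assume here that t_left and t_right are plain 3D camera positions in meters. OpenCV documents the translation output of cv2.decomposeProjectionMatrix() as a 4x1 (homogeneous) vector, so it would need dividing by its last component first. Clamping invalid disparities to a small positive value is a choice I made for the sketch, and all names and values are hypothetical:

```python
import numpy as np

def depth_from_disparity(disparity, K_left, t_left, t_right, min_disparity=0.1):
    """Convert a disparity map (in pixels) to a depth map with Z = f * b / d."""
    focal_px = K_left[0, 0]                    # focal length from the K matrix
    baseline = abs(t_right[0] - t_left[0])     # distance between cameras along X
    # Assumption: zero or negative disparities are invalid; clamp them to avoid dividing by 0
    safe_disparity = np.where(disparity > min_disparity, disparity, min_disparity)
    return focal_px * baseline / safe_disparity

K_left = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
t_left = np.array([0.0, 0.0, 0.0])
t_right = np.array([0.54, 0.0, 0.0])           # hypothetical 54 cm baseline

disparity = np.array([[50.0, 25.0],
                      [0.0, -1.0]])            # the bottom row holds invalid matches
depth = depth_from_disparity(disparity, K_left, t_left, t_right)
print(depth)
```

Halving the disparity doubles the depth, which is why small matching errors hurt most on far away objects.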
5. Estimating depth of an obstacle
We have a depth map for each camera! Now, imagine we combine this with an obstacle detection algorithm such as YOLO. Such algorithms will return, for every obstacle, a bounding box with 4 numbers: [x1; y1; x2; y2]. These numbers represent the coordinates of the upper left point and the bottom right point of the box.
We can run this algorithm on the left image for example, and then use the left depth map.
Now, in that bounding box, we can take the closest point. We know it, because we know the distance of every single point of the image thanks to the depth map. The closest point in the bounding box will be our distance to the obstacle.
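As a minimal sketch of this last step, assuming a depth map stored as a 2D NumPy array in meters and a box in the [x1, y1, x2, y2] format described above, we can crop the depth map and take the minimum. The function name and the synthetic depth map are hypothetical:

```python
import numpy as np

def obstacle_distance(depth_map, box):
    """Return the closest depth value inside a bounding box [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    height, width = depth_map.shape
    # Clip the box to the image so slicing never goes out of bounds
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    region = depth_map[y1:y2, x1:x2]
    if region.size == 0:
        raise ValueError("bounding box does not overlap the depth map")
    return float(region.min())

# Synthetic scene: background at 50 m, one obstacle at 22.75 m
depth_map = np.full((100, 200), 50.0)
depth_map[40:60, 80:120] = 22.75

distance = obstacle_distance(depth_map, [70, 30, 130, 70])
print(distance)
```

On real depth maps, a single noisy pixel can dominate the minimum, so a median or a low percentile over the box may be a more robust choice.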
Boom! We just built a Pseudo-LiDAR!
Thanks to stereo vision, we know not only the obstacles in the image, but also their distance to us! This obstacle is 22.75 meters away from us!
Stereo Vision Applications - How do you use Stereo Vision?
Stereo Vision turns 2D Obstacle Detection into 3D Obstacle Detection using simple geometry and one additional camera. Today, most emerging “edge” platforms consider Stereo Vision, such as the new OpenCV AI Kit or integrations on Raspberry Pi and NVIDIA Jetson boards.
The simple and elegant thing is that you can get started in Stereo Vision using OpenCV... and then get much deeper by adding Neural Networks. Since it's mostly geometry, the only place where we'll use Neural Networks is in the disparity search (Step 3).
In terms of cost, it stays relatively cheap compared to using a LiDAR and still offers great performance. We call that “Pseudo-LiDAR” because it can replace LiDAR in its functions: detect obstacles, classify them, and localize them in 3D.
It doesn't stop there. In the Stereo Vision course, we also do much more advanced things such as 3D Reconstruction. If we have the distance of every pixel, we can recreate a 3D Point Cloud, as follows:
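A rough sketch of the idea, assuming a depth map in meters and the intrinsic matrix K of the same camera: each pixel (u, v) with depth Z is back-projected with X = (u - u₀)·Z/f and Y = (v - v₀)·Z/f. The function name and the tiny synthetic depth map are hypothetical:

```python
import numpy as np

def depth_to_point_cloud(depth_map, K):
    """Back-project every pixel of a depth map to an (N, 3) array of [X, Y, Z] points."""
    f_x, f_y = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]
    height, width = depth_map.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))   # pixel grid
    X = (u - u0) * depth_map / f_x
    Y = (v - v0) * depth_map / f_y
    return np.stack([X, Y, depth_map], axis=-1).reshape(-1, 3)

# Tiny synthetic example: a flat wall 7 m away seen by a 5x3 pixel camera
K = np.array([[700.0, 0.0, 2.0], [0.0, 700.0, 1.0], [0.0, 0.0, 1.0]])
depth_map = np.full((3, 5), 7.0)

cloud = depth_to_point_cloud(depth_map, K)
print(cloud.shape)
```

The resulting array can be handed to any point cloud viewer, which is what makes the output look like LiDAR data.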
Stereo Vision is also a good alternative to LiDAR. In my article “Stereo Vision vs Sensor Fusion: Which approach is better?”, I compare how I use both approaches, and the results!