IF YOU'RE A COMPUTER VISION ENGINEER, I WANT TO SHOW YOU HOW YOU CAN...

BECOME A NEXT-GEN COMPUTER VISION ENGINEER

349€ or 2 payments of 175€

This year, I am bringing back the infamous BLACK SEASONS, where cutting-edge engineers like you will be able to build MASSIVE skills in autonomous tech. And this week in particular, I'm launching my brand new VIDEO PERCEPTION COURSE, in which you'll learn to become a...

"Next-Gen Computer Vision Engineer"

Mhh... Okay.

So what is a "Next-Gen" Computer Vision Engineer?

Well, the best way to explain is to show you job offers that mention "Computer Vision" and come from companies truly building the future.

Take this first one from Facebook AI Research (FAIR).

Did you see how the Computer Vision Engineer job requires you to know how to work with semantics of data, including images, video, text, audio, speech, and other modalities?

Isn't it surprising that a Computer Vision Engineer is required to know how to process text and audio?

It makes sense, doesn't it?

Now, take this other job offer at Apple:

"Want to ship amazing experiences in Apple products? Be part of the team in the Video Computer Vision (VCV) organization..."

Now, there is a Video Computer Vision team at Apple!

But that isn't even what's surprising. What's surprising is that...

The Job Title doesn't mention video!

Just Computer Vision. Like the previous one.

And just like these 3 below:

Computer Vision at Senseye (industrial computer vision) requires deep expertise in computer vision, particularly with video or camera-based systems.

Computer Vision at DreamVu (Industrial 3D Vision) demands expertise in topics related to 3D and Optical Flow/Motion Tracking.

The Computer Vision job at Waymo (robotaxis) mentions 'multimodal models' in its very first line - models that can process text, videos, audio, lidar points, ...

Now you see what I mean, right? The Computer Vision Engineer Job is evolving.

It's no longer 2017. None of these jobs mention the ability to process images with OpenCV. That has become such a basic fundamental that the leading computer vision companies leave it out. Basic applications like object detection are also left out, and I predict that in the near future, most computer vision engineers will be processing multiple types of data... video being the most important.

With this, you'll notice that video is NOT reserved exclusively for "video" edge cases like retail analysis, sports analysis, people tracking, and so on... That used to be the case back in 2017, when video meant a handful of dedicated use cases, but in the near future, videos will be used even for very common applications.

This is partly why VIDEO PERCEPTION is not a general multimodal course, but rather a course about the heart of Computer Vision: Videos. 

And when you pay close attention not only to job offers, but also to startups' architectures, you'll notice they have ALL switched to videos.

For example, here are 4 leading startups in the self-driving car industry, and their recently published architectures:

Waymo, Wayve, Nvidia and Tesla show video-first architectures

Today's algorithms are video-first. If, as a computer vision engineer, you fail to understand videos, sequences, and spatio-temporal fusion, and at best stick to "frame-by-frame" tracking, you risk missing out on building the future.

So now that you understand that (1) video is the future of computer vision, (2) most leading companies are using it, and (3) video is far from being just a few edge cases like retail or sports analysis - it's taking over the entire world...

Let me show you the course I am launching this Black Seasons:

​Meet...

LEARN VIDEO PERCEPTION

Build Next-Gen Computer Vision Skills

  • Advanced Knowledge: Learn Next-Gen Computer Vision, using Video Transformers, Text Processing, Foundation Models, and more...
  • Cutting-Edge Projects: Build & Train your own vision systems on motion detection, action classification, and video QA.
  • Research: Understand how to tie your skills to Visual SLAM, End-To-End Learning, and many other advanced autonomous driving research skills.

349€ or 2 payments of 175€

This course is made of 3 modules, so let's take a look at what's inside each of them...

MODULE I

Optical Flow & Video Analysis

In the first module, you'll work on video analysis, particularly in the context of motion. This covers everything from frame-by-frame differencing, to Optical Flow, to motion histograms, to more robust activity detection techniques.
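To give a flavor of the frame-differencing side of this module, here is a minimal NumPy sketch on a synthetic clip (an illustration under my own assumptions, not the course's actual code):

```python
import numpy as np

def motion_energy(frames, threshold=25):
    """Per-frame motion energy from absolute frame differencing.

    frames: (T, H, W) uint8 grayscale video. Returns a length T-1 array
    counting pixels whose intensity changed by more than `threshold`
    between consecutive frames.
    """
    diffs = np.abs(frames[1:].astype(np.int16) - frames[:-1].astype(np.int16))
    return (diffs > threshold).sum(axis=(1, 2))

# Synthetic clip: a bright 10x10 square jumps position halfway through.
frames = np.zeros((10, 64, 64), dtype=np.uint8)
frames[:5, 10:20, 10:20] = 255
frames[5:, 40:50, 40:50] = 255

energy = motion_energy(frames)
# Only the transition (frame 4 -> 5) shows motion: 100 pixels turn off
# and 100 turn on, so energy[4] == 200 and every other entry is 0.
```

This is the simplest possible motion signal; the module builds from here toward Optical Flow and motion histograms, which capture *direction* as well as presence of motion.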

What's included:

  • 5 ways video perception engineers use time as a source of information for advanced applications (none of them involves tracking, which would be an extra technique)
  • An introduction to Optical Flow, and the real reason why it's used by companies like Tesla or in certain SLAM systems (Hint: it's not for "tracking objects" or "detecting motion" - it solves a different problem entirely)
  • 3 adjustments you must make to Optical Flow algorithms in order to use optical flow for speed estimation (Most flow-based speed estimation demos you've seen online are wrong. We'll see how self-driving teams actually do it)
  • Why most Visual SLAM and Odometry algorithms rely on depth + feature tracking instead of optical flow (There’s one thing SLAM needs that flow will never provide)
  • Mini-Project: Build your own Feature Tracking & Optical Flow system on real self-driving car footage (Waymo)
  • What self-driving systems do to extend optical flow into perception tasks (This is where flow stops being a visualization, and starts being a signal used for decision-making)
  • Advanced Project: Implement 2 action detection algorithms using no Deep Learning, Object Detection, or heavy algorithms

Let's take a break here:

An example of what you'll build is "Activity Detection", where you'll analyze when and where an activity happens in a video sequence.
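The "when" part of that analysis can be sketched in a few lines: given a per-frame motion-energy signal, threshold it and extract the active intervals. (A toy illustration with made-up numbers, not the course's implementation:)

```python
def activity_intervals(energy, threshold):
    """Return (start, end) frame index pairs where energy exceeds threshold."""
    intervals, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i                          # an activity begins
        elif e <= threshold and start is not None:
            intervals.append((start, i - 1))   # the activity just ended
            start = None
    if start is not None:                      # activity runs to the last frame
        intervals.append((start, len(energy) - 1))
    return intervals

# Hypothetical motion-energy signal over 10 frames:
energy = [0, 2, 40, 55, 3, 0, 60, 70, 65, 1]
print(activity_intervals(energy, threshold=10))  # [(2, 3), (6, 8)]
```

Real systems replace the raw threshold with smoothing, hysteresis, or windowed statistics, but the temporal-localization idea is the same.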

So this is Module 1, after which you'll have a very good understanding of motion, frame-by-frame processing, window processing, event detection, and more...

Next, let's see Module 2:

MODULE II

Action Classification with
Video Transformers

In the second module, we'll talk about action classification. One of the best things you can do in computer vision is to classify videos or sequences, live. In this module, you'll learn how to use the full range of Deep Learning, from CNNs+RNNs (2014) to advanced Video Transformers (2025), to classify videos.
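As a rough sketch of the 2014-style CNN+RNN idea mentioned above, here is a toy PyTorch classifier: a tiny CNN extracts per-frame features and an LSTM fuses them over time. The architecture and sizes are illustrative assumptions, not the course's model:

```python
import torch
import torch.nn as nn

class CnnRnnClassifier(nn.Module):
    """CNN-per-frame features fed to an LSTM, 2014-style video classification.

    Toy sketch: a tiny untrained backbone; real systems use a strong
    (often pretrained) CNN and deeper recurrent stacks.
    """
    def __init__(self, num_classes=5, feat_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                 # (B, T, feat_dim)
        out, _ = self.rnn(feats)                     # temporal fusion
        return self.head(out[:, -1])                 # logits from last step

model = CnnRnnClassifier()
logits = model(torch.randn(2, 8, 3, 32, 32))         # 2 clips, 8 frames each
```

The key design choice, which carries over to transformers, is the split between a spatial feature extractor and a temporal fusion module.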

What's included:

  • Why computer vision engineers don't feed images directly to a Transformer (one operation should always come first; many self-taught engineers skip it - but the entire industry applies it)
  • Why I was told NOT to use 3D CNNs for video classification back in 2019, and why researchers still avoid them nowadays (actually the 3D Convolution itself is NOT a bad idea and can be used with Video Transformers, we'll see more details...)
  • The 'Slow-Fast' approach to process videos using CNNs (this 2018 idea is so good I recently discovered it got implemented in a 2024 State-Of-The-Art approach for Video Understanding)
  • Vision Transformers, from 101 to advanced: understanding Multi-Head Attention, Positional Encoding, and Embeddings (we'll also see insider knowledge, such as how many layers a transformer should have, and more...)
  • How to turn an image transformer into a video transformer
  • Shoplifting Detection Project: Build a transformer for action detection in surveillance videos (you will build it nearly from scratch, and implement a lot of interesting ideas, such as heatmap visualization)
  • The 3 types of Spatio-Temporal fusion a Video Transformer should use, and how to implement them (the idea of space and time is so important nowadays we'll spend a lot of time on it)
  • A deep dive into Tesla's Video Transformers for Occupancy Flow Prediction
  • And many more...
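To illustrate one way an image transformer can be extended to video - factorized space-time attention, where spatial layers run per frame and temporal layers run across frames - here is a toy PyTorch sketch. Positional encodings are omitted for brevity; this shows one of several fusion strategies, not any specific paper's exact architecture:

```python
import torch
import torch.nn as nn

class FactorizedVideoTransformer(nn.Module):
    """Factorized space-time attention (toy sketch).

    Each frame is patchified and processed by spatial attention layers;
    the resulting per-frame tokens are then fused by temporal attention.
    Positional encodings are omitted to keep the sketch short.
    """
    def __init__(self, dim=64, num_classes=5, patch=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        spatial = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        temporal = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, num_layers=2)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                    # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        x = self.patchify(video.flatten(0, 1))   # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B*T, N patches, dim)
        x = self.spatial(x).mean(dim=1)          # one token per frame
        x = x.view(b, t, -1)                     # (B, T, dim)
        x = self.temporal(x).mean(dim=1)         # one token per clip
        return self.head(x)

model = FactorizedVideoTransformer()
logits = model(torch.randn(2, 8, 3, 32, 32))     # 2 clips, 8 frames each
```

Factorizing attention this way keeps the cost linear in T for the spatial part, which is why several video transformer families prefer it over full joint space-time attention.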

MODULE III

Video Foundation Models &
Advanced Video Perception

In module 3, we'll study all possible types of video processing, and then work with foundation models. You will learn how to build VideoQA solutions like Wayve's LINGO-1, and understand the core building blocks of AV 3.0.

What you'll learn:

  • The 4 Pillars of AV 3.0 (according to Nvidia's head of AI research) and why it wouldn't be possible without video perception
  • Why nobody uses forecasting from videos (and what they use instead)
  • What qualifies a Foundation Model in the robotics space (and when to use general purpose algorithms versus more specific networks)
  • How to 'plug' LLMs to End-to-End self-driving architectures and build applications such as Wayve's LINGO QA.
  • How to engineer your own perception→reasoning systems using LLMs, Transformers, and Foundation Models
  • How Video Generation models like SORA work, and how they're used in self-driving cars
  • and many more... including how modern video algorithms work for intent prediction, diffusion pipelines, video segmentation, and more...
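As a rough picture of what "plugging" a language model into a perception stack looks like, here is a minimal perception→reasoning pipeline skeleton. Every component is a hypothetical stub: `perceive` and `answer` stand in for a real detector and a real LLM call, and the labels are invented for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    confidence: float

def perceive(frame) -> List[Detection]:
    """Stand-in for a video perception model (detector/tracker)."""
    # Hypothetical fixed output; a real system would run a network here.
    return [Detection("pedestrian", 0.92), Detection("traffic_light_red", 0.88)]

def to_prompt(detections: List[Detection]) -> str:
    """Serialize perception output into text an LLM can reason over."""
    scene = ", ".join(f"{d.label} ({d.confidence:.2f})" for d in detections)
    return f"Scene contents: {scene}."

def answer(prompt: str, question: str) -> str:
    """Stand-in for an LLM call; here, a trivial rule-based reply."""
    if "stop" in question and "traffic_light_red" in prompt:
        return "Yes: the light is red and a pedestrian is present."
    return "No action needed."

prompt = to_prompt(perceive(frame=None))
reply = answer(prompt, "Should the car stop?")
```

The interface matters more than the stubs: perception emits structured observations, a serializer turns them into language, and the reasoning model consumes that text. Real VideoQA systems replace the serializer with learned visual embeddings fed directly into the language model.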

A Sneak Peek Inside the Course

Frequently Asked Questions

Who is this course for and not for?

This course is not only advanced, but it's also not for everyone. So let me help you decide.

  • If you don't validate any of the mandatory prerequisites, you're too early
  • If you believe video = just more tasks, you're fine, but you need to re-read the introduction to this page.
  • If you judge a course by its duration, you're in the wrong place. I make short, straight-to-the-point courses, so you walk away with skills as fast as possible
  • If you have 0 computer vision experience, I recommend working on that first.

Sounds good? Okay, so this means that...

  • If you're already a Computer Vision Engineer, and you want to be aware of the latest technologies, this course is for you.
  • If you want to be part of the elite of computer vision engineers working on next-gen products, this course is for you.
  • If you believe that continuously improving yourself can transform your life, this course is for you.

What is the format?

This is a self-study online course, which contains videos, articles, drawings, paper analysis, code, projects, and more...

How long does the course take to complete?

The course is estimated at 5-7 hours, depending on whether you just want to watch the content or do the projects as well.

Any prerequisites before joining?

Yes, mostly basic Computer Vision, including:

  • Coding in Python
  • Deep Learning Basics with PyTorch (Backpropagation, MLPs, Optimizers, ...)
  • Convolutional Neural Networks, OpenCV, ...
  • High-School Maths (derivatives, sin/cos, ...)
  • Nice to Have: 'Beginner+' Computer Vision (Depth Estimation, Segmentation, Object Detection, ...), End-To-End Learning

What if I'm stuck?

Our Think Autonomous 2.0 platform is optimized for collaboration, chat, support, answers, and community learning. In fact, some assignments will be done this way!

THE FINISH LINE

STUDENTS OF THINK AUTONOMOUS SAY

SAMPLE REVIEWS FROM THE OPTICAL FLOW COURSE [2022-2024]

Abdelrahman Ahmad, Computer Vision Researcher Engineer

Isaac Berrios, Electrical Engineer @Boeing

Aman Vyas, Master's Student in Robotics and Autonomous Systems, University of Turku

Why is this course Unique? ☄️

This course is unique because it's next-gen.

While most computer vision courses do a walkthrough of the fundamentals of Computer Vision, and dive into approaches that do not work today... this course focuses on the future.

As an engineer trying to build 'next-gen' skills, knowing exactly what is going on in the research field and inside self-driving car startups gives you an appeal most don't have.


LEARN VIDEO PERCEPTION

Build Next-Gen Computer Vision Skills

  • Advanced Knowledge: Learn Next-Gen Computer Vision, using Video Transformers, Text Processing, Foundation Models, and more...
  • Cutting-Edge Projects: Build & Train your own vision systems on motion detection, action classification, and video QA.
  • Research: Understand how to tie your skills to Visual SLAM, End-To-End Learning, and many other advanced autonomous driving research skills.

349€ or 2 payments of 175€


© Copyright 2025 Think Autonomous™. All Rights Reserved.