Meta's CoTracker 3: A Leap in AI Object Tracking

Meta's CoTracker 3 simplifies complex object tracking for robotics, VR, and more with unprecedented efficiency. Listen to the full episode to learn more.

TL;DR

Meta's CoTracker 3 achieves state-of-the-art object tracking using 1000x less training data. By leveraging knowledge distillation and real-world videos, it's paving the way for advancements in robotics, autonomous driving, and VR. #CoTracker3 #AI #VentureStep

INTRODUCTION

Imagine a world where robots can seamlessly navigate a busy street, where virtual reality feels indistinguishable from the real world, and where your favorite AR filter perfectly tracks your every move. The foundational technology behind these innovations is point tracking—the ability to identify and follow objects frame-by-frame in a video. While crucial, this technology has historically been complex, resource-intensive, and limited by a lack of real-world training data.

In this episode of Venture Step, host Dalton Anderson dives deep into Meta AI's groundbreaking research paper on CoTracker 3, a new model that is revolutionizing the field. Dalton breaks down the evolution of tracking models, from early deep learning approaches like PIPS and TAPIR to the new, simplified architecture of CoTracker 3. This isn't just an incremental improvement; it's a fundamental shift in how tracking models are built and trained.

Dalton explores the core challenges of tracking—like objects being hidden from view or changing appearance—and explains how CoTracker 3's clever use of joint tracking and knowledge distillation overcomes them. The result is a model that is not only more accurate but astonishingly more efficient, achieving superior performance while training on 1000 times less data than its predecessors. This leap forward has profound implications for autonomous driving, robotics, video editing, and beyond.

KEY TAKEAWAYS

  • Radical Efficiency: CoTracker 3 achieves state-of-the-art results while training on just 15,000 videos, a staggering 1000x fewer than the 15 million videos required by competitor models.
  • Smarter Training with Knowledge Distillation: Instead of relying on limited and expensive synthetic data, CoTracker 3 uses a "student-teacher" method. Multiple pre-trained "teacher" models label real-world videos, allowing the "student" model (CoTracker 3) to learn from their combined strengths and become more generalized.
  • Human-Like Vision: The model uses "joint tracking" to understand that different points on a single object (like the wheels and roof of a car) are part of one structure and must move together, drastically improving tracking when objects are partially hidden.
  • The Power of 4D Correlation: CoTracker 3 goes beyond 3D (height, width, time) by adding a fourth dimension: the correlation between different tracks. This allows it to better understand motion patterns over time.
  • Broad Real-World Impact: This technology is a critical building block for the next generation of autonomous driving, advanced robotics, immersive VR/AR experiences, and sophisticated video special effects.

FULL CONVERSATION

What Is Point Tracking?

Dalton: Welcome to the Venture Step podcast where we discuss entrepreneurship, industry trends, and the occasional book review. Imagine a world where we could seamlessly track objects. Why do you think that's important? Think robotics. Think about your favorite dog emoji filter on Snapchat and Instagram. In this episode, we're gonna be discussing Meta's CoTracker 3 research paper. Two episodes ago, I talked about Meta's shipping spree. Within that episode, I had a live demo of CoTracker 3 with their online model. If you're interested, please check out that episode. In this episode, we're gonna be diving into their research paper and discussing some of the things that they're solving and why it's pretty cool.

Dalton: The first thing I'd like to touch on is what is point tracking? I talked about it two episodes ago. We're going to touch on it again.

Point tracking is exactly what it sounds like: tracking an object within a video or an image and understanding its interactions frame by frame.

Dalton: What does that mean? Well, in the live demo, what CoTracker did was track a man walking down a street. And the way that it tracked this person was it put down a grid of points, pixel by pixel, frame by frame, on the video. Within that video, when the object moved around, there were tracks. It identified points on the subject that we wanted to follow and linked them to the subject. So there would be a couple of points on the body, the legs, the knees, the elbows, the hands, the head. And so as that person walked in the video, the point tracking would track that object.
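
To make the grid-of-points idea concrete, here is a rough sketch of how point-tracking inputs and outputs are commonly represented: a set of query points laid over the first frame, plus a per-frame position and visibility flag for every point. The array shapes and names are illustrative only, not CoTracker 3's actual interface.

```python
import numpy as np

num_frames, grid_size = 60, 10                    # 60 video frames, 10x10 query grid
height, width = 480, 640

# Query points: a regular grid laid over the first frame
ys, xs = np.meshgrid(np.linspace(0, height - 1, grid_size),
                     np.linspace(0, width - 1, grid_size), indexing="ij")
queries = np.stack([xs.ravel(), ys.ravel()], axis=-1)       # (100, 2): x, y per query point

# Tracks: for each frame, each point gets a predicted position
# plus a flag saying whether it is visible (not occluded) in that frame.
tracks = np.zeros((num_frames, len(queries), 2))             # (T, N, 2): x, y per point per frame
visibility = np.ones((num_frames, len(queries)), dtype=bool) # (T, N): visible or occluded

tracks[0] = queries                                          # frame 0 starts at the query grid
print(tracks.shape, visibility.shape)                        # (60, 100, 2) (60, 100)
```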

Why Is Tracking So Important?

Dalton: So why is that important? Well, think about what is called 3D reconstruction. That is a basic and crucial application for autonomous driving, virtual reality, and medical imaging. Think about video editing and special effects, like tracking a subject and providing that subject with super armor or whatever powers they have and linking those powers with their hand. Motion analysis, understanding players on a field or different things that they're doing. All that stuff involves tracking.

The Evolution of Tracking Models

Dalton: Now that we understand the crucial applications of tracking, let's talk about how this has been evolving. So there are three models that are going to be discussed. The first one is a model called PIPs, which kicked off this evolution of tracking. PIPs comes from the paper Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories, and this was the first of these models to use deep learning for tracking. And then someone built on top of that.

Dalton: So another group of people came out with TAPIR, T-A-P-I-R, which stands for Tracking Any Point with per-frame Initialization and temporal Refinement. So what is that? It's a bit different from the initial approach but builds off of the same concept. Instead of doing basic tracking, it does global tracking, or global matching as they're calling it, to improve tracking accuracy. One of the key issues is that it is quite resource intensive.

Dalton: And now there is what is called CoTracker, which is what this episode is about. So CoTracker 3 incorporates correlation between different tracks to improve performance, particularly when an object is occluded, which is when it's temporarily hidden from view or moves out of the frame.

Key Challenges in Object Tracking

Dalton: Okay, so what are some of the challenges with tracking? We have occlusions, which is when an object becomes temporarily hidden. Complex motions, think about rapid or erratic movements. And then we have changing appearances, where there's some kind of lighting change or object deformation, like maybe the object was in a crash and had a lot of kinetic energy exerted onto it and it's deformed. Or say that I turned up the lights or turned down the lights; the exposure and the way that I look and how I appear on the video will drastically change depending on the lighting. So those are the kinds of things that trackers have trouble with.

How CoTracker 3 Solves These Challenges

Dalton: How do they deal with it in CoTracker 3? They use joint tracking. So joint tracking uses multiple points together, and the thought process on this is pretty simple. CoTracker is trying to digitize vision, as I discussed in the last episode with Meta's Sparsh, which is the ecosystem for robotics to digitize touch. Think about how vision works in your body. Joint tracking, what does that really mean? If you were looking at a car going down the road, you could identify separate points of the car that are critical to the structure. You can think about the hood, the roof, the headlights, the wheels; all of those things are dependent on the structure, and they have to move together because they're part of the structure. So what joint tracking does is say, okay, I have the car, and the car's hood and the roof, those are kind of together, so I'm going to track them together as a group. With joint tracking, there is some logic baked in saying, hey, this is one structure, let's track it together.
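
Here is a toy illustration (our own, not taken from the paper) of why tracking points jointly helps during occlusion: if a few points on an object stay visible, their shared motion can carry the hidden points along. CoTracker itself does this with attention between tracks inside a transformer rather than a simple average, but the intuition is the same.

```python
import numpy as np

# Five points on a car in the previous frame: hood, roof, wheel, wheel, headlight
prev = np.array([[100., 200.], [110., 180.], [90., 230.], [140., 230.], [95., 210.]])

# In the current frame the car moved ~12 px to the right,
# but the last two points are occluded (hidden behind another car, say)
visible = np.array([True, True, True, False, False])
curr_visible = prev[visible] + np.array([12.2, -0.5])

# Joint tracking: estimate one shared motion from the points we can still see...
shared_motion = (curr_visible - prev[visible]).mean(axis=0)

# ...and use it to keep the occluded points in a plausible place
curr_estimate = prev + shared_motion
print("shared motion:", shared_motion)
print("occluded points carried along:", curr_estimate[~visible])
```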

Dalton: And then there are these 4D correlation features, which I found pretty confusing.

Videos are 2D. So how does a 2D video that films a 3D world become 4D? And what they're saying is it's the height and the width and time, which is 3D. And then 4D is correlation between the different tracks.

Dalton: It compares the features that are identified on an object, extracts them at different frames, and builds up these motion patterns, and those patterns are part of this fourth dimension. And then there are iterative updates. So the way that this architecture is created is that they have the joint tracking, the 4D correlation features, and these iterative updates. The iterative updates change the correlation over time while the video is being processed. The transformer is constantly updating, and it uses things like object texture, speed, motion, how visible the object is, when the object is not visible, or whether it changes directions. And it improves the estimates because it's constantly updating upon itself in real time.
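
One loose way to picture that fourth dimension, as our own sketch rather than the paper's exact construction: for every track and every frame, compare the track's feature vector against a small neighborhood of that frame's feature map, so the resulting correlation volume is indexed by track, time, and two spatial offsets.

```python
import numpy as np

T, H, W, C = 8, 32, 32, 16      # frames, feature-map height/width, channels
N, r = 5, 3                     # number of tracks, neighborhood radius

rng = np.random.default_rng(0)
feat_maps = rng.standard_normal((T, H, W, C))       # per-frame feature maps (e.g. from a CNN)
track_feats = rng.standard_normal((N, C))           # one feature vector per track
track_pos = rng.integers(r, H - r, size=(N, T, 2))  # current (y, x) estimate per track per frame

corr = np.zeros((N, T, 2 * r + 1, 2 * r + 1))       # the "4D" correlation volume
for n in range(N):
    for t in range(T):
        y, x = track_pos[n, t]
        patch = feat_maps[t, y - r:y + r + 1, x - r:x + r + 1]  # local neighborhood around the estimate
        corr[n, t] = patch @ track_feats[n]                     # dot-product similarity per offset

print(corr.shape)  # (5, 8, 7, 7): track x time x vertical offset x horizontal offset
```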

A Smarter Training Approach: Knowledge Distillation

Dalton: Okay, so now we're gonna move over to how CoTracker works. What are some of the key differences between CoTracker 3 and these other models? The first one I think is crucial to identify: with these tracking models, the training videos aren't very generalized, and they rely on static datasets because it's quite expensive to label these video datasets for training. It was so expensive that companies chose to build in-house models and in-house datasets, and they weren't being shared. That's not as much of an issue here as it was before, because I think there's less direct commercial application, but there's a similar issue with the synthetic data piece, where there aren't enough labeled training videos.

Dalton: So they use knowledge distillation, which is when you have multiple teachers, in this case. These teachers are models that have already been pre-trained. The teachers label the real-world videos instead of using synthetic data. And the issue with synthetic data is that it's not generalized, it doesn't model the real world, and it's expensive to create. That being said, CoTracker 3 doesn't use synthetic videos, it uses real videos. So you have the teachers providing the labeled videos, and they're providing that output to the student model, which is what becomes CoTracker 3.

Dalton: I would think about it as an analogy where a teacher has different strengths and weaknesses. Like think about your math teacher, your history teacher, your science teacher. They all have different personality characteristics. What the student's job is to do is one, take the results from the teacher, study, figure out how to get similar results. And two, take the strengths of each teacher and then embody that.

Eventually the student is more competent than the teachers.

Dalton: So that's a great approach to strengthen the model and not only does it strengthen the model, it allows the model to use real-world data.
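
Here is a minimal sketch of the student-teacher setup described above, assuming placeholder model classes and shapes rather than Meta's actual training code: frozen teachers pseudo-label unlabeled real videos, and the student is trained to match their output (averaging the teachers here is just for illustration).

```python
import torch
import torch.nn as nn

class TinyTracker(nn.Module):
    """Stand-in tracker head: maps per-point features to an (x, y) position."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(64, 2)

    def forward(self, video_feats):          # video_feats: (frames, points, 64)
        return self.head(video_feats)        # -> (frames, points, 2)

teachers = [TinyTracker().eval() for _ in range(3)]  # stand-ins for pre-trained models, kept frozen
student = TinyTracker()                              # the model that "becomes" CoTracker 3
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):                              # loop over unlabeled real-world videos
    video_feats = torch.randn(16, 50, 64)            # dummy stand-in for real video features

    with torch.no_grad():                            # teachers pseudo-label the video
        pseudo_label = torch.stack([t(video_feats) for t in teachers]).mean(dim=0)

    pred = student(video_feats)                      # student learns to match the teachers
    loss = nn.functional.l1_loss(pred, pseudo_label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```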

The Power of Efficiency: 1000x Less Data

Dalton: And then another cool thing is that CoTracker 3, with this simplified architecture and approach, is no longer as resource intensive.

Now it's able to get similar results to the competitor models while training on a thousand times less data.

Dalton: And the competitor model trained on 15 million videos. CoTracker 3 trained on 15,000. Big difference, better performance, more efficient, and real-world data, which makes it more generalized. So they're solving quite a few things there. They simplified the architecture. They allowed the processing of real-world data. They use knowledge distillation to let the model pick up on things a bit faster than it normally would. The student is taking all the pros, none of the cons.

Under the Hood: Technical Aspects of CoTracker 3

Dalton: There are some other pieces that I found quite interesting. The first one is feature mapping. So they use CNNs, which is Convolutional Neural Networks. And a CNN is a spatial model that understands and vectorizes images. It comes with what I would think about as maybe like a magnifying glass. It has multiple layers. The first magnifying glass might be very close and fine-grained, so it's only checking for different pixel colors, edges, and textures. Then the next layer will go over and maybe the magnifying glass is zoomed out a bit and you can see a bigger picture. And from there, you can see this is an object, this is a blue hat or this is a blue car. And then the third one might be a little bit more detailed. That's feature mapping.
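
As a small sketch of that magnifying-glass idea, here is a stack of convolution layers where each layer effectively looks at a larger area of the frame than the one before it. The layer sizes are arbitrary and are not CoTracker 3's actual backbone.

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),             # fine detail: edges, colors, texture
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # zoomed out: small patterns
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(), # zoomed out more: object parts
)

frame = torch.randn(1, 3, 256, 256)          # one RGB video frame
feature_map = feature_extractor(frame)
print(feature_map.shape)                     # torch.Size([1, 128, 64, 64])
```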

Dalton: Another technical aspect that I thought was interesting was what's called downsampling. Downsampling is the approach of having an image and decreasing the pixels of the image. Why would you want to do that? Because then it would become blurry and things are less clear. You do that because it's faster to process. So downsampling allows the model to understand roughly what's going on and then it's able to process more information than it normally would be able to. It just basically reduces the resolution of a frame within a video.
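
A quick sketch of downsampling, with an arbitrary factor of four just to show the idea:

```python
import torch
import torch.nn.functional as F

frame = torch.randn(1, 3, 480, 640)          # full-resolution frame
small = F.interpolate(frame, scale_factor=0.25, mode="bilinear", align_corners=False)
print(frame.shape, "->", small.shape)        # (1, 3, 480, 640) -> (1, 3, 120, 160)
```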

Dalton: And the next thing would be multi-scale features. Multi-scale features are like different zoom levels for looking at a frame. Zooming out and zooming in to understand the different nuances. If you look at a photo or a piece of art very close up, you can see every minute detail. Then when you step five steps back, you'll see something that you couldn't see before because you have a different perspective. And then you step 15 steps back and you might see something that people have never seen before. It's kind of a similar thing.
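
And a sketch of multi-scale features: run the same placeholder feature extractor on the frame at several resolutions, like stepping closer to and farther from a painting. The tiny conv stack here is illustrative, not CoTracker 3's actual backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

extractor = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())

frame = torch.randn(1, 3, 256, 256)
pyramid = [extractor(F.interpolate(frame, scale_factor=s, mode="bilinear",
                                   align_corners=False))
           for s in (1.0, 0.5, 0.25)]

for level in pyramid:
    print(level.shape)    # same channels, progressively coarser spatial grids
```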

Real-World Applications: From Robotics to Hollywood

Dalton: So where does that leave applications? Well, as I mentioned earlier, 3D tracking and 3D reconstruction for robotics, autonomous driving, all those really cool applications that are going to revolutionize society, they're important. You need to be able to track objects that are around you. You need to be able to predict where objects might go in an instant. For you to be able to have autonomous driving, there needs to be event tracking, and not only tracking of moving objects like a pedestrian or a bird, but stationary ones like a stop sign.

If you can't track, you can't drive is what I say to the autonomous vehicle.

Dalton: Then there's video generation and editing. I mentioned earlier with the special effects, like generating these special effects. Say that I have some hand movement and I shoot out lightning from my hand. How do you track my arm movement and my hand? And maybe with my muscle flexing, that's when the lightning's supposed to shoot out. There are different levels of detail you'd be able to get if you had more advanced tracking.

Dalton: And as I mentioned before, robotics. For a robot to understand how things are interacting with them and their reality, they need to also understand reality itself. That means mimicking a lot of these human functions that we have that are related to thought, vision, sight, touch, hearing. If you can see and understand touch, you're pretty far along. If you could figure out the very complex nuances between vision, tracking objects within your vision, and then how to interact with objects via touch, you're pretty far along on just being able to assign tasks to robots and have robots working independently.

The Future of Tracking and Its Ethical Considerations

Dalton: Last thing would be sports analysis. Understanding what is interacting with the ball or the objects that you're tracking, which in this case would be humans playing on a field. I was mostly thinking about coaching and being able to have a professional trainer provide in-depth details of what a player is doing and how they could do things better.

Dalton: The future of tracking, this point tracking feature, it's gonna be great for robotics and it could help shape some of the future research that comes out. One of the key concerns with this technology might be tracking people. What about military applications, where you're using it to track vehicles and then launch munitions like missiles towards the object that you're tracking?

...using this device to assassinate individuals with drone strikes, like autonomous drone strikes.

Dalton: You're not having someone drive the drone anymore, you're just identifying, okay, here's a picture of this person that I wanna assassinate. Find them, track them, and then find the most optimal place to perform this deed. It's kind of this sleeper-cell drone thing from Skynet that just takes over. I hope that we don't go that route, but humans left to their own devices typically choose destruction before peace. So we'll see how it plays out.

Dalton: But anyways, I'm super excited about this technology and how it's going to move the world forward. And I hope that you are too. If you are, share a little bit in the comments what you think. And if you found this episode interesting, of course, let me know.

RESOURCES MENTIONED

  • CoTracker 3 (Meta Research Paper)
  • PIPs (Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories)
  • TAPIR (Tracking Any Point with per-frame Initialization and temporal Refinement)
  • Meta's Sparsh
  • MovieGen (Meta Research Paper)
  • Hurting the Lamas paper
  • Iowa paper

INDEX OF CONCEPTS

CoTracker 3, Meta, Point Tracking, 3D Reconstruction, Autonomous Driving, Virtual Reality, Medical Imaging, PIPS, TAPIR, Occlusions, Joint Tracking, 4D Correlation Features, Knowledge Distillation, CNNs (Convolutional Neural Networks), Feature Mapping, Downsampling, Multi-scale Features, Robotics, Video Generation, Sports Analysis, MovieGen, Skynet