Meta's Sparsh: A New Era for Robotic Touch Sensing

Meta's Sparsh paper introduces a leap in robotic touch, moving from costly, task-specific models to general-purpose, self-supervised learning. Listen to the full episode to learn more.

TL;DR

Meta's Sparsh is changing robotics by teaching machines to 'feel' like humans, using self-supervised learning and open data to create adaptable, general-purpose touch. #VentureStep #AI #Robotics

INTRODUCTION

While robotics has made incredible strides in vision and audio processing, the sense of touch has remained a significant bottleneck. For years, progress was hampered by proprietary, task-specific models that were expensive to develop and couldn't generalize to new situations. Each company or research lab was creating custom sensors and siloed datasets, preventing the kind of collaborative progress seen in other areas of AI. This lack of standardization made it difficult to share data, compare results, and build upon previous work.

In this episode, host Dalton Anderson dives into a groundbreaking research paper from Meta's FAIR lab that addresses these challenges head-on. The paper introduces Sparsh, an ecosystem designed to create a generalized, self-supervised model for robotic touch. By moving away from costly labeled data and towards a system that learns from observation, Meta aims to democratize tactile sensing and accelerate innovation across the entire field of robotics.

Dalton breaks down the core concepts behind Sparsh, explaining the critical shift from supervised to self-supervised learning and how it allows models to understand the nuances of touch without explicit instructions. The discussion covers Meta's release of a massive public dataset to fuel research, the creation of a standardized evaluation system called TacBench, and the fascinating, if not brutal, methods used to train these models to interact with the physical world.

KEY TAKEAWAYS

  • The field of robotic touch has been limited by expensive, task-specific models that relied on supervised learning and lacked standardized benchmarks.
  • Meta's Sparsh introduces a general-purpose approach using self-supervised learning, allowing a model to understand tactile interactions without needing costly labeled data.
  • To fuel industry-wide progress, Meta released a massive open dataset of 475,000 tactile images and created TacBench, a standardized benchmark for evaluating touch representations.
  • Sparsh uses vision-based tactile sensors—squishy gel "nubs"—that deform on contact, allowing a camera to capture and interpret the force, texture, and slippage of an object.
  • The ultimate goal of this research is to make robots more versatile, adaptable, and efficient, enabling them to perform delicate tasks like assembly or surgery with precise control.

FULL CONVERSATION

The Current Challenges in Robotic Touch

Dalton: Welcome to the Venture Step Podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. There have been quite a few technologies that have completely altered the way that we live and communicate. The easiest of these to talk about would be computers. Computers were originally the size of a car, and then we optimized the cost and size of the technology required to execute tasks, and that computer is now within your home. The next step was phones. Similar things are happening with robotics.

Dalton: This paper that Meta came out with from FAIR discusses digitizing touch, and the paper is "Sparsh: Self-Supervised Touch Representations for Vision-Based Tactile Sensing." That's what we're going to be discussing today. Robots have vision, and they're integrating LLMs for voice interactions and audio inputs that can be encoded to understand audio, turn it into text, process it, and then respond back with a voice. But one thing that we're missing is the generalization of touch.

Dalton: There have been quite a few proprietary or custom-built models for touch, but they have been task-based. What that means is we would train a robot to do one thing very well. But if you try to do anything else, it wouldn't work. Meta saw a couple of issues with how we're going about this application. One of the crucial things is changing the way that the data is processed and what kind of data we can use.

Dalton: They talked about how everyone was building custom-made sensors and there was no standardization of benchmarks. What benchmarks are we collecting? How do we collect them? How do we store them?

Dalton: It's also difficult to integrate all these sensors with these benchmarks. So they tried to create an ecosystem, which is what Sparsh is. There were also issues with sharing data, with how models were being trained on it, with the cost and limited accessibility of the data, and with the expense of obtaining labeled data.

Supervised vs. Self-Supervised Learning: A Critical Shift

Dalton: One of the key issues is that these programs were making their own custom models and they were using supervised training data. Supervised training data is something you would use in a regression or classification model. Basically, that means you have a dataset. The easiest example is: is this a dog or a cat? We're just saying, is this a dog or is it a cat?

Dalton: In supervised learning, you have the dataset, and the dataset has metadata associated with an image. In this case, you have an image of a black cat, and associated with it would be additional data saying, "this is a black cat." The model would then know what a black cat looks like. You might have 10,000 images of black cats, and then the same for white cats, and for different dogs.

Dalton: But then what if you try to give the model a purple dog or some kind of cat or dog it wasn't trained on? It doesn't work. Or what if you wanted to do dogs, cats, and maybe birds? It wouldn't know what a bird was because it had never seen a bird before. You have to have everything labeled and classified, which is very time-consuming and expensive.
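To make the supervised setup concrete, here is a minimal sketch (not from the Sparsh paper) of a toy classifier trained only on labeled "cat" and "dog" features. The feature vectors and labels are invented for illustration; the point is that anything outside the labeled classes, like the bird Dalton mentions, can only be mapped back onto a class the model already knows.

```python
# Illustrative only: a toy supervised classifier in the spirit of the
# cat-vs-dog example from the episode (not from the Sparsh paper).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are image feature vectors with human-written labels.
cat_features = rng.normal(loc=-1.0, size=(100, 8))
dog_features = rng.normal(loc=+1.0, size=(100, 8))
X = np.vstack([cat_features, dog_features])
y = ["cat"] * 100 + ["dog"] * 100

clf = LogisticRegression(max_iter=1000).fit(X, y)

# A "bird" (or a purple dog) was never labeled, so the model can only
# answer with one of the classes it was taught.
bird_features = rng.normal(loc=5.0, size=(1, 8))
print(clf.predict(bird_features))  # still prints 'cat' or 'dog'
```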

Dalton: Instead of using supervised learning, they use self-supervised learning. The model has to understand the relationship between the actual image and how it relates to the robot's touch: the different textures, the shape, the force applied in these tactile images and videos, and how that affects slippage.

Dalton: Self-supervised learning is described as a teacher-student situation where the student, which is one of two parallel models, has to use the teachings from the teacher model to get a similar result. It's good at things like predicting missing parts of an image, understanding unlabeled images, and grouping similar images. Self-supervised learning is kind of how we would go about learning as humans.
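The teacher-student idea can be sketched roughly as below. This is a schematic in the spirit of the DINO/I-JEPA-style training that Sparsh builds on, not Meta's actual code: the tiny encoders, the 0.996 momentum value, and the random tensors standing in for tactile images are all placeholders.

```python
# A schematic student-teacher self-supervised loop (illustrative only).
import torch
import torch.nn as nn

def make_encoder(dim=128):
    return nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, dim))

student = make_encoder()
teacher = make_encoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never trained directly

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(100):
    # Two "views" of the same (fake) tactile image, e.g. crops or masked copies.
    frame = torch.rand(16, 1, 32, 32)
    view_student = frame + 0.1 * torch.randn_like(frame)
    view_teacher = frame

    with torch.no_grad():
        target = teacher(view_teacher)           # teacher's representation
    pred = student(view_student)                 # student tries to match it

    loss = nn.functional.mse_loss(pred, target)  # no human labels anywhere
    opt.zero_grad(); loss.backward(); opt.step()

    # The teacher follows the student as an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(0.996).add_(0.004 * ps)
```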

Sparsh: Meta’s Three-Part Solution

Dalton: So what Meta is doing is they made their own dataset of 475,000 tactile images that other people can train their models on. The general sense of this is, one, make it easier for people to play around with digitizing touch. Get it into as many people's hands as possible. Two, release a dataset for the cohort of robotics experts, engineers, and researchers. And the last thing is to generalize these tasks.

Dalton: The idea is to understand your world and interact with the objects in it, instead of just training for one object or a couple of objects in a specified task. Train for everything.

Dalton: In a general sense, it seems to be solving some crucial problems to move robotics forward. This would just make robots more versatile, adaptable, and efficient. There was a blatant inefficiency of private companies continuously training and gathering data and then creating their own custom models. Now we have this massive open public dataset, and it's open for researchers to publish whatever they need to.

How Vision-Based Sensors Digitize Touch

Dalton: One thing the paper explained was the visualization of touch, which is the vision-based tactile sensors. It's quite interesting. They have sensors with markers and without markers. The ones with markers have these little dots on the gel; it's like a squishy gel thing. I would call them gel nubs. Little robot gel nubs.

Dalton: When you touch something, think about the tip of your finger. The skin on the tip of your finger kind of changes a bit. You could see how it slides or presses against the object you're holding. Same thing with these little gels. The gels with the markers understand, okay, one of my points on my gel now moved from this position to another position. So, this was the amount of touch, this was the amount of slippage, this was the amount of force applied before slippage happened.

Dalton: It's able to mathematically understand the interactions between the robot gel nubs, the robot's grip, and the object that it's touching. I think you would explain it as a little gel nub that's squishy, and when you touch it, the gel deforms. When the gel deforms, there are measurements taken. From there, that information is processed, encoded, and stored.
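A rough way to picture what the marker-based gels measure: track the marker positions between two camera frames and turn their displacement into a shear/slip signal. This is a simplified illustration; the positions and the slip threshold below are invented, not values from the paper.

```python
# A simplified sketch of what a marker-based tactile sensor measures:
# track gel markers between frames and turn their displacement into a
# rough shear/slip signal. Thresholds and units here are invented.
import numpy as np

def marker_displacement(markers_before: np.ndarray, markers_after: np.ndarray):
    """Each array is (N, 2): pixel positions of the same N gel markers."""
    disp = markers_after - markers_before          # per-marker motion vectors
    shear = np.linalg.norm(disp, axis=1)           # how far each marker slid
    return disp, shear

before = np.array([[10.0, 10.0], [20.0, 10.0], [30.0, 10.0]])
after  = np.array([[10.5, 10.1], [23.0, 10.2], [30.4, 10.0]])

disp, shear = marker_displacement(before, after)
print("mean shear (pixels):", shear.mean())

# A sudden jump in shear that isn't matched by increased pressure is a
# hint that the object is starting to slip in the gripper.
if shear.max() > 2.0:                              # made-up threshold
    print("possible slip detected at marker", shear.argmax())
```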

Introducing TacBench: A Standard for Robot Touch

Dalton: And then they made this other thing called TacBench, and they standardized a benchmark system for the evaluation of touch representations. The things that they standardized were slip detection, pose estimation, grasp stability, texture recognition, and force estimation. How can the model properly detect objects slipping digitally?

Dalton: Pose estimation is understanding where an object is in 3D space and understanding the distance between you and the object. There's grasp stability, texture recognition, and understanding how that might affect how much force is required and how that will affect your grip, which would affect your slip.
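A common way to benchmark learned touch representations, and roughly the spirit of TacBench's evaluation, is to freeze the pretrained encoder and train only small task heads on top of it, one per task. The sketch below is illustrative, not the official TacBench code; the backbone, head sizes, and fake data are stand-ins.

```python
# Schematic of the "frozen backbone + small task heads" evaluation idea
# (illustrative; the real benchmark uses real tactile data and encoders).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))  # stands in for Sparsh
for p in backbone.parameters():
    p.requires_grad_(False)            # representations stay frozen

slip_head = nn.Linear(128, 2)          # slip / no-slip classification
force_head = nn.Linear(128, 3)         # (Fx, Fy, Fz) regression

opt = torch.optim.Adam(
    list(slip_head.parameters()) + list(force_head.parameters()), lr=1e-3
)

tactile_batch = torch.rand(8, 1, 32, 32)       # fake tactile images
slip_labels = torch.randint(0, 2, (8,))
force_labels = torch.rand(8, 3)

features = backbone(tactile_batch)             # frozen features
loss = (nn.functional.cross_entropy(slip_head(features), slip_labels)
        + nn.functional.mse_loss(force_head(features), force_labels))
opt.zero_grad(); loss.backward(); opt.step()   # only the heads are updated
```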

The Brutal Training: Bead Mazes and Masked Videos

Dalton: Then there's the bead maze problem, which I had no idea about. A bead maze is a children's toy that's used for hand-eye coordination. It's a wired toy that has beads on it, and you try to get the bead through the maze. They do the same thing with the robot, which sounds like torture because they don't train it on what the right answer is.

Dalton: You've got to think about it. It's like a blind man trying to go through a maze with only his hand. It's in the dark, can't see, only has its hand. So it could feel that, one, I am pressing on the bead, I'm holding it, and I'm moving with little resistance. Okay, so this might be the right answer. And then you keep moving and then there's resistance that comes up. That's the wrong answer. It just sounds like torture to me.

Dalton: There was this other thing they were talking about, which was tube masking for video. Masked video modeling is a technique of randomly hiding 30 to 40 seconds of a video and having the model predict what happened. That sounds brutal as well. Some of this stuff just sounds straight up like torture. This bead maze thing just sounds excruciating.
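For the masked-video idea, a toy version of tube masking looks like the following: hide the same spatial patches across every frame of a clip and train a model to reconstruct what was hidden. The patch size and the 40% masking ratio here are made up for illustration, not values from the paper.

```python
# A toy illustration of tube masking for masked video modeling: hide the
# same spatial patches across every frame and ask a model to fill them in.
import torch

video = torch.rand(16, 1, 32, 32)             # (frames, channels, H, W)
patch = 8
grid = (32 // patch) ** 2                      # 16 spatial patches per frame

mask_ratio = 0.4                               # illustrative ratio
hidden = torch.rand(grid) < mask_ratio         # choose patches to hide

masked = video.clone()
for idx in torch.nonzero(hidden).flatten():
    r, c = divmod(int(idx), 32 // patch)
    # The same patch is blanked in every frame -- a "tube" through time.
    masked[:, :, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0

# A masked autoencoder would now be trained to reconstruct `video` from `masked`.
print("patches hidden:", int(hidden.sum()), "of", grid)
```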

Visualizing Force with Heat Maps

Dalton: So how does Sparsh understand these tactile images? It has a standardized benchmark. It understands and was trained on tactile data. When it's in action, it's basically just understanding and visualizing the force fields. The way that Meta did it was they have a pixelized grid map.

Dalton: To understand the force and the force manipulation that was required, it would have a heat map of where the object was touched and then how hard—how much force was put on the object with the grip, where it was touched, and how it was touched. Then you can measure, okay, you grabbed the object here, you slipped, or you didn't grab it hard enough, and you can understand what was going wrong. Their method makes a lot more sense than the other method. Visualization is always easier to understand.
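The force heat map can be pictured as a per-pixel grid of estimated normal force over the sensor surface. The sketch below just plots a synthetic example of such a grid; the values are invented, not output from Sparsh.

```python
# A minimal sketch of the "heat map of force" idea: a per-pixel grid of
# estimated normal force over the sensor surface (synthetic values here).
import numpy as np
import matplotlib.pyplot as plt

h, w = 32, 32
yy, xx = np.mgrid[0:h, 0:w]

# Pretend the object was pressed near one spot on the gel pad.
center = (14, 20)
force_map = np.exp(-(((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / 40.0))

plt.imshow(force_map, cmap="hot")
plt.colorbar(label="estimated normal force (arbitrary units)")
plt.title("Where and how hard the object was touched")
plt.show()
```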

The Real-World Implications of Sparsh

Dalton: In a general sense, you're utilizing the generalization of touch, standardizing the benchmarks required to understand what touch is, and then you're visualizing how your grip interacted with the object.

Dalton: So it's about understanding the right amount of force to prevent damage while also not allowing slippage. How do you manipulate objects without crushing them? Performing tasks that require precise control, like assembly or surgery.

Dalton: Has this stuff worked? Yes, it has. If you listened to the last episode, I talked about how it was like 93% better than the other methods. It's been outperforming traditional force estimation models. And that is a factor of, one, standardizing the data; two, having a good generalized approach with self-supervised learning; and then understanding and visualizing these touches and how this works.

Final Thoughts on the Sparsh Paper

Dalton: I think that gives you a general sense of how this all works. I think I provided enough information for this podcast episode to be useful. I encourage everyone, if you're interested, to read the paper. I think it's a well-done paper.

Dalton: I worked late and didn't necessarily want to do this podcast episode, but I read the paper and was so energized, intrigued, and overall fascinated by the information and how they went about solving the problems.

Dalton: I thought it was quite interesting, the difference in performance when you trained on one-third of the data, half the data, or a tenth of the data. It seems like in certain scenarios, you could get away with only training on a tenth of the data. It's only about a 0.5% difference. There are some cool snippets that I didn't necessarily get to talk about in this episode.

Dalton: If you thought this podcast was interesting, give me a comment and let me know what you think, what you found interesting, or if you have any additional insights if you've worked in this space. Once again, I'm not a robotics researcher. I do not work in computer vision. I work in insurance, but I did find this paper quite interesting.

RESOURCES MENTIONED

  • Research Paper: "Sparsh: Self-Supervised Touch Representations for Vision-Based Tactile Sensing" by Meta FAIR
  • Research Paper: Herding the Llamas
  • Companies: Meta, FAIR, OpenAI, Anthropic, Google
  • Podcast Platforms: YouTube, Apple Podcasts, Spotify

INDEX OF CONCEPTS

Sparsh, Meta, FAIR, Self-Supervised Learning, Supervised Learning, TacBench, Pose Estimation, Slip Detection, Grasp Stability, Texture Recognition, Force Estimation, Bead Maze, Tube Masking (Masked Video Modeling), Vision-Based Tactile Sensors, OpenAI, Anthropic, Google, Herding the Llamas