Meta's Sparsh: A New Era for Robotic Touch Sensing

Meta's Sparsh paper introduces a leap in robotic touch, moving from costly, task-specific models to general-purpose, self-supervised learning. Listen to the full episode to learn more.

TL;DR

Meta's Sparsh is changing robotics by teaching machines to 'feel' like humans, using self-supervised learning and open data to create adaptable, general-purpose touch. #VentureStep #AI #Robotics

INTRODUCTION

While robotics has made incredible strides in vision and audio processing, the sense of touch has remained a significant bottleneck. For years, progress was hampered by proprietary, task-specific models that were expensive to develop and couldn't generalize to new situations. Each company or research lab was creating custom sensors and siloed datasets, preventing the kind of collaborative progress seen in other areas of AI. This lack of standardization made it difficult to share data, compare results, and build upon previous work.

In this episode, host Dalton Anderson dives into a groundbreaking research paper from Meta's FAIR lab that addresses these challenges head-on. The paper introduces Sparsh, an ecosystem designed to create a generalized, self-supervised model for robotic touch. By moving away from costly labeled data and towards a system that learns from observation, Meta aims to democratize tactile sensing and accelerate innovation across the entire field of robotics.

Dalton breaks down the core concepts behind Sparsh, explaining the critical shift from supervised to self-supervised learning and how it allows models to understand the nuances of touch without explicit instructions. The discussion covers Meta's release of a massive public dataset to fuel research, the creation of a standardized evaluation system called TacBench, and the fascinating, if not brutal, methods used to train these models to interact with the physical world.

KEY TAKEAWAYS

  • The field of robotic touch has been limited by expensive, task-specific models that relied on supervised learning and lacked standardized benchmarks.
  • Meta's Sparsh introduces a general-purpose approach using self-supervised learning, allowing a model to understand tactile interactions without needing costly labeled data.
  • To fuel industry-wide progress, Meta released a massive open dataset of 475,000 tactile images and created TacBench, a standardized benchmark for evaluating touch representations.
  • Sparsh uses vision-based tactile sensors—squishy gel "nubs"—that deform on contact, allowing a camera to capture and interpret the force, texture, and slippage of an object.
  • The ultimate goal of this research is to make robots more versatile, adaptable, and efficient, enabling them to perform delicate tasks like assembly or surgery with precise control.

FULL CONVERSATION

The Current Challenges in Robotic Touch

Dalton: Welcome to the Venture Step Podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. There have been quite a few technologies that have completely altered the way that we live and communicate. The easiest of these to talk about would be computers. Computers were originally the size of a car, and then we optimized the cost and size of the technology required to execute tasks, and that computer is now within your home. The next step was phones. Similar things are happening with robotics.

Dalton: This paper that Meta came out with from FAIR discusses digitizing touch, and the paper is "Sparsh: Self-Supervised Touch Representations for Vision-Based Tactile Sensing." That's what we're going to be discussing today. Robots have vision, and they're integrating LLMs for voice interactions and audio inputs that can be encoded to understand audio, turn it into text, process it, and then respond back with a voice. But one thing that we're missing is the generalization of touch.

Dalton: There have been quite a few proprietary or custom-built models for touch, but they have been task-based. What that means is we would train a robot to do one thing very well. But if you try to do anything else, it wouldn't work. Meta saw a couple of issues with how we're going about this application. One of the crucial things is changing the way that the data is processed and what kind of data we can use.

Dalton: They talked about how everyone was building custom-made sensors and there was no standardization of benchmarks. What benchmarks are we collecting? How do we collect them? How do we store them?

Dalton: It's also difficult to integrate all these sensors with these benchmarks. So they tried to create an ecosystem, which is what Sparsh is. There were also issues with sharing data, with how models were being trained on it, with the cost and limited accessibility of the data, and with the expense of obtaining labeled data.

Supervised vs. Self-Supervised Learning: A Critical Shift

Dalton: One of the key issues is that these programs were making their own custom models and they were using supervised training data. Supervised training data is something you would use in a regression or classification model. Basically, that means you have a dataset. The easiest example is: is this a dog or a cat? We're just saying, is this a dog or is it a cat?

Dalton: In supervised learning, you have the dataset, and the dataset has metadata associated with an image. In this case, you have an image of a black cat, and associated with it would be additional data saying, "this is a black cat." The model would then know what a black cat looks like. You might have 10,000 images of black cats, and then the same for white cats, and for different dogs.

Dalton: But then what if you try to give the model a purple dog or some kind of cat or dog it wasn't trained on? It doesn't work. Or what if you wanted to do dogs, cats, and maybe birds? It wouldn't know what a bird was because it had never seen a bird before. You have to have everything labeled and classified, which is very time-consuming and expensive.
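To make the supervised setup concrete, here is a minimal sketch (not from the Sparsh paper) of a toy classifier trained only on labeled "cat" and "dog" features. The feature vectors and labels are invented for illustration; the point is that anything outside the labeled classes, like the bird Dalton mentions, can only be mapped back onto a class the model already knows.

```python
# Illustrative only: a toy supervised classifier in the spirit of the
# cat-vs-dog example from the episode (not from the Sparsh paper).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are image feature vectors with human-written labels.
cat_features = rng.normal(loc=-1.0, size=(100, 8))
dog_features = rng.normal(loc=+1.0, size=(100, 8))
X = np.vstack([cat_features, dog_features])
y = ["cat"] * 100 + ["dog"] * 100

clf = LogisticRegression(max_iter=1000).fit(X, y)

# A "bird" (or a purple dog) was never labeled, so the model can only
# answer with one of the classes it was taught.
bird_features = rng.normal(loc=5.0, size=(1, 8))
print(clf.predict(bird_features))  # still prints 'cat' or 'dog'
```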

Dalton: Instead of using supervised learning, they use self-supervised learning. The model has to understand the relationship between the actual image and how it relates to the robot's touch: the different textures, the shape, the force applied in these tactile images and videos, and how that affects slippage.

Dalton: Self-supervised learning is described as a teacher-student situation where the student, which is one of two parallel models, has to use the teachings from the teacher model to get a similar result. It's good at things like predicting missing parts of an image, understanding unlabeled images, and grouping similar images. Self-supervised learning is kind of how we would go about learning as humans.
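The teacher-student idea can be sketched roughly as below. This is a schematic in the spirit of the DINO/I-JEPA-style training that Sparsh builds on, not Meta's actual code: the tiny encoders, the 0.996 momentum value, and the random tensors standing in for tactile images are all placeholders.

```python
# A schematic student-teacher self-supervised loop (illustrative only).
import torch
import torch.nn as nn

def make_encoder(dim=128):
    return nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, dim))

student = make_encoder()
teacher = make_encoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never trained directly

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(100):
    # Two "views" of the same (fake) tactile image, e.g. crops or masked copies.
    frame = torch.rand(16, 1, 32, 32)
    view_student = frame + 0.1 * torch.randn_like(frame)
    view_teacher = frame

    with torch.no_grad():
        target = teacher(view_teacher)           # teacher's representation
    pred = student(view_student)                 # student tries to match it

    loss = nn.functional.mse_loss(pred, target)  # no human labels anywhere
    opt.zero_grad(); loss.backward(); opt.step()

    # The teacher follows the student as an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(0.996).add_(0.004 * ps)
```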

Sparsh: Meta’s Three-Part Solution

Dalton: So what Meta is doing is they made their own dataset of 475,000 tactile images that other people can train their models on. The general sense of this is, one, make it easier for people to play around with digitizing touch. Get it into as many people's hands as possible. Two, release a dataset for the cohort of robotics experts, engineers, and researchers. And the last thing is to generalize these tasks.

Dalton: The idea is to understand your world and interact with the objects in it, instead of just training for one object or a couple of objects in a specified task. Train for everything.

Dalton: In a general sense, it seems to be solving some crucial problems to move robotics forward. This would just make robots more versatile, adaptable, and efficient. There was a blatant inefficiency of private companies continuously training and gathering data and then creating their own custom models. Now we have this massive open public dataset, and it's open for researchers to publish whatever they need to.

How Vision-Based Sensors Digitize Touch

Dalton: One thing the paper explained was the visualization of touch, which is the vision-based tactile sensors. It's quite interesting. They have sensors with markers and without markers. The ones with markers have these little dots on the gel; it's like a squishy gel thing. I would call them gel nubs. Little robot gel nubs.

Dalton: When you touch something, think about the tip of your finger. The skin on the tip of your finger kind of changes a bit. You could see how it slides or presses against the object you're holding. Same thing with these little gels. The gels with the markers understand, okay, one of my points on my gel now moved from this position to another position. So, this was the amount of touch, this was the amount of slippage, this was the amount of force applied before slippage happened.

Dalton: It's able to mathematically understand the interactions between the robot gel nubs, the robot's grip, and the object that it's touching. I think you would explain it as a little gel nub that's squishy, and when you touch it, the gel deforms. When the gel deforms, there are measurements taken. From there, that information is processed, encoded, and stored.
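A rough way to picture what the marker-based gels measure: track the marker positions between two camera frames and turn their displacement into a shear/slip signal. This is a simplified illustration; the positions and the slip threshold below are invented, not values from the paper.

```python
# A simplified sketch of what a marker-based tactile sensor measures:
# track gel markers between frames and turn their displacement into a
# rough shear/slip signal. Thresholds and units here are invented.
import numpy as np

def marker_displacement(markers_before: np.ndarray, markers_after: np.ndarray):
    """Each array is (N, 2): pixel positions of the same N gel markers."""
    disp = markers_after - markers_before          # per-marker motion vectors
    shear = np.linalg.norm(disp, axis=1)           # how far each marker slid
    return disp, shear

before = np.array([[10.0, 10.0], [20.0, 10.0], [30.0, 10.0]])
after  = np.array([[10.5, 10.1], [23.0, 10.2], [30.4, 10.0]])

disp, shear = marker_displacement(before, after)
print("mean shear (pixels):", shear.mean())

# A sudden jump in shear that isn't matched by increased pressure is a
# hint that the object is starting to slip in the gripper.
if shear.max() > 2.0:                              # made-up threshold
    print("possible slip detected at marker", shear.argmax())
```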

Introducing TacBench: A Standard for Robot Touch

Dalton: And then they made this other thing called TacBench, and they standardized a benchmark system for the evaluation of touch representations. The things that they standardized were slip detection, pose estimation, grasp stability, texture recognition, and force estimation. How can the model properly detect objects slipping digitally?

Dalton: Pose estimation is understanding where an object is in 3D space and understanding the distance between you and the object. There's grasp stability, texture recognition, and understanding how that might affect how much force is required and how that will affect your grip, which would affect your slip.
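A common way to benchmark learned touch representations, and roughly the spirit of TacBench's evaluation, is to freeze the pretrained encoder and train only small task heads on top of it, one per task. The sketch below is illustrative, not the official TacBench code; the backbone, head sizes, and fake data are stand-ins.

```python
# Schematic of the "frozen backbone + small task heads" evaluation idea
# (illustrative; the real benchmark uses real tactile data and encoders).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))  # stands in for Sparsh
for p in backbone.parameters():
    p.requires_grad_(False)            # representations stay frozen

slip_head = nn.Linear(128, 2)          # slip / no-slip classification
force_head = nn.Linear(128, 3)         # (Fx, Fy, Fz) regression

opt = torch.optim.Adam(
    list(slip_head.parameters()) + list(force_head.parameters()), lr=1e-3
)

tactile_batch = torch.rand(8, 1, 32, 32)       # fake tactile images
slip_labels = torch.randint(0, 2, (8,))
force_labels = torch.rand(8, 3)

features = backbone(tactile_batch)             # frozen features
loss = (nn.functional.cross_entropy(slip_head(features), slip_labels)
        + nn.functional.mse_loss(force_head(features), force_labels))
opt.zero_grad(); loss.backward(); opt.step()   # only the heads are updated
```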

The Brutal Training: Bead Mazes and Masked Videos

Dalton: Then there's the bead maze problem, which I had no idea about. A bead maze is a children's toy that's used for hand-eye coordination. It's a wired toy that has beads on it, and you try to get the bead through the maze. They do the same thing with the robot, which sounds like torture because they don't train it on what the right answer is.

Dalton: You've got to think about it. It's like a blind man trying to go through a maze with only his hand. It's in the dark, can't see, only has its hand. So it could feel that, one, I am pressing on the bead, I'm holding it, and I'm moving with little resistance. Okay, so this might be the right answer. And then you keep moving and then there's resistance that comes up. That's the wrong answer. It just sounds like torture to me.

Dalton: There was this other thing they were talking about, which was tube masking for video. Masked video modeling is a technique of randomly hiding 30 to 40 seconds of a video and having the model predict what happened. That sounds brutal as well. Some of this stuff just sounds straight up like torture. This bead maze thing just sounds excruciating.
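For the masked-video idea, a toy version of tube masking looks like the following: hide the same spatial patches across every frame of a clip and train a model to reconstruct what was hidden. The patch size and the 40% masking ratio here are made up for illustration, not values from the paper.

```python
# A toy illustration of tube masking for masked video modeling: hide the
# same spatial patches across every frame and ask a model to fill them in.
import torch

video = torch.rand(16, 1, 32, 32)             # (frames, channels, H, W)
patch = 8
grid = (32 // patch) ** 2                      # 16 spatial patches per frame

mask_ratio = 0.4                               # illustrative ratio
hidden = torch.rand(grid) < mask_ratio         # choose patches to hide

masked = video.clone()
for idx in torch.nonzero(hidden).flatten():
    r, c = divmod(int(idx), 32 // patch)
    # The same patch is blanked in every frame -- a "tube" through time.
    masked[:, :, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0

# A masked autoencoder would now be trained to reconstruct `video` from `masked`.
print("patches hidden:", int(hidden.sum()), "of", grid)
```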

Visualizing Force with Heat Maps

Dalton: So how does Sparsh understand these tactile images? It has a standardized benchmark. It understands and was trained on tactile data. When it's in action, it's basically just understanding and visualizing the force fields. The way that Meta did it was they have a pixelized grid map.

Dalton: To understand the force and the force manipulation that was required, it would have a heat map of where the object was touched and then how hard—how much force was put on the object with the grip, where it was touched, and how it was touched. Then you can measure, okay, you grabbed the object here, you slipped, or you didn't grab it hard enough, and you can understand what was going wrong. Their method makes a lot more sense than the other method. Visualization is always easier to understand.
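The force heat map can be pictured as a per-pixel grid of estimated normal force over the sensor surface. The sketch below just plots a synthetic example of such a grid; the values are invented, not output from Sparsh.

```python
# A minimal sketch of the "heat map of force" idea: a per-pixel grid of
# estimated normal force over the sensor surface (synthetic values here).
import numpy as np
import matplotlib.pyplot as plt

h, w = 32, 32
yy, xx = np.mgrid[0:h, 0:w]

# Pretend the object was pressed near one spot on the gel pad.
center = (14, 20)
force_map = np.exp(-(((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / 40.0))

plt.imshow(force_map, cmap="hot")
plt.colorbar(label="estimated normal force (arbitrary units)")
plt.title("Where and how hard the object was touched")
plt.show()
```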

The Real-World Implications of Sparsh

Dalton: In a general sense, you're utilizing the generalization of touch, standardizing the benchmarks required to understand what touch is, and then you're visualizing how your grip interacted with the object.

Dalton: So it's about understanding the right amount of force to prevent damage while also not allowing slippage. How do you manipulate objects without crushing them? Performing tasks that require precise control, like assembly or surgery.

Dalton: Has this stuff worked? Yes, it has. If you listened to the last episode, I talked about how it was like 93% better than the other methods. It's been outperforming traditional force estimation models. And that is a factor of, one, standardizing the data; two, having a good generalized approach with self-supervised learning; and then understanding and visualizing these touches and how this works.

Final Thoughts on the Sparsh Paper

Dalton: I think that gives you a general sense of how this all works. I think I provided enough information for this podcast episode to be useful. I encourage everyone, if you're interested, to read the paper. I think it's a well-done paper.

Dalton: I worked late and didn't necessarily want to do this podcast episode, but I read the paper and was so energized, intrigued, and overall fascinated by the information and how they went about solving the problems.

Dalton: I thought it was quite interesting, the difference in performance when you trained on one-third of the data, half the data, or a tenth of the data. It seems like in certain scenarios, you could get away with only training on a tenth of the data. It's only about a 0.5% difference. There are some cool snippets that I didn't necessarily get to talk about in this episode.

Dalton: If you thought this podcast was interesting, give me a comment and let me know what you think, what you found interesting, or if you have any additional insights if you've worked in this space. Once again, I'm not a robotics researcher. I do not work in computer vision. I work in insurance, but I did find this paper quite interesting.

RESOURCES MENTIONED

  • Research Paper: "Sparsh: Self-Supervised Touch Representations for Vision-Based Tactile Sensing" by Meta FAIR
  • Research Paper: Herding the Llamas
  • Companies: Meta, FAIR, OpenAI, Anthropic, Google
  • Podcast Platforms: YouTube, Apple Podcasts, Spotify

INDEX OF CONCEPTS

Sparsh, Meta, FAIR, Self-Supervised Learning, Supervised Learning, TacBench, Pose Estimation, Slip Detection, Grasp Stability, Texture Recognition, Force Estimation, Bead Maze, Tube Masking (Masked Video Modeling), Vision-Based Tactile Sensors, OpenAI, Anthropic, Google, Herding the Llamas