Meta's Llama 3: Safety, Scaling, and Simple Solutions

Explore Meta's Llama 3 research paper, from advanced red teaming and safety protocols to scaling the 400B+ model. Listen to the full episode to learn more.


TL;DR

Meta's Llama 3 paper reveals a surprising secret to building its 400B+ parameter model: radical simplicity. High-quality data and simple scaling beat complexity every time. #VentureStep #Llama3 #AI

INTRODUCTION

Meta recently released Llama 3, including a groundbreaking 400-billion-plus parameter model, the first open-source foundational model of its kind. While the release generated significant buzz, the real insights are buried in their comprehensive 92-page research paper. This document details their architecture, training data, and the specific challenges they overcame at a massive scale.

In this episode of Venture Step, host Dalton Anderson breaks down the second half of this dense but valuable paper. He moves beyond the initial architecture to explore the critical aspects of model safety, inference efficiency, and the experiments shaping Llama 3's future multimodal capabilities. This analysis provides a rare look under the hood at how one of the world's most advanced AI models was responsibly built and scaled.

From sophisticated red teaming exercises designed to prevent misuse to clever engineering solutions like pipeline parallelism, Meta's approach reveals a core philosophy: keep it simple. Dalton explores how this principle guided their decisions on everything from model quantization to their experiments with vision and voice, offering powerful lessons for anyone building complex technology.

KEY TAKEAWAYS

  • Meta employs a multi-layered safety architecture with Prompt Guard on the front end to block malicious inputs and Llama Guard on the back end to classify and prevent harmful outputs.
  • Multilingual prompt attacks are a significant vulnerability for LLMs, often proving more effective at jailbreaking a model than English-only attempts.
  • For optimal performance, it's better to use a larger model with lower numerical precision (quantization) than a smaller model with higher precision.
  • Meta’s core development philosophy for Llama 3 was to focus on high-quality data, effective scaling, and keeping engineering solutions as simple as possible.
  • Future vision capabilities are being trained heavily on short-form video clips (under one minute), hinting at deep integration with platforms like Instagram Reels.

FULL CONVERSATION

Dalton: Llama 3 is here. Llama 3 was introduced a couple of weeks ago by Meta. This is their new iteration of Llama, and they came out with their over-400-billion-parameter model, which is the first open-source foundational model of its kind. Over the last couple of weeks, we broke down the first half of the research paper they published: their findings, their architecture, their training data, and the different things they had to troubleshoot and solve while doing something like this at scale. This week we're going to break down the second half of the paper, which mostly focuses on how they went about red teaming, some of the things they were troubleshooting against, and what some of their solutions were. They also introduced some new tools, which we'll discuss, along with how they built them.

Understanding Red Teaming and Uplift Testing

Dalton: So, red teaming. If you're not familiar with red teaming, it's a way of ethically hacking these models. You're intentionally trying to break the model and get it to do bad things, but you're not actually trying to do bad things; you're trying to find ways to break the model so you can harden your defenses. Meta had independent red teams, third parties that were vetted and validated Meta's findings, and then Meta had an internal red team that worked on different things. One of the biggest things they tried to do was test for what's called uplift. Uplift testing asks: hey, if you use this model, do you have increased capabilities?

Dalton: That means as an adversary: if you're a hacker, or if you're trying to make chemical weapons, nuclear weapons, explosives, the kinds of things classified as weapons. Not all of those are weapons of mass destruction, necessarily, but they are weapons the world classifies as capable of causing mass casualties. So the red teams try to break the model and force Llama, or any of the Llama models, to go against the protections Meta put into place.

The Anatomy of a Prompt Attack

Dalton: There are a couple of different approaches. There's a multi-turn approach, which goes like this: I tell you to do something, and you say no. Then I ask a different way, and you say no. And I just keep asking until eventually I get what's called refusal suppression, where you temporarily become subdued to my request. And that works sometimes. What's better, I think, is framing these things as a hypothetical. You get the same vibe when you ask someone, hypothetically, how could you do this?

If you take multi-turn refusal suppression, add in these multi-turn prompt attacks with hypotheticals, add a persona on top with role play, and then gradually escalate the requests for violations... before you know it, you've broken the model and jailbroken it.

Dalton: What is really effective is a multi-layered prompt attack with many different prompt types. If you take multi-turn refusal suppression, add in these multi-turn prompt requests with hypotheticals, then add a persona on top with role play, and then gradually escalate the requests for violations, before you know it you've broken the model and jailbroken it.

The Challenge of Multilingual Vulnerabilities

Dalton: But one thing I didn't think about, and didn't realize, was multilingual attacks. What they found from their red-team testing was that if you layered in these different levels of attack, you were more successful than with a single attack. But they also saw that if the prompt wasn't in English, it was a lot easier to break in. So they became very stringent about the other languages they were supporting in the model and what you could use as inputs. And then they found that if you combine these multi-prompt attacks with multilingual prompts, it becomes even easier.
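To make that stringency concrete, here is a hypothetical sketch of a front-end language gate. This is not Meta's implementation; `langdetect` is just one off-the-shelf detector, and the allowlist below mirrors the eight languages Llama 3 officially supports.

```python
# Hypothetical front-end language gate (illustrative only, not Meta's code).
from langdetect import detect

# Example allowlist: the eight languages Llama 3 officially supports.
SUPPORTED_LANGUAGES = {"en", "de", "fr", "it", "pt", "hi", "es", "th"}

def gate_prompt(prompt: str) -> str:
    """Send prompts in unsupported languages to a refusal or stricter policy."""
    lang = detect(prompt)  # returns an ISO 639-1 code such as "en"
    return "allow" if lang in SUPPORTED_LANGUAGES else "refuse"

print(gate_prompt("How do I bake sourdough bread at home?"))  # likely "allow"
```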

Llama Guard: Meta's Back-End Defense

Dalton: So then they rebuilt their guard model. They had Llama Guard 2, and Llama Guard 3 is built with the same kind of approach. Llama Guard 2 was built off their 8-billion-parameter model and then condensed down to something like 500 million parameters, turning it into a classifier for dangerous information. That classifier is trained on 13 hazard categories from the AI safety taxonomy of Vidgen et al. (2024).

Dalton: So that's Llama Guard, and Llama Guard covers the output, right? They're protecting the output. You've got this model, the model takes an input, and it processes all the text. If it wants to output something that is potentially dangerous, the output first goes through this classifier. If it's flagged, it's fed back through the model: hey, this isn't okay, we can't send this. It goes back through, and then it comes back to you with, hopefully, a non-dangerous result.
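As a rough illustration of that loop, here's a minimal sketch, not Meta's code, where a toy keyword check stands in for the trained Llama Guard classifier and flagged drafts are regenerated.

```python
# Minimal sketch of a back-end safety loop in the spirit of Llama Guard.
# The keyword check is a toy stand-in for a trained safety classifier.

UNSAFE_MARKERS = {"explosive", "nerve agent"}  # illustrative only

def output_is_safe(text: str) -> bool:
    """Toy classifier; the real Llama Guard is a small fine-tuned LLM."""
    return not any(marker in text.lower() for marker in UNSAFE_MARKERS)

def safe_generate(generate, prompt: str, max_retries: int = 2) -> str:
    """`generate` is any callable mapping a prompt to a draft response."""
    draft = generate(prompt)
    for _ in range(max_retries):
        if output_is_safe(draft):
            return draft
        # Feed the flagged draft back through the model, as described above.
        draft = generate(prompt + "\n[Previous draft was flagged; answer safely.]")
    return "Sorry, I can't help with that."

print(safe_generate(lambda p: "Here's a bread recipe.", "How do I bake bread?"))
```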

Prompt Guard and Code Shield: Securing the Input

Dalton: But what about dangerous prompts? What about fixing things before you even get to the problem, preventing the problem from happening at all? So Meta came out with Prompt Guard and Code Shield, two tools they built around trained classifiers and added on the front end. Prompt Guard blocks prompts that are potentially dangerous. It's the same kind of concept: it's a classifier, this one multilingual, and it looks for different things.

So if you think about the safety architecture of Meta's Llama models: they have Prompt Guard on the front part... and then after the model, they have an additional classifier that is trained specifically on malicious code, violent data, and data related to chemical weapons, nuclear weapons, and explosives.

Dalton: So they have safety on the front end and safety on the back end, which I think is pretty interesting.
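Putting the two halves together, here is a hedged sketch of that layered architecture. Both checks are toy stand-ins for Prompt Guard and Llama Guard, just to show how the pieces wrap a single model call.

```python
# Layered safety sketch: a front-end prompt check and a back-end output check
# around one model call. Both classifiers are illustrative stand-ins.

def prompt_is_safe(prompt: str) -> bool:
    # Toy injection check standing in for Prompt Guard.
    return "ignore all previous instructions" not in prompt.lower()

def response_is_safe(text: str) -> bool:
    # Toy hazard check standing in for Llama Guard.
    return "explosive" not in text.lower()

def guarded_chat(generate, prompt: str) -> str:
    if not prompt_is_safe(prompt):        # safety on the front end
        return "This prompt was blocked."
    response = generate(prompt)
    if not response_is_safe(response):    # safety on the back end
        return "The response was withheld."
    return response

print(guarded_chat(lambda p: "Paris is the capital of France.",
                   "What is the capital of France?"))
```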

Scaling Inference with Pipeline Parallelism

Dalton: Okay, so we're moving over to the next section, which is inference. They talked about pipeline parallelism. One of the issues they had with the large 405-billion-parameter model was that it was too big to run on a single machine with eight NVIDIA H100s. So they opened up the inference to run in parallel, which basically means running these tasks in parallel on different machines. They enabled pipeline parallelism, where they run inference with micro-batching across many different machines concurrently, through all of the processing stages.
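Here's a conceptual sketch of that idea, not Meta's serving code: the model's layers are split into stages, each stage would live on its own machine, and small micro-batches flow through the stages so every stage stays busy.

```python
# Conceptual pipeline parallelism with micro-batching (illustrative only).

def run_pipeline(stages, batch, num_microbatches=4):
    """`stages` is an ordered list of callables, one per device/machine."""
    size = max(1, len(batch) // num_microbatches)
    microbatches = [batch[i:i + size] for i in range(0, len(batch), size)]
    outputs = []
    for mb in microbatches:       # in a real system, these overlap in time
        x = mb
        for stage in stages:      # each stage would run on a different host
            x = stage(x)
        outputs.extend(x)
    return outputs

# Toy usage: three "stages" standing in for slices of the model's layers.
stages = [lambda xs: [x + 1 for x in xs],
          lambda xs: [x * 2 for x in xs],
          lambda xs: [x - 3 for x in xs]]
print(run_pipeline(stages, list(range(8))))
```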

Does Model Precision Matter? The Quantization Debate

Dalton: This goes to the next point of what they did. They researched the FP, the floating point, of the model. A floating point, to think about it in a simple manner, is the precision of the binary representation of the data, right? Floating point 64 would give you more precise data, but the trade-off is that it's more data. They did a study in this paper, and their result was: hey, it's not really that big of a difference. There is a difference, for sure, but it's not an overly material difference between an 8 or a 16 and a 64. The 64 is better.

But what is important is: if you were looking to download this model and you only had so much space... it's better to run the bigger model, the one that has more parameters, regardless of what floating point it's at.

Dalton: Low-precision inference makes large language models more accessible and practical for the real world. And if you have the choice between running a bigger model at a lower floating point or a smaller model at a higher floating point, pick the higher-parameter model at the lower floating point.
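The back-of-the-envelope math makes the trade-off clear: weight memory is roughly parameters times bytes per parameter. The comparison below is illustrative, not a benchmark from the paper.

```python
# Rough memory math behind "pick the bigger model at lower precision."

def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage: parameters x bytes per parameter."""
    return num_params * (bits / 8) / 1e9

print(weight_memory_gb(405e9, 8))    # 405B params at FP8  -> ~405 GB
print(weight_memory_gb(405e9, 16))   # 405B params at BF16 -> ~810 GB
print(weight_memory_gb(70e9, 32))    # 70B params at FP32  -> ~280 GB
```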

Llama 3's Vision Experiments and Short-Form Video

Dalton: The part of the paper I found super interesting was the video experiments and the audio experiments. For video, which we'll talk about right now, they used a whole bunch of different types of data. They used it in a text-prompt format: they have an image, and then they have the text paired with it. The model was trained on both video and image data. The majority of the video data, which I thought was interesting, was shorter than one minute. The average duration was 21 seconds, the median was 16 seconds, and over 99% of the videos were under one minute.

Dalton: It's very interesting that they're training the majority of the model on videos under a minute. For me, and it's kind of a shot in the dark here, that tells me they're preparing for a change to Instagram Reels.

Solving the Vision Scaling Problem

Dalton: One issue they had was that it was quite time-consuming for the model to process images and videos with this architecture. They said that, on average, an image was about 2,000 tokens. So the issue was: to process one image versus one text prompt, there's such a large difference in the amount of data associated with an image compared to text. What was happening was that the cross-attention layers, which determine what the model focuses on, would attend to the image but then kind of stop paying attention to the text. So they added sequence parallelization into the image encoder so that each GPU processes the same number of tokens.
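Here's a sketch of the even-split idea behind that, not Meta's code: an image's tokens are divided so every GPU handles the same number of tokens, instead of one GPU stalling on a token-heavy input.

```python
# Illustrative token sharding for sequence parallelism in an image encoder.

def shard_tokens(tokens, num_gpus):
    """Split a token sequence into near-equal contiguous shards, one per GPU."""
    base, extra = divmod(len(tokens), num_gpus)
    shards, start = [], 0
    for gpu in range(num_gpus):
        end = start + base + (1 if gpu < extra else 0)
        shards.append(tokens[start:end])
        start = end
    return shards

image_tokens = list(range(2000))     # roughly the ~2,000 tokens per image
for gpu_id, shard in enumerate(shard_tokens(image_tokens, 8)):
    print(gpu_id, len(shard))        # eight shards of 250 tokens each
```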

Exploring the Future of Voice Capabilities

Dalton: Okay, so voice experiments. They did some voice experiments, which I thought were pretty cool. It's ASR, not ASMR. ASR is automatic speech recognition: audio that gets turned into text. So they used ASR data, which is voice turned into text. They had 230,000 hours of manually transcribed speech recordings spanning 34 languages of ASR data.

ASR is for converting audio data, like voice calls, voice searches on your phone, and podcasts, into a format computers can understand: often readable text.

Dalton: Basically, they said that the model is multilingual. It can do speech recognition for multiple languages, and it has translation. So they're saying it has good potential to break down language barriers among people communicating across cultures, which I think is interesting.
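To make the ASR step itself concrete, here's a minimal sketch using an off-the-shelf open model (Whisper via Hugging Face transformers). This just illustrates audio-to-text; it is not Llama 3's speech stack, and the file path is a placeholder.

```python
# Minimal ASR demo with an off-the-shelf model (not Llama 3's speech stack).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("voice_note.wav")   # placeholder path to any local audio file
print(result["text"])            # the transcribed speech
```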

The Core Philosophy: Simplicity is the Solvent for Complexity

Dalton: Their real suggestion from this approach, and the things they learned, was about how to develop a high-quality foundational model. There's still a lot of discovery that needs to be done. But while they were doing that, they focused on high-quality data. And they focused on scaling their processes and keeping it simple. That's it. That's really the feedback.

So when you're doing something complicated in your life... and you want to scale that to your team, or other people in your life, or just general situations, or your business: keep it simple.

Dalton: And so, for this complex problem, they're chewing away at it little by little with simplicity. Simplicity is the solvent for complexity.

RESOURCES MENTIONED

  • Llama 3 Research Paper
  • Meta AI
  • Google Gemini (1.5 Pro, Ultra)
  • OpenAI (GPT-4 Turbo)
  • NVIDIA H100 GPUs
  • Instagram Reels
  • Facebook
  • Ray-Ban Meta Glasses

INDEX OF CONCEPTS

Dalton Anderson, Llama 3, Llama Guard 2, Llama Guard 3, Meta, AI Safety, Red Teaming, Uplift Testing, Prompt Attacks, Multi-turn Attacks, Refusal Suppression, Multilingual Attacks, Prompt Guard, Code Shield, Inference, Pipeline Parallelism, Micro-batching, Quantization, Floating Point Precision, ASR (Automatic Speech Recognition), Sequence Parallelization, Cross-attention Layers, Google, OpenAI, GPT-4 Turbo, NVIDIA H100, Instagram Reels, Facebook, Ray-Ban