Meta's Llama 3.1: Inside the AI Research Paper
TL;DR
Meta's Llama 3.1 paper reveals a focus on a stable transformer architecture over Mixture of Experts and massive in-house infrastructure to train its models on 15 trillion tokens. #VentureStep #AI #Llama3
INTRODUCTION
The release of powerful AI models like Meta's Llama 3.1 often focuses on their impressive capabilities, but what's happening under the hood? Understanding the architecture, training methodologies, and safety protocols is crucial for entrepreneurs and developers looking to leverage these tools effectively. The research papers accompanying these models are treasure troves of information, but their density can be a significant barrier to entry.
In this episode of VentureStep, host Dalton Anderson dives deep into Meta's research paper, "The Llama 3 Herd of Models," to demystify the complex processes behind their latest model. The discussion breaks down the first half of the paper, focusing on the model's architecture, the extensive pre-training and post-training processes, and the critical safety measures Meta implemented.
From choosing a stable transformer architecture to meticulously cleaning 15 trillion tokens of data, this analysis reveals the sheer scale and complexity of building a state-of-the-art large language model. It's a look at the deliberate decisions, the immense hardware infrastructure, and the iterative problem-solving required to push the boundaries of artificial intelligence.
KEY TAKEAWAYS
- Meta prioritized stability by choosing a dense transformer architecture over the more complex Mixture of Experts (MoE) model for Llama 3.1.
- The 405B foundational model was trained on 15 trillion tokens, which were meticulously cleaned using a multi-step deduplication process across URLs, documents, and domains.
- Meta developed extensive in-house infrastructure, including a custom job scheduler and file system, to manage the immense complexity and hardware challenges of training at scale.
- To enhance capabilities in niche areas like math and coding, Meta created specialized "expert" models to generate high-quality, synthetic training data for Llama 3.1.
- The model's safety was rigorously tested through internal red teaming and uplift testing to identify and mitigate risks like insecure code generation and prompt injection.
FULL CONVERSATION
Dalton Anderson: Welcome to VentureStep podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. We're continuing our series on Meta's release of Llama 3.1. Last week, we touched on Meta's AI Studio, where you can make an AI agent. Two weeks ago, we talked about the release of the models and what it means to have an open source foundational model in the market now.
Dalton Anderson: Today, we're going to be breaking down half of the research paper they published, "The Llama 3 Herd of Models." In this paper, they break down how they went about the architecture, how they figured out what size model they needed, and what tokens they should use: the spread of context, like long context, short context, coding, non-coding, multilingual. They also cover how they went about troubleshooting issues, including root-cause discovery of why their system was erroring out during training. And it is quite in-depth. It talks about the algorithms used for filtering and how they went about deduping data. It cites the other research papers they drew on to solve things; they cite OpenAI several times, they cite other groups, and it is quite extensive and quite dense. I read 46 pages, and that includes quite a bit of graphics, so probably around 26 to 30 pages of actual text with no images. And it took me about three hours because it is just so dense.
Dalton Anderson: Today's focus for this first half is going to be a quick review of the herd of models and what that is: the different models and their capabilities, and the foundational model, like what a foundational model is. Then we're going to be talking about pre-training and post-training. And then safety. Those are the three things that were touched on.
The Significance of Open Source AI
Dalton Anderson: I would like to emphasize that you should read the paper yourself. I'm definitely not an expert. I'm not a scientist. I wish I was sometimes. It is nice to have in your back pocket an understanding of all the work that goes into creating something like this. And then there's the thought of someone handing you all this information on a silver platter: a paper with links to the data they used, the papers they drew on for their analysis, their thought process, the way they troubleshot, the systems they had to create, and the analyses they ran.
What Meta is doing with open sourcing their research and their model is huge. Like it hasn't been done before at this scale. And it's super cool.
Dalton Anderson: And I'm curious how everything plays out, because it's a great model. It's on par with GPT-4. That being said, it's open source, and the licensing for this model is open to everything. It's a free-use license. That means you could use it for enterprise. You could copy the code, repackage it as Dalton's LLMs, and then I could make a business out of it. I could change the model just a little bit and make it mine. It's just super cool.
What Are the Llama 3.1 "Herd of Models"?
Dalton Anderson: So now we're transitioning over to the herd of models, which is the name of the paper. The herd of models is three models: the 8, the 70, and the 405 billion. Those are the parameter counts, so 8 billion, 70 billion, and 405 billion parameters. And parameters are like weights. Think about it as different scales. If you had a room with eight scales lined up, each measuring something different, they would all have different weights and importance to the final outcome. And the final outcome is the combination of all those different weights.
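To make the "scales" analogy a bit more concrete, here is a minimal sketch (not from the paper) of how parameters act as weights that combine inputs into a single output. The numbers are made up purely for illustration.

```python
import numpy as np

# Toy illustration of "parameters are weights": eight inputs (the "scales"),
# each with a learned weight, combined into one output. Llama 3.1 does this
# with billions of such weights arranged into transformer layers.
inputs = np.array([0.2, 1.5, 0.7, 3.1, 0.0, 2.2, 1.1, 0.4])            # eight measurements
weights = np.array([0.05, 0.30, 0.10, 0.20, 0.01, 0.15, 0.12, 0.07])   # eight parameters

output = np.dot(weights, inputs)  # the final outcome is the weighted combination
print(output)
```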
Dalton Anderson: The models are quite good. The 70 billion and the 8 billion are best in class, meaning they are better than the other models in their size range across the board. The 405 billion, Meta's first foundational model, wins or ties against the best models the majority of the time. It doesn't lose very much. I think it wins or ties like 70-plus percent of the time.
Building the 15 Trillion Token Training Set
Dalton Anderson: The foundational model, the 405, was trained on 15 trillion tokens. 15 trillion tokens is a lot, obviously. The way they went about creating a high-quality corpus is worth walking through; a corpus is just a collection of text, normally unstructured data. The 15 trillion tokens are the final set they trained on: they started with much more and cut it down.
How Data Deduplication and Cleaning Works
Dalton Anderson: They did a deduping process. They deduped by URL. They deduped at the document level, which uses MinHash. For line-by-line deduplication, their rule was to remove a line if it appeared more than six times in a bucket of 30 million documents, which is a lot of documents, and those are really tight parameters. Then they did domain-level deduplication, and they also removed domains with known adult content.
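Here is a simplified sketch of that multi-level deduplication idea, using only the Python standard library. The paper uses MinHash for document-level near-duplicate detection; this sketch substitutes exact hashing for brevity, and the `url`/`text` document schema is an assumption for illustration.

```python
import hashlib
from collections import Counter

def dedup(docs, line_bucket_threshold=6):
    """docs: list of dicts with 'url' and 'text' keys (hypothetical schema)."""
    seen_urls, seen_doc_hashes = set(), set()

    # Count how often each line appears across the whole bucket of documents.
    line_counts = Counter(
        line.strip() for d in docs for line in d["text"].splitlines() if line.strip()
    )

    kept = []
    for d in docs:
        # 1) URL-level dedup: keep only one copy per URL.
        if d["url"] in seen_urls:
            continue
        seen_urls.add(d["url"])

        # 2) Document-level dedup (exact hash stand-in for the paper's MinHash).
        doc_hash = hashlib.sha256(d["text"].encode()).hexdigest()
        if doc_hash in seen_doc_hashes:
            continue
        seen_doc_hashes.add(doc_hash)

        # 3) Line-level dedup: drop lines that occur more than the threshold
        #    (the paper uses >6 occurrences per bucket of ~30M documents).
        lines = [l for l in d["text"].splitlines()
                 if line_counts[l.strip()] <= line_bucket_threshold]
        kept.append({**d, "text": "\n".join(lines)})
    return kept
```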
Dalton Anderson: What they're talking about is that they scoured the internet. They built an HTML scraper that goes out onto the internet, scrapes documents and text and whatever else it can get a hold of, and then pulls all that information back to Meta. Meta runs it through these text extraction and cleaning methods, which take the unstructured data from the websites. Before you know it, you're six sections in from the about section, and then you repeat that same process for different sections of the website.
Extracting and organizing unstructured data is quite time consuming. It takes a lot of time to test, come back, and test again.
Dalton Anderson: It's basically just an iterative process where you do something, test it, that didn't work. Do something again, test it, that was somewhat good. Do something, test it, that looks great. All right, and then you move to the next thing. It's a repetitive process of test, repeat, test, repeat until you get to the result you're looking for.
Choosing an Architecture: Transformer vs. Mixture of Experts
Dalton Anderson: Now that we've talked about how they took out the information they didn't want, let's talk about the model architecture. The model architecture mirrors the Llama 2 architecture, which is a dense transformer architecture. A dense transformer is a relatively simple architecture built around a self-attention mechanism, which allows the model to weigh the importance of different words in the input when generating the output.
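As a rough illustration of self-attention (a toy single head with random weights, not Llama's actual implementation), the sketch below shows how each token's output becomes a weighted mix of every token's values, with the weights computed from query-key similarity.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention sketch (toy random weights)."""
    d = X.shape[-1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                     # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mix of the tokens' values

# Four "tokens", each an 8-dimensional embedding.
tokens = np.random.default_rng(1).normal(size=(4, 8))
print(self_attention(tokens).shape)  # (4, 8)
```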
Dalton Anderson: There's a different approach they could have taken, which is called Mixture of Experts, which is basically training each section of the model as an individual expert. You have a coding expert, a text expert, a text-to-speech expert, and then you combine all these different experts into one. Mixture of Experts is very powerful and efficient, but the issue with training large models is that it is difficult to scale up and it is prone to instability. Choosing the transformer architecture allowed Meta to focus more on the training and less on the instability of training a model of this size and scale, because it is already complicated enough.
They said, "Hey, this is a super complex thing that we're trying to do right now. And the easiest thing to do is to try to minimize the level of complexity and to do that, let's pick the transformer architecture." 86
Dalton Anderson: The transformer architecture is also the architecture that OpenAI has reported using. Overall, it just allows for less worry.
Using Scaling Laws to Determine Model Size
Dalton Anderson: How did they find out how big their model should be? The Meta team provided some documentation regarding the methodology they utilized: scaling laws. Scaling laws are a way for people or companies to understand the optimal size of a model given the available compute resources. It's like a formula you plug into: okay, I've got this amount of compute available and this is the training set that I have, what should I do? The scaling laws determined that their foundational model should be 405 billion parameters. For their 70 and 8 billion models, they trained them longer than they needed to, just to improve the models' inference speed.
So for their 70 and their 8, they trained them longer than they needed to and didn't follow the scaling laws, in order to make the models faster and better.
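For a back-of-the-envelope feel for scaling laws, here is a sketch using the common approximation that training compute is roughly 6 × parameters × tokens. Meta fits its own scaling laws in the paper, so these numbers are only illustrative.

```python
# Rough sketch of the scaling-law idea using the common C ≈ 6 * N * D
# approximation (compute ≈ 6 * parameters * training tokens).

def training_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

# Approximate compute for the 405B model on 15T tokens:
flops_405b = training_flops(405e9, 15e12)
print(f"{flops_405b:.2e} FLOPs")  # roughly the order of magnitude of a frontier-scale run

# The smaller models were trained well past their compute-optimal token count
# ("over-trained") to make them cheaper and faster at inference time.
flops_8b = training_flops(8e9, 15e12)
print(f"{flops_8b:.2e} FLOPs")
```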
Deconstructing the Training Data Mix
Dalton Anderson: They also used the 405 to help train the 70 and the 8 billion models. They did so with model-based filtering of the data, splitting out code and reasoning data and including multilingual data. Their final data mix is roughly 50% general-knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
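As a toy illustration of what a data mix like that might look like in practice, here is a sketch that samples training categories in those proportions. The category names and the sampling scheme are assumptions for illustration, not Meta's pipeline.

```python
import random

# Reported data mix: 50% general knowledge, 25% math/reasoning,
# 17% code, 8% multilingual.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def sample_category(rng=random):
    """Pick which bucket the next training document is drawn from."""
    categories, weights = zip(*DATA_MIX.items())
    return rng.choices(categories, weights=weights, k=1)[0]

print(sample_category())
```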
Understanding Tokens and Attention Mechanisms
Dalton Anderson: If you're not familiar with tokens, a token is a unique key that stands in for a piece of input. Think about a token as a number. The reason these models use tokens is that a model can't read text. If you feed a model "the," the model doesn't know what "the" is. So what you do is tokenize your unstructured data, and once you tokenize it, "the" gets turned into a number that the model can read.
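A toy example of that mapping, assuming a tiny made-up word-level vocabulary (Llama 3.1 actually uses a byte-pair-encoding tokenizer with a vocabulary of roughly 128K tokens):

```python
# Tiny word-level vocabulary, just to show the idea of mapping text to numbers.
vocab = {"the": 1, "llama": 2, "herd": 3, "of": 4, "models": 5, "<unk>": 0}

def tokenize(text: str) -> list[int]:
    """Map each word to its token id; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The herd of llama models"))  # [1, 3, 4, 2, 5]
```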
Dalton Anderson: They used grouped query attention, where groups of query heads share the same key and value heads. This allows for some computational benefits: it refines what's being attended to, and that refinement gives them either more computational efficiency or easier decoding. They also use an attention mask, which is used to improve long-context questions and answers. It prevents the model from attending across the different documents that are packed into the same training sequence.
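Here is a small sketch of that document-boundary masking idea: when several documents are packed into one training sequence, each token is only allowed to attend to earlier tokens from the same document. The implementation below is an illustration, not Meta's code.

```python
import numpy as np

# Which document each token in the packed sequence came from.
doc_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])

same_doc = doc_ids[:, None] == doc_ids[None, :]                    # block-diagonal structure
causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))  # only attend to earlier tokens
mask = same_doc & causal                                            # allowed attention positions

print(mask.astype(int))
```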
The Pre-Training Process: Annealing and Checkpointing
Dalton Anderson: The next step would be annealing, which is the final step of pre-training. Annealing is basically when they slowly starve out the model's learning rate. If you don't know what a learning rate is, it's how fast the model is allowed to learn from the training data. Basically, they reduce the learning rate gradually until it reaches zero. Annealing allows for subtle adjustments to the parameters based on a high-quality data set used for this phase. They would save checkpoints, including the parameters of the model, and then compare the changes made across the checkpoints of the training run.
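A minimal sketch of the annealing idea, assuming a simple linear decay of the learning rate to zero and plain checkpoint averaging; the exact schedule, peak learning rate, and averaging scheme Meta used are not reproduced here.

```python
import numpy as np

def annealed_lr(step, total_steps, peak_lr=1e-5):
    """Linearly decay the learning rate to zero by the final step."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

checkpoints = []          # parameter snapshots saved during annealing
params = np.zeros(4)      # stand-in for the model's parameters
for step in range(1, 1001):
    lr = annealed_lr(step, total_steps=1000)
    params -= lr * np.random.default_rng(step).normal(size=4)  # fake gradient step
    if step % 250 == 0:
        checkpoints.append(params.copy())   # "save the checkpoint"

averaged = np.mean(checkpoints, axis=0)      # checkpoint averaging
print(averaged)
```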
The Massive Compute Power Behind Llama 3.1
Dalton Anderson: The compute they used was 16,000 H100s, which is, I think, something like $600 million to $800 million worth of GPUs that they're utilizing for these AI workloads. The servers are Meta's Grand Teton and Tioga Pass AI servers. Each server contains eight GPUs and two CPUs. They have a job scheduler, also made by Meta, called MAST, Meta's global-scale training scheduler. They have their own file system as well: the Tectonic distributed file system, which apparently has a throughput of two to seven terabytes per second.
78% of the interruptions were from confirmed or suspected hardware issues.
Post-Training: Pruning and Refining the Data
Dalton Anderson: Post-training of Llama refers to the data pruning section, enhancing the code capabilities, multilingual expert training, and then the challenges of mathematical reasoning and the methodology to improve it. They had topic classification, quality scoring, difficulty scoring, and semantic deduplication. Quality scoring was a reward score that would rate the accuracy of instructions or answers to questions, and only higher-scoring examples would be fed through. One of the things they did was remove data that was excessive, like emojis and exclamation points. They also removed overly apologetic phrases like "I apologize" or "I'm sorry."
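Here is a sketch of that kind of rule-based cleanup; the regex patterns and thresholds below are illustrative assumptions, not Meta's actual filters.

```python
import re

# Drop samples with overly apologetic phrasing or excessive emojis/exclamations.
APOLOGETIC = re.compile(r"\b(i apologize|i'm sorry|i am sorry)\b", re.IGNORECASE)
EMOJI = re.compile("[\U0001F300-\U0001FAFF]")  # rough emoji range, for illustration

def keep_sample(text: str, max_emojis: int = 3, max_exclaims: int = 3) -> bool:
    if APOLOGETIC.search(text):
        return False
    if len(EMOJI.findall(text)) > max_emojis:
        return False
    if text.count("!") > max_exclaims:
        return False
    return True

print(keep_sample("I'm sorry, but I can't help with that."))  # False
print(keep_sample("Here is the refactored function."))        # True
```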
Enhancing Llama 3.1's Coding Capabilities
Dalton Anderson: What I thought was pretty cool is that they made a coding agent that basically wrote code and then fed that data over to their model.
They created this kind of expert trainer: a code expert model that was continually trained on a large data set of code. And then that output was fed into Llama.
Dalton Anderson: One of the interesting pieces was that Llama was having issues with code in less common languages. Their solution was to translate code from those uncommon languages into common languages. That allowed the model to make the connection when you ask about these uncommon languages. It understands, okay, this is that, that is this, and then it can write the code for you.
Challenges in Mathematical Reasoning
Dalton Anderson: In here, they talk about mathematical reasoning. There is a big issue: a lack of prompts. There aren't many complex math prompts out there. For the mathematical problems, they converted pre-training data into a question-answer format. They used the model to generate a step-by-step solution for each of these prompts, which were then filtered and verified for correctness to create a high-quality training set. Then they had human feedback on the model and prompted it to correct its mistakes, improving its ability to learn from its errors.
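Here is a rough sketch of that generate-then-verify loop (essentially rejection sampling). `generate_solutions` is a hypothetical placeholder for sampling step-by-step answers from the model; it is not a real API, and the data schema is assumed for illustration.

```python
def generate_solutions(question: str, n: int = 4) -> list[dict]:
    # Placeholder: in practice this would sample n step-by-step answers
    # from the model, each ending in a final answer.
    return [{"steps": "2 + 2 = 4", "final_answer": "4"},
            {"steps": "2 + 2 = 5", "final_answer": "5"}]

def build_training_pairs(dataset):
    """dataset: list of {'question': ..., 'answer': ...} items (hypothetical schema)."""
    pairs = []
    for item in dataset:
        for cand in generate_solutions(item["question"]):
            # Keep only solutions whose final answer matches the known answer.
            if cand["final_answer"].strip() == item["answer"].strip():
                pairs.append({"prompt": item["question"], "response": cand["steps"]})
    return pairs

print(build_training_pairs([{"question": "What is 2 + 2?", "answer": "4"}]))
```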
Solving for Long-Context Handling
Dalton Anderson: The next challenge they had to solve was long-context handling. There is a lack of prompts here too, obviously. It's difficult to obtain high-quality long-context information with human annotation because it's super time-consuming. So for long context they relied heavily on synthetic generation. They did something similar to what they did for improving mathematical reasoning: question-answer pairs, plus summaries.
Ensuring Responsible AI Through Safety Testing
Dalton Anderson: Safety is something that's brought up a lot. These multimodal models are becoming very good in a short amount of time, so there's a large emphasis on safety, especially when you're releasing open source. There is a whole team of people at Meta doing uplift testing, which evaluates whether a new technology like this Llama 3.1 model could allow individuals or groups to perform tasks, particularly ones that could cause risk. They also do red teaming, which is a group of people within the company trying to break the model and make it do things it shouldn't be doing, like teach someone how to make a chemical weapon.
Dalton Anderson: Their overall findings were that the model didn't exhibit any significant vulnerabilities. However, there were some concerns identified from internal and external audits: insecure code generation, prompt injection, and phishing attacks.
Dalton Anderson: I hope that when you're thinking these things through yourself or discussing them with your peers, you have a better understanding of what these models are doing and how they are built under the hood.
Once you know how it's trained, then you know how to utilize it.
Dalton Anderson: If you're doing the same things it's trained to do, then you should get pretty good results. I think that's important, right? At the end of the day, we're all trying to save some time.
RESOURCES MENTIONED
- Llama 3.1 Research Paper: "The Llama 3 Herd of Models"
- Meta AI
- OpenAI
- Llama 2
- Brave Browser
- Nvidia H100 GPUs
- Meta's Grand Teton and Tioga Pass AI servers
- Tectonic Distributed File System
- MAST (Meta's Global-Scale Training Scheduler)
INDEX OF CONCEPTS
Llama 3.1, Meta, Meta AI Studio, open source, foundational model, The Llama 3 Herd of Models, OpenAI, parameters, tokens, 8B model, 70B model, 405B model, corpus, deduplication, MinHash, HTML scraper, unstructured data, model architecture, Llama 2, dense transformer architecture, self-attention mechanism, Mixture of Experts (MoE), scaling laws, inference speed, Group Query Attention, attention mask, auto-regressive decoder, annealing, learning rate, checkpoint averaging, FLOPs, Nvidia H100, Grand Teton servers, Tioga Pass servers, NVLink, MAST (Meta's Global-Scale Training Scheduler), Tectonic Distributed File System, petabytes, solid-state drives, post-training, data pruning, topic classification, quality scoring, synthetic data generation, code interpreter, multilingual training, mathematical reasoning, long-context handling, knowledge probing, safety, uplift testing, red teaming, insecure code generation, prompt injection, phishing attacks, Brave browser