Meta's Llama 3.1: Inside the AI Research Paper
TL;DR
Meta's Llama 3.1 paper reveals a focus on a stable transformer architecture over Mixture of Experts and massive in-house infrastructure to train its models on 15 trillion tokens. #VentureStep #AI #Llama3
INTRODUCTION
The release of powerful AI models like Meta's Llama 3.1 often focuses on their impressive capabilities, but what's happening under the hood? Understanding the architecture, training methodologies, and safety protocols is crucial for entrepreneurs and developers looking to leverage these tools effectively. The research papers accompanying these models are treasure troves of information, but their density can be a significant barrier to entry.
In this episode of VentureStep, host Dalton Anderson dives deep into Meta's research paper, "The Llama 3 Herd of Models," to demystify the complex processes behind their latest model. The discussion breaks down the first half of the paper, focusing on the model's architecture, the extensive pre-training and post-training processes, and the critical safety measures Meta implemented.
From choosing a stable transformer architecture to meticulously cleaning 15 trillion tokens of data, this analysis reveals the sheer scale and complexity of building a state-of-the-art large language model. It's a look at the deliberate decisions, the immense hardware infrastructure, and the iterative problem-solving required to push the boundaries of artificial intelligence.
KEY TAKEAWAYS
- Meta prioritized stability by choosing a dense transformer architecture over the more complex Mixture of Experts (MoE) model for Llama 3.1.
- The 405B foundational model was trained on 15 trillion tokens, which were meticulously cleaned using a multi-step deduplication process across URLs, documents, and domains.
- Meta developed extensive in-house infrastructure, including a custom job scheduler and file system, to manage the immense complexity and hardware challenges of training at scale.
- To enhance capabilities in niche areas like math and coding, Meta created specialized "expert" models to generate high-quality, synthetic training data for Llama 3.1.
- The model's safety was rigorously tested through internal red teaming and uplift testing to identify and mitigate risks like insecure code generation and prompt injection.
FULL CONVERSATION
Dalton Anderson: Welcome to VentureStep podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. We're continuing our series on Meta's release of Llama 3.1. Last week, we touched on Meta's AI Studio, where you can make an AI agent. Two weeks ago, we talked about the release of the models and what it means to have an open source foundational model in the market now.
Dalton Anderson: Today, we're going to be breaking down half of the research paper they published, "The Llama 3 Herd of Models." In this paper, they break down how they went about the architecture, how they figured out what size model they needed, and what tokens they should use: the spread of context, like long context, short context, coding, non-coding, multilingual. They also cover how they went about troubleshooting issues, including root-cause discovery of why their system was erroring out during training. And it is quite in-depth. It talks about the algorithms used for filtering and how they went about deduping data. It cites the other research papers they drew on to solve things; they cite OpenAI several times, they cite other groups, and it is quite extensive and quite dense. I read 46 pages, and that includes quite a bit of graphics, so probably around 26 to 30 pages of actual text with no images. And it took me about three hours because it is just so dense.
Dalton Anderson: Today's focus for this first half is going to be a quick review of the herd of models and what that is: the different models and their capabilities, and the foundational model, like what a foundational model is. Then we're going to be talking about pre-training and post-training. And then safety. Those are the three things that were touched on.
The Significance of Open Source AI
Dalton Anderson: I would like to emphasize that you should read the paper yourself. I'm definitely not an expert. I'm not a scientist. I wish I was sometimes. It is nice to have in your back pocket an understanding of all the work that goes into creating something like this. And then there's the thought of someone handing you all this information on a silver platter: a paper with links to the data they used, the papers they drew on for their analysis, their thought process, the way they troubleshot, the systems they had to create, and the analyses they ran.
What Meta is doing with open sourcing their research and their model is huge. Like it hasn't been done before at this scale. And it's super cool.
Dalton Anderson: And I'm curious how everything plays out, because it's a great model. It's on par with GPT-4. That being said, it's open source, and the licensing for this model is open to everything. It's a free-use license. That means you could use it for enterprise. You could copy the code, repackage it as Dalton's LLMs, and then I could make a business out of it. I could change the model just a little bit and make it mine. It's just super cool.
What Are the Llama 3.1 "Herd of Models"?
Dalton Anderson: So now we're transitioning over to the herd of models, which is the name of the paper. The herd of models is three models: the 8, the 70, and the 405 billion. Those are the parameter counts, so 8 billion, 70 billion, and 405 billion parameters. And parameters are like weights. Think about it as different scales. If you had a room with eight scales lined up, each measuring something different, they would all have different weights and importance to the final outcome. And the final outcome is the combination of all those different weights.
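To make the "scales" analogy a bit more concrete, here is a minimal sketch (not from the paper) of how parameters act as weights that combine inputs into a single output. The numbers are made up purely for illustration.

```python
import numpy as np

# Toy illustration of "parameters are weights": eight inputs (the "scales"),
# each with a learned weight, combined into one output. Llama 3.1 does this
# with billions of such weights arranged into transformer layers.
inputs = np.array([0.2, 1.5, 0.7, 3.1, 0.0, 2.2, 1.1, 0.4])            # eight measurements
weights = np.array([0.05, 0.30, 0.10, 0.20, 0.01, 0.15, 0.12, 0.07])   # eight parameters

output = np.dot(weights, inputs)  # the final outcome is the weighted combination
print(output)
```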
Dalton Anderson: The models are quite good. The 70 billion and the 8 billion are best in class, meaning they are better than the other models in their size range across the board. The 405 billion, Meta's first foundational model, wins or ties against the best models the majority of the time. It doesn't lose very much. I think it wins or ties like 70-plus percent of the time.
Building the 15 Trillion Token Training Set
Dalton Anderson: The foundational model, the 405, was trained on 15 trillion tokens. 15 trillion tokens is a lot, obviously. The way they went about creating a high-quality corpus is worth walking through; a corpus is just a collection of text, normally unstructured data. The 15 trillion tokens are the final set they trained on: they started with much more and cut it down.
How Data Deduplication and Cleaning Works
Dalton Anderson: They did a deduping process. They deduped by URL. They deduped at the document level, which uses MinHash. For line-by-line deduplication, their rule was to remove a line if it appeared more than six times in a bucket of 30 million documents, which is a lot of documents, and those are really tight parameters. Then they did domain-level deduplication, and they also removed domains with known adult content.
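Here is a simplified sketch of that multi-level deduplication idea, using only the Python standard library. The paper uses MinHash for document-level near-duplicate detection; this sketch substitutes exact hashing for brevity, and the `url`/`text` document schema is an assumption for illustration.

```python
import hashlib
from collections import Counter

def dedup(docs, line_bucket_threshold=6):
    """docs: list of dicts with 'url' and 'text' keys (hypothetical schema)."""
    seen_urls, seen_doc_hashes = set(), set()

    # Count how often each line appears across the whole bucket of documents.
    line_counts = Counter(
        line.strip() for d in docs for line in d["text"].splitlines() if line.strip()
    )

    kept = []
    for d in docs:
        # 1) URL-level dedup: keep only one copy per URL.
        if d["url"] in seen_urls:
            continue
        seen_urls.add(d["url"])

        # 2) Document-level dedup (exact hash stand-in for the paper's MinHash).
        doc_hash = hashlib.sha256(d["text"].encode()).hexdigest()
        if doc_hash in seen_doc_hashes:
            continue
        seen_doc_hashes.add(doc_hash)

        # 3) Line-level dedup: drop lines that occur more than the threshold
        #    (the paper uses >6 occurrences per bucket of ~30M documents).
        lines = [l for l in d["text"].splitlines()
                 if line_counts[l.strip()] <= line_bucket_threshold]
        kept.append({**d, "text": "\n".join(lines)})
    return kept
```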
Dalton Anderson: What they're talking about is that they scoured the internet. They built an HTML scraper that goes out onto the internet, scrapes documents and text and whatever else it can get a hold of, and then pulls all that information back to Meta. Meta runs it through these text extraction and cleaning methods, which take the unstructured data from the websites. Before you know it, you're six sections in from the about section, and then you repeat that same process for different sections of the website.
Extracting and organizing unstructured data is quite time consuming. It takes a lot of time to test, come back, and test again.
Dalton Anderson: It's basically just an iterative process where you do something, test it, that didn't work. Do something again, test it, that was somewhat good. Do something, test it, that looks great. All right, and then you move to the next thing. It's a repetitive process of test, repeat, test, repeat until you get to the result you're looking for.
Choosing an Architecture: Transformer vs. Mixture of Experts
Dalton Anderson: Now that we've talked about how they took out the information they didn't want, let's talk about the model architecture. The model architecture mirrors the Llama 2 architecture, which is a dense transformer architecture. A dense transformer is a relatively simple architecture built around a self-attention mechanism, which allows the model to weigh the importance of different words in the input when generating the output.
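As a rough illustration of self-attention (a toy single head with random weights, not Llama's actual implementation), the sketch below shows how each token's output becomes a weighted mix of every token's values, with the weights computed from query-key similarity.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention sketch (toy random weights)."""
    d = X.shape[-1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                     # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mix of the tokens' values

# Four "tokens", each an 8-dimensional embedding.
tokens = np.random.default_rng(1).normal(size=(4, 8))
print(self_attention(tokens).shape)  # (4, 8)
```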
Dalton Anderson: There's a different approach they could have taken, which is called Mixture of Experts, which is basically training each section of the model as an individual expert. You have a coding expert, a text expert, a text-to-speech expert, and then you combine all these different experts into one. Mixture of Experts is very powerful and efficient, but the issue with training large models is that it is difficult to scale up and it is prone to instability. Choosing the transformer architecture allowed Meta to focus more on the training and less on the instability of training a model of this size and scale, because it is already complicated enough.
They said, "Hey, this is a super complex thing that we're trying to do right now. And the easiest thing to do is to try to minimize the level of complexity and to do that, let's pick the transformer architecture." 86
Dalton Anderson: The transformer architecture is also the architecture that OpenAI has reported using. Overall, it just allows for less worry.
Using Scaling Laws to Determine Model Size
Dalton Anderson: How did they find out how big their model should be? The Meta team provided some documentation regarding the methodology they utilized: scaling laws. Scaling laws are a way for people or companies to understand the optimal size of a model given the available compute resources. It's like a formula you plug into: okay, I've got this amount of compute available and this is the training set that I have, what should I do? The scaling laws determined that their foundational model should be 405 billion parameters. For their 70 and 8 billion models, they trained them longer than they needed to, just to improve the models' inference speed.
So for their 70 and their 8, they trained them longer than they needed to and didn't follow the scaling laws, in order to make the models faster and better.
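For a back-of-the-envelope feel for scaling laws, here is a sketch using the common approximation that training compute is roughly 6 × parameters × tokens. Meta fits its own scaling laws in the paper, so these numbers are only illustrative.

```python
# Rough sketch of the scaling-law idea using the common C ≈ 6 * N * D
# approximation (compute ≈ 6 * parameters * training tokens).

def training_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

# Approximate compute for the 405B model on 15T tokens:
flops_405b = training_flops(405e9, 15e12)
print(f"{flops_405b:.2e} FLOPs")  # roughly the order of magnitude of a frontier-scale run

# The smaller models were trained well past their compute-optimal token count
# ("over-trained") to make them cheaper and faster at inference time.
flops_8b = training_flops(8e9, 15e12)
print(f"{flops_8b:.2e} FLOPs")
```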
Deconstructing the Training Data Mix
Dalton Anderson: They also used the 405 to help train the 70 and the 8 billion models. They did so with model-based filtering of the data, splitting out code and reasoning data and including multilingual data. Their final data mix is roughly 50% general-knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
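As a toy illustration of what a data mix like that might look like in practice, here is a sketch that samples training categories in those proportions. The category names and the sampling scheme are assumptions for illustration, not Meta's pipeline.

```python
import random

# Reported data mix: 50% general knowledge, 25% math/reasoning,
# 17% code, 8% multilingual.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def sample_category(rng=random):
    """Pick which bucket the next training document is drawn from."""
    categories, weights = zip(*DATA_MIX.items())
    return rng.choices(categories, weights=weights, k=1)[0]

print(sample_category())
```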
Understanding Tokens and Attention Mechanisms
Dalton Anderson: If you're not familiar with tokens, a token is a unique key that stands in for a piece of input. Think about a token as a number. The reason these models use tokens is that a model can't read text. If you feed a model "the," the model doesn't know what "the" is. So what you do is tokenize your unstructured data, and once you tokenize it, "the" gets turned into a number that the model can read.
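A toy example of that mapping, assuming a tiny made-up word-level vocabulary (Llama 3.1 actually uses a byte-pair-encoding tokenizer with a vocabulary of roughly 128K tokens):

```python
# Tiny word-level vocabulary, just to show the idea of mapping text to numbers.
vocab = {"the": 1, "llama": 2, "herd": 3, "of": 4, "models": 5, "<unk>": 0}

def tokenize(text: str) -> list[int]:
    """Map each word to its token id; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The herd of llama models"))  # [1, 3, 4, 2, 5]
```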
Dalton Anderson: They used grouped query attention, where groups of query heads share the same key and value heads. This allows for some computational benefits: it refines what's being attended to, and that refinement gives them either more computational efficiency or easier decoding. They also use an attention mask, which is used to improve long-context questions and answers. It prevents the model from attending across the different documents that are packed into the same training sequence.
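Here is a small sketch of that document-boundary masking idea: when several documents are packed into one training sequence, each token is only allowed to attend to earlier tokens from the same document. The implementation below is an illustration, not Meta's code.

```python
import numpy as np

# Which document each token in the packed sequence came from.
doc_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])

same_doc = doc_ids[:, None] == doc_ids[None, :]                    # block-diagonal structure
causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))  # only attend to earlier tokens
mask = same_doc & causal                                            # allowed attention positions

print(mask.astype(int))
```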
The Pre-Training Process: Annealing and Checkpointing
Dalton Anderson: The next step would be annealing, which is the final step of pre-training. Annealing is basically when they slowly starve out the model's learning rate. If you don't know what a learning rate is, it's how fast the model is allowed to learn from the training data. Basically, they reduce the learning rate gradually until it reaches zero. Annealing allows for subtle adjustments to the parameters based on a high-quality data set used for this phase. They would save checkpoints, including the parameters of the model, and then compare the changes made across the checkpoints of the training run.
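A minimal sketch of the annealing idea, assuming a simple linear decay of the learning rate to zero and plain checkpoint averaging; the exact schedule, peak learning rate, and averaging scheme Meta used are not reproduced here.

```python
import numpy as np

def annealed_lr(step, total_steps, peak_lr=1e-5):
    """Linearly decay the learning rate to zero by the final step."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

checkpoints = []          # parameter snapshots saved during annealing
params = np.zeros(4)      # stand-in for the model's parameters
for step in range(1, 1001):
    lr = annealed_lr(step, total_steps=1000)
    params -= lr * np.random.default_rng(step).normal(size=4)  # fake gradient step
    if step % 250 == 0:
        checkpoints.append(params.copy())   # "save the checkpoint"

averaged = np.mean(checkpoints, axis=0)      # checkpoint averaging
print(averaged)
```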
The Massive Compute Power Behind Llama 3.1
Dalton Anderson: The compute they used was 16,000 H100s, which is, I think, something like $600 million to $800 million worth of GPUs that they're utilizing for these AI workloads. The servers are Meta's Grand Teton and Tioga Pass AI servers. Each server contains eight GPUs and two CPUs. They have a job scheduler, also made by Meta, called MAST, Meta's global-scale training scheduler. They have their own file system as well: the Tectonic distributed file system, which apparently has a throughput of two to seven terabytes per second.
78% of the interruptions were from confirmed or suspected hardware issues.
Post-Training: Pruning and Refining the Data
Dalton Anderson: Post-training of Llama refers to the data pruning section, enhancing the code capabilities, multilingual expert training, and then the challenges of mathematical reasoning and the methodology to improve it. They had topic classification, quality scoring, difficulty scoring, and semantic deduplication. Quality scoring was a reward score that would rate the accuracy of instructions or answers to questions, and only higher-scoring examples would be fed through. One of the things they did was remove data that was excessive, like emojis and exclamation points. They also removed overly apologetic phrases like "I apologize" or "I'm sorry."
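Here is a sketch of that kind of rule-based cleanup; the regex patterns and thresholds below are illustrative assumptions, not Meta's actual filters.

```python
import re

# Drop samples with overly apologetic phrasing or excessive emojis/exclamations.
APOLOGETIC = re.compile(r"\b(i apologize|i'm sorry|i am sorry)\b", re.IGNORECASE)
EMOJI = re.compile("[\U0001F300-\U0001FAFF]")  # rough emoji range, for illustration

def keep_sample(text: str, max_emojis: int = 3, max_exclaims: int = 3) -> bool:
    if APOLOGETIC.search(text):
        return False
    if len(EMOJI.findall(text)) > max_emojis:
        return False
    if text.count("!") > max_exclaims:
        return False
    return True

print(keep_sample("I'm sorry, but I can't help with that."))  # False
print(keep_sample("Here is the refactored function."))        # True
```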
Enhancing Llama 3.1's Coding Capabilities
Dalton Anderson: What I thought was pretty cool is that they made a coding agent that basically wrote code and then fed that data over to their model.
They created this kind of expert trainer: a code expert model that was continually trained on a large data set of code. And then that output was fed into Llama.
Dalton Anderson: One of the interesting pieces was that Llama was having issues with code in less common languages. Their solution was to translate code from those uncommon languages into common languages. That allowed the model to make the connection when you ask about these uncommon languages. It understands, okay, this is that, that is this, and then it can write the code for you.
Challenges in Mathematical Reasoning
Dalton Anderson: In here, they talk about mathematical reasoning. There is a big issue: a lack of prompts. There aren't many complex math prompts out there. For the mathematical problems, they converted pre-training data into a question-answer format. They used the model to generate a step-by-step solution for each of these prompts, which were then filtered and verified for correctness to create a high-quality training set. Then they had human feedback on the model and prompted it to correct its mistakes, improving its ability to learn from its errors.
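Here is a rough sketch of that generate-then-verify loop (essentially rejection sampling). `generate_solutions` is a hypothetical placeholder for sampling step-by-step answers from the model; it is not a real API, and the data schema is assumed for illustration.

```python
def generate_solutions(question: str, n: int = 4) -> list[dict]:
    # Placeholder: in practice this would sample n step-by-step answers
    # from the model, each ending in a final answer.
    return [{"steps": "2 + 2 = 4", "final_answer": "4"},
            {"steps": "2 + 2 = 5", "final_answer": "5"}]

def build_training_pairs(dataset):
    """dataset: list of {'question': ..., 'answer': ...} items (hypothetical schema)."""
    pairs = []
    for item in dataset:
        for cand in generate_solutions(item["question"]):
            # Keep only solutions whose final answer matches the known answer.
            if cand["final_answer"].strip() == item["answer"].strip():
                pairs.append({"prompt": item["question"], "response": cand["steps"]})
    return pairs

print(build_training_pairs([{"question": "What is 2 + 2?", "answer": "4"}]))
```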
Solving for Long-Context Handling
Dalton Anderson: The next challenge they had to solve was long-context handling. There is a lack of prompts here too, obviously. It's difficult to obtain high-quality long-context information with human annotation because it's super time-consuming. So for long context they relied heavily on synthetic generation. They did something similar to what they did for improving mathematical reasoning: question-answer pairs, plus summaries.
Ensuring Responsible AI Through Safety Testing
Dalton Anderson: Safety is something that's brought up a lot. These multimodal models are becoming very good in a short amount of time, so there's a large emphasis on safety, especially when you're releasing open source. There is a whole team of people at Meta doing uplift testing, which evaluates whether a new technology like this Llama 3.1 model could allow individuals or groups to perform tasks, particularly ones that could cause risk. They also do red teaming, which is a group of people within the company trying to break the model and make it do things it shouldn't be doing, like teach someone how to make a chemical weapon.
Dalton Anderson: Their overall findings were that the model didn't exhibit any significant vulnerabilities. However, there were some concerns identified from internal and external audits: insecure code generation, prompt injection, and phishing attacks.
Dalton Anderson: I hope that when you're thinking these things through yourself or discussing them with your peers, you have a better understanding of what these models are doing and how they are built under the hood.
Once you know how it's trained, then you know how to utilize it.
Dalton Anderson: If you're doing the same things it's trained to do, then you should get pretty good results. I think that's important, right? At the end of the day, we're all trying to save some time.
RESOURCES MENTIONED
- Llama 3.1 Research Paper: "The Llama 3 Herd of Models"
- Meta AI
- OpenAI
- Llama 2
- Brave Browser
- Nvidia H100 GPUs
- Meta's Grand Teton and Tioga Pass AI servers
- Tectonic Distributed File System
- MAST (Meta's Global-Scale Training Scheduler)
INDEX OF CONCEPTS
Llama 3.1, Meta, Meta AI Studio, open source, foundational model, The Llama 3 Herd of Models, OpenAI, parameters, tokens, 8B model, 70B model, 405B model, corpus, deduplication, MinHash, HTML scraper, unstructured data, model architecture, Llama 2, dense transformer architecture, self-attention mechanism, Mixture of Experts (MoE), scaling laws, inference speed, Group Query Attention, attention mask, auto-regressive decoder, annealing, learning rate, checkpoint averaging, FLOPs, Nvidia H100, Grand Teton servers, Tioga Pass servers, NVLink, MAST (Meta's Global-Scale Training Scheduler), Tectonic Distributed File System, petabytes, solid-state drives, post-training, data pruning, topic classification, quality scoring, synthetic data generation, code interpreter, multilingual training, mathematical reasoning, long-context handling, knowledge probing, safety, uplift testing, red teaming, insecure code generation, prompt injection, phishing attacks, Brave browser