Interconnects

Nathan Lambert
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, an...

Available Episodes

5 of 77
  • Why reasoning models will generalize
    This post is early to accommodate some last minute travel on my end!

The new models trained to express extended chain of thought are going to generalize outside of their breakthrough domains of code and math. The “reasoning” process of language models that we use today is chain of thought reasoning. We ask the model to work step by step because it helps it manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other “reasoning” tasks. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.

Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but they draw on hours and hours of background processing as I converge on this argument.

Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.

Chain of thought reasoning allows information to be naturally processed in smaller chunks, letting the large, brute-force probability distribution work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.

Recurrence is required for reasoning, and it can happen either in the parameters or in the state-space. Chain of thought with transformers handles all of this in the state-space of the problem. The humans we look at as the most intelligent have embedded information directly in the parameters of our brains that we can draw on.

Here is the only assumption of this piece: chain of thought is a natural fit for language models to “reason,” and therefore one should be optimistic that training methods designed to enhance it will generalize to many domains. By the end of 2025 we should have ample evidence of this, given the pace of technological development.

If the analogy between types of intelligence isn’t convincing enough, a far more practical way to view the new style of training is as a method that teaches the model to allocate more compute to harder problems. If the skill is compute allocation, it is fundamental to the models handling a variety of tasks. Today’s reasoning models do not solve this perfectly, but they open the door to doing so precisely.

The nature of this coming generalization is not that these models are one size fits all, best in all cases: speed, intelligence, price, etc. There’s still no free lunch.
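To make the compute-allocation point concrete, here is a minimal sketch of the two prompting styles; the `generate` function is a hypothetical stand-in for any language model API, not a specific product:

```python
# Minimal sketch of "allocating more compute" via chain of thought.
# `generate` is a hypothetical stand-in for any language model API call.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a real model API here")

question = "A train leaves at 3:40pm and arrives at 6:15pm. How long is the trip?"

# Direct answer: the model must commit to the result in a handful of tokens.
direct_answer = generate(f"{question}\nAnswer with only the final result.")

# Chain of thought: the model spends extra tokens on intermediate steps,
# storing partial results in its context window rather than "in its head".
cot_answer = generate(
    f"{question}\nThink step by step, then give the final answer on the last line."
)
```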
A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
* Reasoning-trained models are superhuman on tasks in verifiable domains, like those with initial progress: code, math, etc.
* Reasoning-trained models are well ahead of existing autoregressive models in peak performance in many domains we would not expect, and ones that are not necessarily verifiable.
* Reasoning-trained models are still better in performance on the long tail of tasks, but worse in cost given the high inference costs of long context.

Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be “spikey” when it shows up, meaning that the capabilities and improvements will vary substantially across domains, but encountering this reality is very unintuitive.

Some evidence for generalization of reasoning models already exists.

OpenAI has already published multiple safety-oriented research projects with their new reasoning models in Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show their new methods can be translated to various safety domains, namely model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training: having a language model check how the safety policies apply to outputs.

An unsurprising quote from the deliberative alignment release related to generalization: “we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.”

Safety, qualitatively, is quite orthogonal to traditional reasoning problems. Safety is highly sensitive to the information provided and to subtle context, whereas math and coding problems are often about many small, forward processing steps toward a final goal. Many more behaviors will fit in between those.

This generative verifier for safety is not a ground-truth signal and could theoretically be subject to reward hacking, but that was avoided here. Generative verifiers will be crucial to expanding this training to countless domains; they’re easy to use and largely a new development. The field of LLM-as-a-judge (and related synthetic data pipelines) only really became stable with models at the level of GPT-4. Reasoning models trained as judges are a very natural fit because the exact token for a predicted reward or ranking is crucial, so CoT is essential. All of the progress here relies on continued progress on both generators and verifiers. o1 et al. were likely trained with mostly explicit code verifiers. They spawned far more powerful generators, which will enable new types of verifiers. Then we can train better models (and so on).
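To make the idea of a generative verifier concrete, here is a minimal sketch of an LLM-as-a-judge reward function; the `judge` stub, the prompt format, and the score parsing are illustrative assumptions, not any lab's actual pipeline:

```python
# Minimal sketch of a generative verifier (LLM-as-a-judge) producing a scalar
# reward for RL-style training. The judge model, prompt format, and parsing
# below are illustrative assumptions, not a specific lab's pipeline.

import re

def judge(prompt: str) -> str:
    raise NotImplementedError("call a strong (ideally reasoning-trained) model here")

def generative_reward(request: str, response: str) -> float:
    """Ask a judge model to reason about a response, then emit a 0-10 score."""
    judgment = judge(
        "You are grading an assistant's response against the request and the "
        "safety policy. Reason step by step, then end with 'SCORE: <0-10>'.\n\n"
        f"Request:\n{request}\n\nResponse:\n{response}"
    )
    match = re.search(r"SCORE:\s*(\d+)", judgment)
    # The few tokens that encode the score carry the entire training signal,
    # which is why letting the judge think first matters so much.
    return float(match.group(1)) / 10.0 if match else 0.0
```

The design choice worth noticing is that the judge reasons before emitting the score, which is exactly where reasoning-trained judges should shine.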
Onto another example of unexpected performance from the new reasoning-trained models. DeepSeek-R1, the new open-weight o1 replication, has been showing up at the top of many random benchmarks, above Claude 3.5 Sonnet, Gemini, and GPT-4o, and alongside o1. Examples include a creative writing and humor leaderboard and the brand-new, extremely challenging benchmark from the Center for AI Safety and Scale AI, Humanity’s Last Exam. Oh, and yes, it’s best on both accuracy and the new metric “calibration error,” which is designed to have the model express its own uncertainty.

Calibration is a long-sought behavior in traditional LMs, and it turns out reasoning training may help with it.

A lot of my friends find o1-pro to be clearly the most useful AI model in their daily workflows (one example here and a similar R1 example here). On ChatBotArena, the best “normal use” evaluation the AI community has, the new models (o1, Gemini-Thinking, and R1) rank among these organizations’ top models. These reasoning models are definitely absorbing the other lessons learned in post-training across the AI industry.

The explosion of R1 caused arguably the biggest general-awareness moment for AI since the original ChatGPT. DeepSeek’s app has been the number one overall free app in the U.S., and non-technical users are getting meaningful value out of seeing the reasoning process. What was a niche training process is bringing many more types of benefits than expected.

All of this is just on “day 1” of this technology. Reasoning models are going to proceed at a rate far, far faster than most expect.

These models will not be state-of-the-art in every domain, but probably in far more than you expect. Language models are a complex technology; they will never be one size fits all, but the ground is being reshaped under us.

Where the standard models match the reasoning models’ abilities, you’ll be paying far more for the same performance. At the same time, many domains are going to be open to “if you pay a little bit more, the reasoning model will get you a bit more performance,” which will accrue a lot of value over time.

These are trade-offs that many in the AI industry see at face value. Many ask where Anthropic’s reasoning model is, but they may never explicitly have one. Before o1 launched, Claude was already using extra tokens hidden from the user to improve the quality of responses. Anthropic CEO Dario Amodei commented on their approach in an interview with Joanna Stern of the WSJ recently:

To say a little about reasoning models, our perspective is a little different, which is that there’s been this whole idea of reasoning models and test-time compute as if they’re a totally different way of doing things. That’s not our perspective. We see it more as a continuous spectrum — the ability for models to think, reflect on their own thinking, and ultimately produce a result. If you use Sonnet 3.5, sometimes it already does that to some extent. But I think the change we’re going to see is a larger-scale use of reinforcement learning, and when you train the model with reinforcement learning, it starts to think and reflect more.

It’s not like reasoning or test-time compute — or whatever it’s called — is a totally new method. It’s more like an emergent property, a consequence of training the model in an outcome-based way at a larger scale. I think that will lead to something that continuously interpolates between reasoning and other tasks, fluidly combining reasoning with everything else models do.

As you’ve said, we’ve often focused on making sure using the model is a smooth experience, allowing people to get the most out of it. I think with reasoning models, we may take a similar approach and do something different from what others are doing.

The newest Claude 3.5 Sonnet models are very likely already trained to some extent with RL on verifiable outcomes. Just days before o1 was launched, Claude’s behavior of “I’m thinking about that” was the biggest indicator we had of consumer companies trading more compute for better responses.
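For concreteness, “RL on verifiable outcomes” bottoms out in reward functions of roughly this shape; the final answer-line convention below is an assumption for illustration, not any lab’s actual format:

```python
# Minimal sketch of a "verifiable outcome" reward: exact-match checking against
# a known answer, the kind of signal math and code RL training relies on.
# The final "ANSWER: ..." line convention is an assumption for illustration.

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer line matches the reference, else 0.0."""
    lines = model_output.strip().splitlines()
    if not lines:
        return 0.0
    predicted = lines[-1].removeprefix("ANSWER:").strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# A correct completion earns reward 1.0; anything else earns 0.0.
assert verifiable_reward("Let me work through it...\nANSWER: 42", "42") == 1.0
```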
Anthropic hasn’t shifted their strategy here, and you can decide how much weight you want to put on their CEO’s recent comments.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

The techniques are here to stay, and it took revolutionary new models to show us that. Like many new technologies, we needed to be shown what was possible, and then it can be folded back into normal experience. o1 was this breakthrough, and the benefits of reasoning training will now expand out into all of the AI products we are using day and night.

To end, I leave you with a quote from the DeepSeek R1 paper, where the authors reflect on their experience with the model(s):

“One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.”

Thanks to Ross Taylor and Hamish Ivison for discussions that helped inspire this post.

Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    11:37
  • Interviewing OLMo 2 leads: Open secrets of training language models
    We're here to share the story of building our Open Language Models (OLMos) and what we improved to build the OLMo 2 7B/13B models, which are competitive with the Llama 3.1 8B model. This is all about building an effective, small language modeling team that can share all it learns with the scientific community. Dirk, Luca, and Kyle are some of the people I learn the most from, and they have more knowledge (and entertainment) to share than we have time for. Some questions were pulled from Twitter, but please comment or get in touch if you want us to cover anything in future episode(s)!

Main topics:
* Pretraining efficiency and our quest for stability after a not-so-secret failed 70B run early in 2024,
* What the role of OLMo is in the broader AI landscape and how that is, or is not, changing,
* Many little decisions that go into building language models and their teams (with a focus on NOT post-training, given I already talk about that a ton).

Play with the models we build here: playground.allenai.org/

For more history of open language models (OLMos) on Interconnects, see my first post on OLMo, my coverage of OLMoE, OLMo 2, and why I build open language models. If you have more questions or requests, please let us know (especially the researchers out there) and this can be one of N, rather than a one-off celebration.

Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

Contacts
* Dirk Groeneveld - https://x.com/mechanicaldirk // https://bsky.app/profile/mechanicaldirk.bsky.social
* Kyle Lo - https://x.com/kylelostat // https://bsky.app/profile/kylelo.bsky.social
* Luca Soldaini - https://twitter.com/soldni // https://bsky.app/profile/soldaini.net
* General OLMo contact - [email protected]

Links / models / codebases discussed
* OLMo 2 paper
* OLMo 1 paper
* OPT models and talk from Susan Zhang
* BLOOM
* Red Pajama V1 Dataset
* Falcon LLM
* C4: Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
* Maximal Update Parametrization (muP), from Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
* Spike No More: Stabilizing the Pre-training of Large Language Models
* LLM360: Towards Fully Transparent Open-Source LLMs (the Amber model)
* EfficientNet
* MegaBlocks
* A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Kyle said Hitchhikers)
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

Chapters
Here is a list of major topics covered in the podcast, with timestamps for when the discussion starts:
* [00:00:00] Introduction
* [00:02:45] Early history of the OLMo project
* [00:15:27] The journey to stability
* [00:25:00] The evolving role of OLMo and pretraining research
* [00:29:00] Pretraining Q&A (µP, scaling laws, MoE, etc.)
* [00:40:40] How to think about pretraining data work
* [00:54:30] Role of pre-training vs mid-training vs post-training
* [01:02:19] Release strategy and wrapping up

Transcript
This is generated by AI and lightly edited for clarity. In particular, the per-speaker attribution was poor this time around.

Nathan Lambert [00:00:07]: Hey, welcome back to Interconnects. In this interview, we're bringing one that I've hinted at for a while, which is interviewing some of the other leads on the OLMo team at AI2.
So essentially, this covers the story of OLMo from its early days where we got our compute, kind of our path to stability and some failed runs along the way, the role of OLMo and the broader AI ecosystem, and really just a very long tale of technical details and decision making and considerations that you have when actually training language models that you're trying to have at the frontier of performance relative to peers like Llama, etc. This is a fun one. There's less post-training than normal because this is me interviewing some other co-leads at the Allen Institute for AI. So there's three people in addition to me, which is Dirk Groeneveld, who is the lead of training, handles most of engineering, Kyle Lo, and Luca Soldaini, who are the data leads. So we have a pre-training engineering lead and two data leads with me who has done a lot of the post-training. This is just a part of the team. And I hope you enjoy this one. We can do more of these and bear with the fact that I'm still expanding my podcasting tech equipment. But I think the audio is definitely good enough and enjoy this episode with me, Kyle, Dirk, and Luca.Hey, everyone. Welcome to the AI2 office. We're finally talking more about some of our OLMo things. Too much work to do to actually get all the... the information we want to share out into the world. So I'm here with Dirk, Kyle, and Luca. We can also talk so people identify your voices so people are not all on video. Hi, I'm Dirk.Dirk Groeneveld [00:02:01]: I am the lead of the pre-training part of OLMo.Kyle Lo: Hi, I'm Kyle. I work on data.Luca Soldaini [00:02:08]: Hello, Luca. Also work on data with Kyle.Nathan Lambert [00:02:13]: Okay, so we're kind of going to maybe go through some of the story of OLMo to start. And then just get into as many nerdy details until we get tired of OLMo 2. Which, in my state, this will probably be mostly about pre-training. You can ask me post-training questions as well. But I'm not going to sit here and be like, ask myself questions that I'm not going to answer. Because that is an absolutely ridiculous thing. You can ask me one question. Okay. One question. It's like, why shouldn't you post-training with all the compute?Nathan Lambert [00:02:45]: But I wasn't here for when OLMo actually started. So I think it'd be good to tell people, I mean, like, broadly what AI2 was like at the time, what language modeling was like at the time, what it may or may not have been risky.Kyle Lo [00:03:01]: Yeah, you should probably get this.Dirk Groeneveld [00:03:03]: Yeah, I think it all started in the fall of 2022.Dirk Groeneveld [00:03:10]: We were talking to AMD at the time about some sort of collaboration. We're scoping out some stuff. And at the time, we wanted to take the Bloom model. And put 300 billion extra tokens in. And we wrote up a proposal and we sent it to AMD and it disappeared into a black hole. And we never heard from them again. And then ChatGPT came out a couple months after that. And suddenly everybody was very excited. And two, maybe one month after that, AMD came back to us and said, now let's do it. And that kicked off a very busy period for us. At least the three of us were involved at the time. Plus some of us. Some more people trying to scope out exactly what the project would be. Putting 300 billion tokens into Bloom wasn't that cool anymore. The field had moved on. So we needed to find something else that would work both for us and for AMD.Dirk Groeneveld [00:04:07]: And that's exactly what we did. We figured it out. 
We figured out who would be on the team, how exactly to do it. We had to get the data from all of that stuff and then started working on it.Luca Soldaini [00:04:16]: I think it was, let's look it up. And the official birthday of all of us. Almost is February 2nd, 2023. That's when we had like a big sort of half day. Summit workshop and a bunch of researchers self-organized a long discussion. I'm foreseeing maybe like 40, 50 of us try to scope down a potential language model project at AI2.Kyle Lo [00:04:48]: Yeah, it was also extremely bottom. Up because we were all like, nobody, it was not on anyone's radar. We were working on, everyone's working on different projects that we had promised for the end of the year. This was very much just like a side gig for us. We had no compute other than this mysterious AMD GPUs that just came. It was like, oh, it's possible. And everyone was just like, yeah, I'll work on this on the side. Let's just start hacking together some stuff.Nathan Lambert [00:05:14]: How far along the line until you decided on 7B? Like, were these things obvious at the time?Luca Soldaini [00:05:20]: I think the size of it. This is where Llama's size was. Yeah, we started with seven because seven was the smallest Llama size. This was Llama one. Yeah, Llama one was like first couple months of 2023. Yeah, we started, we started scoping before Llama one. And then when Llama one came out, it made sense to have a configuration that was just sort of close to what they were doing. So it's not too much reinventing. I think seven was.Dirk Groeneveld [00:05:52]: Yeah, I mean, I think the original scope was recreate Llama one, which would be a 7B at 1.4 million tokens. What were we staring at? OPT.Kyle Lo [00:06:03]: We were staring at OPT also, right? During around that time.Dirk Groeneveld [00:06:07]: For inspiration. Yeah. And for what not to do in many cases. Was OPT even like in the many tokens regime or was that still like when people did the booms and booms?Luca Soldaini [00:06:18]: I think OPT and booms were.Luca Soldaini [00:06:22]: They were not, they were not over trained at the end were both a scope to Chinchilla that they both had extensive logs and so they were very useful because both of them have hundreds of pages of like, whatever can go wrong during pre-training. Yeah. I mean, OPT was amazing as a resource for figuring out, you know, we knew nothing, so we needed to know what's important. And yeah, I remember there's also avoidance and so on. There's that. It's like Susan has this talk.Dirk Groeneveld: I'll come load parallels of training OPT and yeah, I think the original ones, I always feel it's kind of a shame because the OPT models are not very good, but, but they were first, like they figured all that stuff out for the first time. I have huge amounts of respect for that.Nathan Lambert [00:07:11]: And what's the like open source angle thing at the time, or like, had you already identified that there was no open pre-trained data sets for these models?Kyle Lo There definitely wasn't any open pre-trained data sets. I think we were basically looking at. The gopher paper that had most documentation and then Llama one had enough documentation about what data sources were using, where we were like, okay, let's try to reconstruct what it was. 
And I think roughly around the same time, Red Pajama V1 and then shortly after it was like Falcon, Falcon, the first Falcon, we were all kind of concurrent works at the time, but basically starting from, I don't know, Grab, Common Crawl, grab a bunch of sources to try our best.Luca Soldaini [00:07:50]: The funny thing, like we had conversation of like. Like, uh, there was like, boy, it would be good if we didn't have to do the data. This would be one fewer thing to do, but at the time, like even when, uh, Falcon dropped, they released like a small preview that wouldn't match like the token budget that we wanted for a training run. So it was not even like, you know, it was good work and like, oh, maybe we just switched to this one. And then we quickly arise, not, not big enough for the two trillion. So I think it was like, maybe. Yeah. Yeah.Dirk Groeneveld [00:08:22]: I mean, we did the C4 data set way before any of this. Um, and so my first idea for how to do data was to just run C4, but on all the Common Crawl, um, instead of just whatever the most recent one was at the time. And I actually started writing a repo for that, but then ended up not doing it. This is the C5 repo. Yeah.Nathan Lambert This was C4's side of data cleaning practices.Dirk Groeneveld Yes. That's exactly a re-implementation of C4. And, um, for it to touch it, we'd run on slightly different hardware, um, with more dApps and that was, that was going to be the entire story until we found we could do better.Nathan Lambert Yeah. And, um, for general timelining, I joined pretty much like almost 7B was, I think mostly done training or wrapping up pre-training and the like instruction tuning at the time was like basic SFT with a sprinkle of DPO. Yeah. So I think a lot of that story gets cut. Compressed. Like I'm guessing the actual pre-training happened in like the second half of the year, mostly. So it's a lot of prep to get a language modeling system to exist. Yeah.Luca Soldaini [00:09:32]: I think we handed off the one of Dolma. So the data set that we used for pre-training is like end of June, I think, 2023. Grab Common Crawl, end of March. Yeah. So all the source acquisition was March, April. Let's see March and then yeah, a few months. There.Nathan Lambert [00:09:52]: Um, if someone wants to do the same thing today, which is like, we should train a language model, how much faster would it be to like, is OLMo actually making that much of like, would it be a week with OLMo stuff now, or would it still take a lot of time to set this up?Luca Soldaini [00:10:07]: I think if, if you want to, um, if you want to train exactly on OLMo data, you know, data, it's much faster, um, training, I think it requires a little bit more finesse and dirt. Yeah.Dirk Groeneveld [00:10:23]: If someone gives you a cluster to, to run on, just figuring out the mechanics of getting your thing to run, just so setting all the environment variables and having the drivers loaded and so on, it might take you a week or so if you're, if you've done that kind of thing before. Um, so that's very different, but you can take a trainer that already works and just, just use it.Luca Soldaini [00:10:45]: Um, it really depends like where, where you start. It's like, if, if you're spinning up your cluster from. Scratch, then you acquired a hardware, then that hardware has burning periods. So the first three months stuff will fail and that has nothing to do with the model itself. It's just, your hardware is also brand new.Dirk Groeneveld [00:11:06]: Yeah. 
I mean, I am eternally grateful for AMD for giving us the compute to get started, but it was kind of difficult to run on.Nathan Lambert What was the exact amount of compute? Like, I think when I arrived, that wasn't even what we're using where it's like Lumi discussions and the original amount.Dirk Groeneveld Of compute was, uh, 2 million hours on Lumi.Nathan Lambert So, so 2 million GPU hours.Dirk Groeneveld [00:11:29]: Um, that's we're training way bigger now than that. Yeah. So I think I did the math recently. It's like the order of a million hours is if you do a thousand GPUs concurrently, like 20 days. Uh, I don't have that math in the top of my head, but, um, the first, the first end to end run for the 7B that we did took, uh, 35 days. We can now train that same. Model again in three days. So things have changed a lot since then. Yeah.Luca Soldaini [00:11:58]: Well, some rough, rough stats for almost two anyways, seven and 13, just the final ones, um, was a little bit over 5 million GPU hours combined. And then we have roughly 5 million hours worth of experiments.Dirk Groeneveld [00:12:15]: Um, these are, uh, A100, H100. Might be surprised. Oh, it's the case too high or too bad to do some, it's way too high.Luca Soldaini [00:12:33]: Um, it's like, how do you encamber overhead then?Dirk Groeneveld Oh, combined.Luca Soldaini [00:12:36]: It's some of them plus the ultimate training. They're also not using the new one core quickly.Dirk Groeneveld [00:12:42]: So, yeah, but I'm just thinking if it's, let's say conservatively 7,000 tokens per second, four months on a thousand. Do you think it's less than that?Nathan Lambert Like, okay, let's just go and track those number down. I think it's interesting. It's like, what percentage, what is the percentage of improvements still? Like how much of all the two being better is just by the compute being more stable just by doing more experiments. And that lets you test things like stability and just get the ducks in a row rather than like the data being so much better. It's an impossible question.Luca Soldaini [00:13:20]: It's that it was like. And, you know, the trigger part with using that AMD hardware at the time, specifically that cluster, was that cluster was being brought up online at the same time as we were experimenting with it. So we were helping that cluster being set up. So it's because of that, there's a lot of things where we had to second guess ourselves, whether that was an issue on our side, the hardware side.Nathan Lambert [00:13:58]: Isn't this always going to be an issue with new GPUs coming into the world? Does Microsoft plug in opening eyes GPUs and they just work?Luca Soldaini [00:14:06]: I think it was, yeah, it's always tricky. It's a combination of like getting both new GPUs. At the time, AMD was a relatively new vendor, plus the cluster itself being new. So it's like stacking, you know, risky, risky things on top of each other in a way that it's like, oh, if you can, if your cluster is solid, that, you know, the GPUs are brand new. Well, the network is not going to cause issues, but if the cluster is new and the GPUs are new, who knows where the problem sits. Yeah.Nathan Lambert [00:14:44]: We'll go down the... Yeah. We'll go down the whole stability round the hole. Dirk, how close are you to a number?Dirk Groeneveld Five trillion tokens at 7,000 tokens per second, which is what we get for the 7 billion, more or less, over the long run, is only about 200,000 hours on each one. 
So our first estimate was way off. Luca Soldaini [00:15:05]: It was... Check the top. I think maybe my memory was wrong. Maybe my thing was... This is why I have this laptop here. Luca Soldaini [00:15:18]: Oh, no, I was misremembering. Okay. The number is 500K. I remember flying... 500K. Yeah, yeah, yeah. Nathan Lambert [00:15:27]: So it's like from the first AMD grant of a few million GPU hours on AMD to what we have today. It's like it's gone from multiple million AMD hours to training a model over five times the tokens in half the GPU hours. That's right. Yeah. Like, where do we... Dirk Groeneveld I mean, the biggest one is that the MI250, like, the MI250 is the AMD GPU that Lumi has, and it's of the A100 era. It's comparable to an A100 in price and capacity. But now we train on H100s, and they're just... Nathan Lambert What percentage of tokens... It's just a newer GPU. Yeah, what percentage of tokens in OLMo 1 code versus OLMo 2 code are lost at, like, a 7B, so a scale that we're reliable on? What percentage of tokens in OLMo 1 code versus OLMo 2 code are lost to spikes? Dirk Groeneveld I think it was OLMo 1 losing a considerable amount to the spikes. That's impossible to estimate, because there are so many other differences at the same time between OLMo 1 and OLMo 2. Nathan Lambert Can you summarize the architecture differences? There's a list in the paper. We don't have to be exhaustive. Dirk Groeneveld That's going to be a lot of stuff. The biggest difference is the init. So I guess now we're getting into what did we actually discover? Nathan Lambert These are some audience questions. OLMo 1 and OLMo 2. Finbar, who you might know, specifically asked, like, how did you arrive at an init of N(0, 0.02)? I'm like, I don't know. Dirk Groeneveld That particular init is the default in Megatron. And the init that we had in OLMo 1 was just trying to be too clever. We stole that init from OpenLM, and they took it from somewhere else, actually. And I don't remember what the original source is. Nathan Lambert What is the actual decision-making on an init that's too clever? You, like, think that you can get a better learning regime by fiddling with something? Dirk Groeneveld We tried it. We ran it for, you know, 100 billion, 200 billion tokens, and we looked at which one is better. And scaled init is absolutely better for a long time. So scaled init is the original. It's the OLMo 1 init. Works better for a long time. You have to train for a really long time before you see it come apart. You have to go about 2 trillion tokens for a 7B model, and then things get a little bit dicey. So this is why, you know, this is why we used it for OLMo 1, because it looks quite good for a long time.
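[For reference, the two init schemes being compared look roughly like this in PyTorch; `default_init` is the N(0, 0.02) Megatron-style default, while `scaled_init` is a generic width- and depth-scaled variant, not OLMo 1's exact formula.]

```python
# Generic sketch of the two weight-init schemes discussed here, not OLMo code.
import math
import torch.nn as nn

def default_init(linear: nn.Linear) -> None:
    """Megatron-style default: every weight drawn from N(0, 0.02)."""
    nn.init.normal_(linear.weight, mean=0.0, std=0.02)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

def scaled_init(linear: nn.Linear, n_layers: int) -> None:
    """A 'scaled' variant: std shrinks with fan-in and with depth.

    OLMo 1's exact formula differs; this only shows the shape of the idea.
    """
    std = (1.0 / math.sqrt(linear.in_features)) / math.sqrt(2 * n_layers)
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```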
Nathan Lambert With which of our OLMo models did we figure out that the init had to change? Dirk Groeneveld Because we did a few through the year. We tried that same init with the 70B model, and that did not work. That model stalled out around 1.3 trillion, 1.4 trillion tokens, something like that, which gets at the heart of the stability. Dirk Groeneveld [00:18:12]: So we started to think about the stability investigation. I think that was one of the audience questions, right? How do we even go about the stability investigation, starting from the point of: we're training the 70B and it's not working anymore, what did we do? The first step was to identify the issues that we see in the metrics and see them in a smaller model. And the two issues we saw were lots of spikes, what we call fast spikes. The models recover quickly, but they just happen more and more the longer you keep training. And at some point, even the fast spikes kill you. And the other thing was a growth in GradNorm. It seemed very much that the 70B would always start blowing up once the GradNorm got to 0.4. Regardless of what intervention we did, it would get a little bit further, and then as soon as we hit 0.4 GradNorm, it would blow up again. Nathan Lambert So you lowered the learning rate and it blew up again. Dirk Groeneveld So fortunately, yeah. Yeah. So we would do things like that, increase the batch size, change the weight decay, blah, blah, blah, but quickly it gets back to 0.4 and then blows up again. Fortunately, both of those phenomena also appear at the 7B; even though the 7B trains fine, it has both of those traits. So we decided to focus on those two, because it's too expensive to try all these experiments at 70B, but these are two things we could fix at 7B and then see how it goes. So that was, that was the first step. But now we have a metric where we can pretty quickly, within 12 hours or so, do a run, find out if our numbers are better, and then change something and do it again. And the second component was we took another model that successfully trained, that didn't show these issues, that didn't show the slow GradNorm growth and didn't show the spikes either. And we ablated against that. That was the LLM360 Amber model. They're, like, all very open. So we could take their data. We could take their setup and look at it in great detail. Dirk Groeneveld [00:20:22]: And we basically tried things one by one, sometimes two by two or so, to not run too many ablations. But we tried things until we got to a stable setup. There were some other insights at the time. I was really into the Spike No More paper, which is all about the magnitude of the embeddings, so we tried some stuff there. Dirk Groeneveld [00:20:48]: Pete Walsh on our team tried some other stuff involving AdamW settings that made things even better. And then we took a lot of inspiration from the Chameleon models, because we were talking to that team on a semi-regular basis and they had a lot of stability issues. They found some solutions that we also tried, and some of them worked for us and some of them didn't. And we took the ones that worked for us. So it's always ablating at the 7B scale until our numbers look super smooth and super nice. Nathan Lambert How specific do you think these are to our setup? Are these all OLMo-specific insights, or is it just kind of a process you have to walk down? We've heard some of these things before. It's like, for all these developments, you have to do the previous type of thing before you can go bigger or do a more complicated model. Do you think that's actually true, or are there just best configurations at the time? Dirk Groeneveld I really don't know the answer to that. It's hard. But something I want to know, something I want to do for OLMo 3, is walk back a few of these things and see in retrospect which ones are actually necessary. And in particular, I'm hoping that some of those are not necessary and they're costing a bit of performance, you know, just to boost our own efficiency a little bit. Luca Soldaini [00:21:54]: In general, I don't know, you can tell me if there's a useful summary, but it seems like the space of intervention you can take is so big.
And other model, they're not going to translate perfectly, but the hit rate to like find a good solution is higher if you start from that model and you explore around it versus like try to explore like the full space of possible solutions. Yeah. And then some things will not pan out once you try to rerun them on your setup. And I don't think that's an indication of like necessary . Yeah. You know, we can mistakenly reimplement their thing, not in the way they're supposed to be. It's more like some things translate, some things don't. But it's a good starting point.Dirk Groeneveld [00:22:55]: Yeah. I mean, we are a fairly conservative bunch with this, right? Because even the 7B runs are actually kind of expensive. So make small changes from a known baseline by and large. Yeah. I mean, everyone has.Nathan Lambert Yeah. And risk is pretty obvious when you look at the cost numbers and like who you are trying to beat or not. And it's like we are trying to try to plot or people can build on it. And it's much better to keep making small progress than it is to go for glory runs and just hope that works. I think both works. The more compute you have, you can have a bigger distribution of investments, but it's not that surprising.Dirk Groeneveld I mean, I hope that we can be a lab that is a little bit more risk tolerant than others. For one thing, we don't have Meta's resources. So we should be a little bit more aggressive. You know, it would make me much more nervous if I had to bet a billion dollars on our next run than the amounts that we can bet. So we can try a little bit more. I also feel and I hope that our management agrees with this. I feel that if we always, if we're always safe, if every one of our runs works. That means we're not trying hard enough, right? We have to occasionally crash and burn.Nathan Lambert I think there's a few every year that you should crash and burn. I think these crash and burns at the big scale get a lot of attention from media and stuff. But it's like, what do you expect them to do? If they haven't, you're walking up a line and might as well try to take three steps at once every so often. Exactly. But I do agree. I think that's a cultural thing that we're trying to navigate. It's like, how do we do more interesting stuff and not just fall into the trap of being the best? Open model. No one else is doing this. Like, okay, you could do that for a while, but it's not as motivating.Dirk Groeneveld And it's not just because it's more interesting to do that, but also just the fastest way to make a better model. The fastest way to calibrate your risk tolerance properly. You have to sometimes be over. Yeah. It's inevitable.Nathan Lambert [00:25:05]: Any follow ups on risk?Kyle Lo Yeah. I'm thinking now it's like, because the 70B crash was so sad. Yeah. And I'm wondering if you look back on it now, it's like, that was the greatest thing for us. We learned so much from that.Dirk Groeneveld [00:25:19]: It was very important to love too. I do a little bit. So, I mean, we felt terrible, right? Like this was an awful time for us. I was like, I'm done. Let's get good questions. No, we were the training team that couldn't train at all. I felt so bad. But the work we did following up is some of the proudest I've been about the stuff I've done in my time at AI2. Yeah.Luca Soldaini [00:25:47]: In general, my thing about the role of OLMo sort of keeps evolving, right? It was very natural to have OLMo as these models designed to help others do research and language models. 
That's how we initially, it was a big part of OLMo 1. You just release all the components because it's important to have these tools available to everyone. To study language models. And I think we serve that community well. One thing that it's, I hope we can do with OLMo more is that there are like some interesting aspects of language models. Interesting capability, interesting architectural decisions that for a myriad of reasons, they sort of get overlooked in like say a company or like in a framework where, you know, you have certain constraints in your model. But it's still there. They are important. And there are questions around like what a model should be able to do, how it should operate, and things like that. But I think we can take a role where like we have in general this recipe that both enables research and language model and for like subset of model capabilities that we think are fundamental. No one is touching. It's our space to do work there. I think the prime example that I keep repeating these days is what we did with MOLMo andLuca Soldaini [00:27:25]: vision team was mostly working on it. And MOLMo is very good vision language model in general. It benchmarks up there. It's not the best, but it benchmarks up there with open models. And then it has this like this interesting point. Pointing capability that no other vision language model has. And that pointing capability is, turns out, is fundamental for a lot of language models and robotics that you want to build. It's a core capability the same way that a text model should have long context. And it was cool to make, to sort of emphasize that of like, oh, we have the specific capabilities that would enable all these applications. And so more people should work on like the specific aspects. So I think that's a cool way to like work on things that folks haven't had a chance to touch on yet.Nathan Lambert [00:28:24]: I think it's like trying to parse out why this type of situation could happen is not easy. Because you generally, everybody would want to do this. Like everybody wants to come up with a new capability that expands the scope of what X type of AI model can do. And I think it's most of like probably goes down to the culture of where people have space. To think about stuff in a more interesting way. It's like, because obviously everyone wants to have breakthroughs and open AI and Anthropic that copy. But it's like sitting at a boundary between doing just the same stuff and doing more researchy stuff that you need to have. I have more architecture questions. One is MUP. Multiple people are asking about it. I still don't really intuitively know what it is. But are we going to use this?Dirk Groeneveld We have done a fair bit of work into it. And it hasn't worked for us yet.Nathan Lambert Can you explain what it is?Dirk Groeneveld MUP is mainly a way of setting the learning rate, but also some other hyperparameters. By training only small models and then having a guarantee or at least a pretty good idea that it will work also for larger models.Dirk Groeneveld [00:29:33]: We have implemented this. We've experimented with it. So far in our setup, it works across model sizes. So the learning rate that it predicts you should use, it doesn't predict the learning. It just gives you one learning rate. Basically, the good learning rate for the small model is also the good learning rate for the big model. That works if we change the size of the model. It does not so far work if we change the length of the training run. 
And that's why we haven't been using it so far.Like number of tokens.Yeah. Or longer. If we double the length of the training run or we 10x the length of the training run, the optimal learning rate is different in our setup.Dirk Groeneveld [00:30:21]: It seems like this might be a bug. It should work, but it doesn't.Nathan Lambert And the positive gain is just that better scaling because you don't have to fiddle with the certain. You know you're getting the right learning rate, which is a crucial hyperparameter.Dirk Groeneveld Yeah. It's just a better way of setting learning rate. And it works for a few other hyperparameters too.Nathan Lambert But there are other open models that use this. Explicitly. Pretty sure. I mean, open weights model. Yeah. Those are linking. Like Llama and stuff using this. Llama does not, I think. But I don't know for sure. We'll always see with the next iteration. Even Llama3 felt like they were still building their org and their infrastructure so fast. It's just like get in what you can get in and there will be more models in the future.Dirk Groeneveld Yeah. I mean, MUP is a shortcut, right? Like you can for many settings where MUP wouldn't work. Or you have to just establish scaling laws and predict what it will be. You could do the same thing for the learning rate. Just MUP lets you do this with even fewer runs. You know, you don't even have to extrapolate anything anymore. You just use MUP and your setting will work. That's the idea.Dirk Groeneveld [00:31:29]: But you kind of already need a scaling law set up anyways for things that MUP doesn't work for. You know, like architecture changes and so on. Yeah. So in that sense, it's not that important. It's still pretty important. And we're going to keep trying to make it work for us. Maybe just find the bug. But it's not absolutely critical.Nathan Lambert How does scaling laws actually tell you the way to change like the width? Do they actually tell you the change in width or the depth, like the proportions of the network relative to the size? Like what are the actual output variables? Or how are you controlling the architecture you're going to use in the scaling laws? Well, like I know what it's trying to predict, the accuracy, but are they on set architecture things?Dirk Groeneveld You would usually vary one thing.Dirk Groeneveld [00:32:17]: Like you don't vary anything. You establish how it scales with size. Yeah. And you set your size according to a certain formula. Like you might say, I will go 1.4x the depth and 1.4x the width. So I have roughly 2000 pixels. That's a bigger model. And you do that a few times and you draw it on a graph. Then you change your architecture. You do it again. You draw a different graph. You lay them over each other and you hope that the lines don't cross. And one of them is clearly better than the other.Nathan Lambert Yeah. I definitely have known that there's some, it's like one of the obvious things architecture design and the not obvious things. It's like you obviously make the model bigger, but the subtlety of like how tall versus wide. I think we're talking about like a client that's like much deeper than ours, our model architectures. And it's just like, I'm around these things and I don't have an intuition for if tall or wide is better. And I think it's like what works.Dirk Groeneveld There are some early results from Google, I think. I think they're called efficient net or something. That suggests that over a wide range, it doesn't matter whether you go wide or deep. 
It's not that surprising. That's pretty old results now. We're following up on a particular result right now. Actually, so OLMo 2 is a 7 and a 13, right? But there also was a 1 that didn't work very well. And we're trying to find out why. And one thing about that model was it was pretty wide and not very deep. So we're checking whether that is the reason why it wasn't very good. So we're sort of in the middle of double checking this assumption that it doesn't really matter whether you go wide or deep.Nathan Lambert Yeah, that makes sense. I think that is something that doesn't matter to most people. They're probably very interested in it. Just like how they have these blocks and how do they decide. And it's like just one of us decides.Dirk Groeneveld And it's like, eh, seems right. There are other concerns, right? So we train with FSDP, with 0.3 sharding. So we can try to choose these sizes such that they utilize the GPU in the optimal way.Dirk Groeneveld [00:34:29]: Which has nothing to do with the sort of abstract training dynamics. It's just the practicality of getting this thing into 80 gigabytes of memory. So then those concerns might take over. There's other stuff like all your tensors, all your tensor dimensions need to be multiple of 64, 128, things like that. GPU math stuff. Yeah, exactly.Luca Soldaini [00:34:53]: It's really hard to argue against things that are practically making you run fast. Because it means that if I find something that is 20% faster, your big run trees fast. All the experimental cycles are 20% faster. So it's not very glamorous. But everyone is really happy when we find one of these. Like, oh, this is a shortcut.Dirk Groeneveld [00:35:16]: I find it super glamorous. I mean, when did you ever have such a clear sign of impact that you can say, I wrote this thing and it is not 20% faster? No, the impact is very good. Yes.Nathan Lambert The numbers you're changing are not necessarily glamorous. It's just detailed stuff.Kyle Lo [00:35:34]: I also think the experimental cycle thing is probably the biggest thing for me. What we're seeing consistently is the more experiments you run for a particular idea, the more likely it is to just work out. It's just a function of trying more things.Nathan Lambert [00:35:47]: It seems like in the pre-training, there's very few, like, you just get the idea. I mean, well, I said post-training more. But literally, like, we had a meeting with John Schulman. He was like, everyone, lead labs, train RL and athletes do this. And we got, like, a three-month head start on one step. But pre-training, all that stuff. I think it's evaporated.Kyle Lo [00:36:05]: The human intuition piece is just gone. I think once you do v0, you can kind of do everything with intuition. It's like, oh, look at data. This kind of makes sense. This seems like . And then after you get to, like, v2 of something, it starts becoming really hard to make sense of what is good for a language model or not. So you kind of just need to just try a bunch of stuff.Dirk Groeneveld [00:36:29]: And then there comes a game of stacking improvements that are worth 2% to 5% each.Nathan Lambert I think it's very compounding, at least in all the math, works out over a year. I think I want to ask about MOEs as well, if you have a different thing you want to say. But it's mostly, like, it seems like we have a OLMOE, which, if you look at the plots on paper, it's like this MOE architecture beats all of our own things and carry efficiency. 
But it seems like we had a path we needed to go down to make sure dense works really well and get all these improvements. And then you have to, like, feed back in. And you, like, merge the MOE streams. We have DeepSeek. We have Minimax. There's countless other MOEs that get really high eval scores. Like, they're not as easy to do research with because they have tons of total parameters. And people need bigger clusters to fine-tune them, blah, blah, blah. But it's like, is MOE something that you think we just need to do to make better models?Dirk Groeneveld Well, it's a complicated question, and we haven't quite answered it yet for ourselves.Dirk Groeneveld [00:37:34]: We did investigate doing a bigger MOE. And we found that the engineering is somewhat difficult. And at the time, we came to the conclusion that we could do that engineering, but then who's going to run that thing later? They also have to have a team of engineers on top of it to make sure they can train this.Nathan Lambert What does the engineering look like? It's not, like, CUDA-level kernels. It's how you distribute parameters?Dirk Groeneveld It's a little bit like... It's a little bit CUDA-level kernels in that... If Mega Blocks by itself isn't enough for you, then it gets really complicated. And we ran into that situation where if it had to be significantly bigger than what we did, it just got too complicated.Luca Soldaini [00:38:22]: There is an inference. These very big models that really get advantages by... If you tailor them to, like, where you're going to do inference with them. So if you're a big company, you start thinking about, like, how to batch request, how to, like, serve the model. But if we could do it ourselves for the place where we're running, but then you start thinking, like, oh, folks who want to use their model in their hardware, they're better served by advanced model than also redoing this engineering on top. Like, there is, I think, a clear advantage if you are... Also providing an API to an MOE. Yeah. Very clear cut.Dirk Groeneveld [00:39:10]: It depends on how we think of the product of ALMO. And the number one is still it's an item to be researched. So other people need to be able to train on it and to modify it and so on. And that is just much easier if you have a dense model. Yeah. If you think of it as something that gets put into a product. And people will run tons of issues. But if you have a lot of inference on and you only really care about the final score that it gets, then maybe the MOE starts making a lot more sense again.Nathan Lambert Yeah. That's a good answer. I think it's, like, I think people can fill in the blanks of, like, what we may or may not do.Luca Soldaini [00:39:53]: And I mean... I mean, like, different, like, I'm curious, like, what, like, folks at Llama, the Llama team think about MOE.Nathan Lambert [00:40:03]: If the Meta AI exists, they're 100% going to do an MOE.Luca Soldaini [00:40:06]: I mean, it's interesting, right? It's, like, if they're serving few, if they're expecting that the Llama users are going to be, in fact, one of the better smalls are few large companies that can figure out inference, then MOE makes sense. But if they're thinking about more, like, this model that wants to, it's great if it's adopted by a million developers, large and small, then, you know, they're still going to reach a lot of dense model. Yeah. Exactly. 
That development is so easy, so much easier for people to set up their own inference with a dense model.Nathan Lambert [00:40:40]: Yeah. I think we've gone surprisingly long without asking about data. It's, like, how much more, is it just an infinite hill to climb on data? It's finding good data and filtering bad?Kyle Lo [00:40:53]: I mean, I think it's an infinite hill to the extent to which everything else is also, and you can kind of keep improving, right? But yeah, it's the main threads constantly are. Got to get more data, because if you're working with larger pools of data that you can't actually get easily new data that's not in your distribution, it's probably interesting to study how that adds in. And you have more to work from. So if you have, like, a strict quality filter, you can still get your high token yield if you start with a much larger pool and filter down. So getting more data is really, really critical, especially if you can target specific pockets that you think is missing. You can always keep iterating on better filters. Understanding how those filters affect performance. And everything kind of interacts with each other. Like, safety filters interact with quality filters, interact with deduplication, interact, like, all these together. So there's an infinite, even ordering, search space between these operations. So keep throwing more things at it.Luca Soldaini [00:41:53]: Yeah, it's very much just stacking small improvements. Yeah, shots on goal. I think the way it looks is, like, it's... For each... Now that we have, like, these multiple stages of pre-training, we think about, like, what kind of improvement you want to get from data at all the various stages. Like, clearly, the improvement you want to get from data you put at the end of training is different than the improvement that you want to see at the beginning. It comes with a different set of requirements. One thing that is really useful is... Intuitions are always often wrong. But one thing that it's worth spending time on is figure out... If you have a data ablation idea, what is the fastest way to disprove it, which requires a little bit of experimental design. And then, yeah, you've got to fiddle with, like, especially, you know, when you do the first version so that you can take a very... It's very easy to measure improvements. And then as you start thinking, like, refined version, then, you know, you've got to think of, like, how you measure your improvements or so. But, yeah, it's... There's no, like, big... After you're done, you know, the basic stuff, your V1 is done. There's never, like, a big, like, thread of, like, this is the one data thing. It's more, like, stacking your Lego bricks to get to a better model.Nathan Lambert [00:43:18]: Do you think you can iterate faster on, like, end of pre-training, whatever you want to call it, like, highest quality bit training and the only data? Yeah. Have you, like, started that recently?Luca Soldaini [00:43:28]: I think it depends on the... What we're getting, you know... We... We need a little bit more evidence of this, but it depends on the role of data. Like, it's very much... The reason why we started doing mid-training at all is because we were interested in having base models be primed with certain capabilities that we didn't get during the long pre-training phase. And for those, it's really easy to iterate on new data sources that would improve on those capabilities at the end, pre-trained. 
But during the pre-training phase, the important aspect that we think about is the efficiency of your data: if there is a version of your data where you train on it and the model gets to performance X 20% faster, it means that you can train 20% longer, right? Or run more experiments. And for those, in some cases you can use mid-training as a proxy. In other cases it doesn't quite make sense, so you have to come up with maybe experiments through scaling laws, maybe experiments through some other technique. But yeah, it really depends on what role a data set plays in the various stages of pre-training.
Nathan Lambert [00:44:53]: So it seems like, compared to Dolma 1, which was "do the thing," it's all targeted abilities. It's, like, we want to be better at things, we put people on this. It's targeted abilities, or where we think we can get a lot of data.
Kyle Lo [00:45:05]: Like, a certain data source that hasn't been mined for stuff. Yeah. We have to be opportunistic because it's so hard to get data. And for us, especially if we want to be open with the data, we also have to do our due diligence. Like, we're going to study this data, put all this effort in, and are we still going to be able to share it with everyone?
Nathan Lambert [00:45:22]: If you were in a lab that didn't release data, do you think you could make more progress on it? Like, how much is that, actually?
Kyle Lo [00:45:27]: Oh, yeah. Oh, my God. Such a time sink.
Luca Soldaini [00:45:31]: I mean, it's a little bit of a cost that we take on. Yeah. And this is not even, like, getting data in ways that are not legal, right? You could form partnerships. You know, we have people knocking at our door all the time asking if we want to buy this data set. And they're, like...
Nathan Lambert [00:45:48]: I've been contacted by one ML owner to try to facilitate a data deal.
Luca Soldaini [00:45:52]: Oh, yeah. Twitter. Oh, my God. But the first follow-up is, like, are you cool if we release the data? Of course, they're not. Yeah. So even there, there's plenty of data that you could acquire from people, but then you can't release it. So that's a complication to progress.
Nathan Lambert [00:46:15]: Yeah. This is more of a self-question, but how much do you think mid-training should be a philosophical shift in how we organize teams? Because it's very easy to do. I mean, we've already consolidated our training and data into base, which is not surprising. But this is mostly hypothesizing on what other people do. How close do you think this kind of end-of-pre-training to post-training handoff should actually be?
Kyle Lo [00:46:40]: I think it makes sense as a thing. These things are, in theory, arbitrary, but you can think of it in the extreme: if you had a perfectly oiled machine, you'd have a very smooth transition between pre-training to mid-training to post-training, and there would be no boundaries. That's the theoretical version. You can probably squeeze a ton of performance by smoothing that out. But in the real world, stuff is messy. The real world is: you're three trillion tokens into your base model run, and then you signed a new data deal.
You've got to do something with this, and you're not going to redo your training run. Well, you've got to figure something out. So maybe that's mid-training, right? Mid-training is when you have an opportunistic need for something, or you're training something and someone catches a bug, which happens all the time, like a data bug or some training bug, and you're like, oh, I have to patch it. So then there's a shift, fundamentally. You've got to know how to deal with this. Just because these large training runs aren't super repeatable, and they take so much time that the world state changes all the time, you always need some strategy for how to deal with "I'm near the end of pre-training" versus "I'm near the beginning of pre-training" versus... Yeah.
Nathan Lambert [00:47:47]: It's like, we're obviously trying to solve long context, so this fits right into this. It's like, we're going to do this thing. Where does it go? Some people do it in post-training. Yeah. There's some component during pre-training.
Kyle Lo [00:48:00]: It's kind of just, you have to follow a few recipes and figure out what works for your team. Yeah. And so much of it is just: if it's expensive, try to push it off as much as possible. If it's risky, push it off as much as possible. If you can intervene to get the same result much later, huge win. You can try a bunch more things. If you have to intervene because it's some core thing that has to be baked in at pre-training time, it's a sad space to be in. But then that's the thing where you have to intervene. That's the pre-training data.
Dirk Groeneveld [00:48:29]: There's a big question that I'd love to get an answer to, but I don't even really know how to think about it. The question is, what makes a pre-trained model a good candidate for mid-training and fine-tuning? Because all we really try to do is maximize our metrics, but we don't really know that those metrics are what makes a good step zero for post-training.
Nathan Lambert: I think a relevant thing, and I don't even know if I've told you this, but I don't know how to take action on it, is that we have these multiple stages of post-training, and in the instruction-tune phase we got advice that's like, eh, it could be a little broken. You can have some crap in there. It'll get fixed later on. And it's like, why is that okay?
Nathan Lambert [00:49:14]: It might be the same thing in pre-training. It's more important to get in the right ballpark than the exact right number. Yeah.
Luca Soldaini [00:49:21]: It feels like it's less about how to make a good model for post-training and more about what to avoid so you don't have a bad model for post-training. Yeah.
Nathan Lambert [00:49:33]: There's a whole other question, which is how to make a base model that's easy to fine-tune in general, versus one that, with the right finagling, can get the absolute best numbers. Which, I think, for OLMo, it would be really great to be like, here's a super stable platform. A lot of people have complained specifically that Llama Instruct is hard to fine-tune, and that's after most of the post-training. Because this is where people at companies start. They're like, this is the best open-weight model, I want to add a little thing to it. And a lot of people have difficulty fine-tuning it. It's different at the base, because most people can't do the full instruct thing.
But for researchers, having a stable platform at base is way more valuable.
Kyle Lo [00:50:12]: There's an interesting debate about this, about what makes a base model a good base model, that we've had a bunch of times, and also with other people. It seems like there are two hypotheses on how to think about data as it affects base model behavior. There's one hypothesis, which is: you need quality data so that you don't get any spikes, you have stable training, you have no bugs. And once you pass that level of quality, you want data that's as diverse as possible. It's just about an init for the model, so that it can go in literally any direction. So diversity is the thing. That's one hypothesis. The other one is, it's all domain effects. There's a notion of quality, but you can keep getting more and more as long as you're very clear about what target domain or target application you're after. You just keep getting closer and closer. Well, there's a lot of suite learning. Yeah. Well, this goes into continuing to push. It's just domain effects all the way down. If you're only evaluating on this particular stuff, you can always get your base model to be better for that. Just keep climbing on it to get it more and more similar. As opposed to... and, like, thinking about: I care about this application, this suite of applications, all the way through, from base model on. Can you not kind of have both? I feel like I'm confused about how actual generalization fits into this. It's competing ideologies in terms of, like, if you believe in the first one, then you're all in on diverse data acquisition. And how you set up your team. Yep. You're all in on efficiency and stability for your pre-training, and then you just get as much different data as possible, and you're post-training all the time. If you believe in the latter one, you solve backwards from "this is what I want the model to do," and I make all the changes everywhere to try to squeeze performance out of this class of problem, in the pre-training data, in the mid-training data, et cetera.
Nathan Lambert [00:52:01]: How important do you think the actual multi-tag categorization of every data document is? Like, we know that some of these people have really advanced tagging of all their pre-training documents. Does it essentially pay off, doing that and choosing among them? Which is very much crafting a recipe for your pre-training, versus just getting good numbers, like, just get a good classifier and roll with it.
Kyle Lo [00:52:27]: We have tags. That's fine.
Luca Soldaini [00:52:31]: The tags are useful even if you go with this idea of, like, let's use as much as possible, diversity is important. A lot of web data comes with absolutely no useful metadata. You have URLs, and a URL is something you have to do things on top of to make useful. It doesn't add much. So the more you have in terms of categories and metadata information, the more you can start using this as a tool to try extra techniques on it. Maybe it's a technique to mix your data in a certain way. Maybe it's filtering things out. Maybe it's designing benchmarks and trying to correlate with those. Yeah. Otherwise, you just have this giant bucket with maybe one quality knob.
And it's very hard to make progress if all you can adjust is one number, like, "we cut for quality here." So I'm not surprised that, you know, the big labs all have these tags. I want to know how they use them. That's, like, the part that isn't public. Yeah.
Kyle Lo [00:53:51]: But it's also not just that you have more levers to pull and then, you know, the more things you can try, the better. It's also that you want tags that are actionable, right? So if you had a sensible notion of a tag and you realize, oh, as you keep adding more of this data, pulling this lever, performance keeps going up, at some point you might be, like, we're out of that data, we need to go get more of it. You want that tag to be something that's understandable so you can go and negotiate another deal, do synthetic generation, et cetera, for that type of data.
Nathan Lambert [00:54:13]: Do you think most of the synthetic data gen is for very specific things at pre-training? I mean, it kind of has to be. Probably, yeah.
Kyle Lo [00:54:25]: You can't just be, like, oh, generate some data. Like, I don't know what that procedure would even be.
Luca Soldaini [00:54:30]: It's probably to prime the model for whatever you need during post-training. Like, you know, we've seen, normally with math, it's much better if your model has an elementary knowledge of math to then improve on. It's the same with everything: if it's, like, oh, I want to do RL on this, and the model is completely random on it, you're going to have a very hard time.
Nathan Lambert [00:54:52]: Yeah, I guess that's a good transition. What do you three think post-training is, should be, or is not doing?
Kyle Lo [00:55:02]: It's elicitation.
Nathan Lambert: I'm coming around to this view that it seems that you can extract abilities from the model. I think it's totally elicitation. Like, the Hitchhiker's Guide to Data paper from Google, yeah, that one had, like, a very specific experiment, but it seemed like that was pretty strong evidence towards it. It's, like, you filter out all of this type of data, you literally can't fine-tune that model. You can never recover that. There was a hysteresis to it, right?
Nathan Lambert [00:55:28]: I think if you do more flops, you potentially can. I mean, obviously, we're not talking about, like, o1-style things here. But, like, there are even datasets that have, like, 15 million math-only instructions. Are they going to be able to really start doing a ton of math? At some point, yes. Yeah. But I think for most of it, it's almost easier to operate as if the capabilities are in the model and post-training is there to get them out.
Luca Soldaini [00:55:53]: Sometimes there's this very large set of things that you do in pre-training because you have a sense of how they play into an application. In some cases it's very obvious. Like, for a code model, you want them to do completion, so you're going to add a fill-in-the-middle loss, maybe at the beginning of pre-training. It's like, oh, then I can plan my entire pipeline around that. So far, it seems all about that. I don't think we have cracked a good recipe to do the same for things that are not capabilities but are, like, recalling facts. Oh, yeah. Or, like, long-tail knowledge.
Nathan Lambert [00:56:29]: Yeah.
It's, like, all of us know, or, I don't know, at least people out there have MMLU numbers that go up in X stage. Like, instruction tuning boosting MMLU, I'm like, what are you putting in there?
Dirk Groeneveld [00:56:42]: What do you think of mid-training then? Is that a manifestation or... Mid-training? I think it's still...
Kyle Lo [00:56:47]: I think it's still positive knowledge. I think mid-training is still pre-training, but with strong domain effects. It's just smoothing out the boundary: you have a very, very sharp distribution shift when you do post-training, and we know from, like, kind of ML 101 from the past five, six years that smoothing out the transition between major domain shifts helps. But we don't have a clear example of where it helps with specific knowledge acquisition. Yes. For that, we don't know how to do it. But for capabilities that are really easy to evaluate, things where there's really big progress, it's like, yeah, smooth this out.
Nathan Lambert [00:57:30]: So, like, why is post-training important to the release side? Some of you guys came around to post-training being important for getting traction later on. Is that just, like, the ML ecosystem and how it works?
Dirk Groeneveld: Oh, I mean, the base model is kind of useless, right? Yeah. There's only so many next tokens you need to know about. Yeah.
Luca Soldaini [00:57:50]: But it's like, you know, we've seen papers that use OLMo for research, for example, where the idea for that research only came by comparing the base model with the instruction-tuned model, like the one where folks looked at certain patterns of speech in OLMo 1. Where do they come from? Do they come from pre-training? Do they come from post-training? And, like, even if you just want to do research, it's kind of useful to be able to compare side by side. So it feels wrong to put a model out that, like, cuts the set of problems you can study in half until you have the post-training ready. And it's useful to have it all in one package so you can use it right away.
Kyle Lo [00:58:40]: Post-training is just, like, a really, really long eval loop. Yeah. And it's not like, oh, base model, you know, a few shots on some benchmarks. Like, no, no, no. We eval it by post-training it and then evaluating the post-trained model.
Nathan Lambert [00:58:54]: Yeah. I mean, to some extent, it is kind of true. I mean, that's how we should think about it.
Dirk Groeneveld [00:58:59]: If we could do that cheaply, we would totally hill climb on that metric.
Kyle Lo: I think that's the metric. Because if the base model is a good init for the post-training, which is the model people actually want to use, then we evaluate it on its own and on its status as a good init.
Nathan Lambert [00:59:16]: Yeah. So it's like, how do we... And then the question is, how important do you think research for post-training on the specific checkpoint is? It's like, how important is genealogy versus general recipes? Because I openly think we under-index on using one model. Because much like the path to stability, which is an eight-to-ten-month, really specific thing, I'm guessing if you're really just in a narrower regime, you can just keep kind of tuning these little things. Yeah. Hopefully at some point we can do better with new models. Yeah.
Nathan Lambert [00:59:52]: Okay. We're kind of going to some wrap-up things.
How do you think about release decisions? Like, should AI2 release everything that we ever tried? Or is it, like, when should we actually get models out the door?
Dirk Groeneveld: I mean, I would love to do that, actually. Especially the failed runs. You know, where else could you get a repository of failed runs? I mean, I think it's just a matter of giving other people the possibility of looking into these failed runs and finding out exactly when they failed. In practice, that's super difficult, because just releasing something is hard. You need to upload the checkpoints and translate them into a different format. You have to describe what you were even trying in some way that makes sense to people outside of the org. Give them access to the Weights & Biases and to the logs. And it's just a lot of work. And there's always something else that seems more pressing than that.
Nathan Lambert: Seems like a scaling thing. Like, how much we can share is capped by how we can scale our org. Which, like, we're not going to have a complicated management hierarchy or an entire org that is just support. And everything you upload, you build as a support burden. It's like, literally, we just have seen the envelope grow, grow, grow. The more people use our things, the more support you take on. Like, people want to use it. That's the cost of it.
Dirk Groeneveld: I guess it's a great problem to have. People want to use it. People want to use us.
Luca Soldaini [01:01:15]: And it's funny. To make a checkpoint release that's, like, somewhat useful, you need the person who was involved, right? You need the person who was involved in it to sort of pour their knowledge into a format that people can consume outside, right? Otherwise, we would just open up our S3 bucket with the checkpoints, and it would be utterly useless. Because what if you wanted to know more, the parameters or whatever? So, as long as we optimize what we release, then we have the bandwidth to provide the support around that. If people want the 70B failed run enough, you know, I'm sure we can release it.
Nathan Lambert [01:01:57]: It seems like it's just finding the right medium to release things. Like, I think long technical reports are really good for the stuff that we do, because it just puts everything in one place for people, and it almost makes on-demand stuff easier in the future. Whereas we could just drip models out all the time, but it's not that we can't do that. It's just, like... in terms of making progress on things that are easy to build on, it's probably just not worth it.
Kyle Lo [01:02:19]: In fact, there's even a cost to it, right? The big example here is we had a release of OLMo 1 0724, the July OLMo 1. I think using that for research has been probably one of the tougher models, because it didn't come with a blog post, it didn't come with, like, some docs. And so, yes, it still has weights and checkpoints and everything, but comparatively, even when people come to us, we're usually like, oh, we recommend you use 0424. And now with OLMo 2, we're like, oh, that's the one we recommend, because it has all the documentation. So just dropping something doesn't seem like it really helps.
Nathan Lambert [01:02:56]: I would say we should move faster than, like, the 1-2 iteration. But the in-between is not necessarily even worth it. Which is very odd, when you think about being fully open.
It's just, like, kind of the cost of doing business.
Kyle Lo [01:03:10]: You want to be fully open, but you don't want to add noise, and you don't want to waste people's time, right? So if you drop something that's kind of half done or half baked, and people start spending time on it, only to get frustrated later, you've cost them something.
Nathan Lambert [01:03:22]: How does this relate to, like, how pre-training is changing? Like, do you think we need to invest in... Like, openly, a lot of startups are changing their relationship to training, whether they're going to use Llama or pre-train on customer data, and then we have X compute budget, and does any of this come into play? Or is it, like, all the same kind of thinking? It's, like, continue to hill climb, do what you can, make reasonable trade-offs, and think about who will actually use the models? It's, like, not too different.
Luca Soldaini [01:03:54]: I think that, for me, the cutoff point is, like, is there something useful and generally interesting to add if you pre-train? Take the case of Llama: all these mid-train things that we concluded, it couldn't have been as clean if we had started with an already pre-trained model. So it's, like, is there really something useful to add to the conversation if you pre-train? We might get to the moment when the answer is no, like I was saying. But it feels like there's still value to add to the conversation. At least on the research side of pre-training, there is, to me, a question of, like, we know how to help researchers; we want to help more than just researchers with the models we put out. And if we think there is an application, or a use case, where we can do a very good job by starting with someone else's pre-trained model, we shouldn't waste compute on pre-training from scratch. But it's an ever-evolving question, really. It's, like, I don't know. We can make decisions six months out, maybe? Maybe a year?
Kyle Lo [01:05:24]: Well, that's what I would say.
Kyle Lo [01:05:27]: I know. You're the pre-trainer. You're the hardcore one who's pre-trained some models.
Dirk Groeneveld [01:05:34]: There's lots of runway left in pre-training. The big labs are fairly conservative because they have to be. But that doesn't mean that we're done. I mean, it's not that we're done. I also feel that the point of OLMo is to make pre-training research accessible to more people, because even if you don't have the resources to pre-train the whole thing from scratch, you can still use our checkpoints and use our code to prove out some sort of improvement. And as we've seen in other areas, even when Microsoft tries to push .NET or Apple tries to push Swift or whatever, and it's a really big effort for them, the open-source community says, I don't care, we're going to use Python. And Python wins. So if you can somehow enable the vast resources of a million people banging on a thing, even a company like OpenAI or Meta cannot compete with that. And with OLMo, I'm hoping to capture that a little bit, to capture some of the open-source enthusiasm and the academic enthusiasm.
Nathan Lambert: Do you think it'll get better this year? Because a lot of academics are bringing up tens to hundreds of H100s in clusters around the country. Like, before, it was just Harvard had 500, and MIT, or whatever. But now it's the long tail of universities.
Like, there are a lot of people.
Dirk Groeneveld [01:07:12]: And then, you know, if you have 200 H100s, you can at least establish scaling laws for your idea. So what I'm hoping is someone uses OLMo to try some new thing and establishes the scaling laws up to a 3B model or whatever. Then we take it and we prove it out up to 30B or whatever our compute allows. And if it still works, then, since it's all open, they take it. And they win. Let them win. Yeah.
Nathan Lambert [01:07:36]: I mean, they would never tell us that they'd win. Yeah. Like, what do we need to achieve this? Do we need resources and compute and certain people? Do we need more feedback from the community? Do we need feedback from people at labs telling us which things to do?
Kyle Lo [01:07:48]: Compute and people, for sure. That is undeniable. If you have more compute, you can try more things. We can go bigger. If you have more people just trying more things, especially on our artifacts, we'll just learn so much more. We spend so much time on guesswork, trying to piece together things from other people's pieces, and sometimes it's nice to just get something out of it. So if they did this on OLMo, we can immediately start working off of it. So people, compute, always, for sure.
Luca Soldaini [01:08:20]: One thing is, we get a lot of feedback, and it's like, I really like AI2, I would like to use OLMo, but it's missing this feature, which is great. I love that feedback. It's helped us a lot in prioritization. If we could get more, I would love to also get aspirational feedback, like, none of the models is doing this, but I have a good use case for it. Those, to me, are always very inspiring to read. Whether we'll do it or not, it's a question of, can we do it, and how does it work with other things?
Kyle Lo [01:08:55]: But those are always very, very welcome. You know what would be really cool? I think what would be really cool is more projects in the space that you can't do unless you have some sort of fully open constellation of artifacts. Yeah.
Nathan Lambert [01:09:09]: The thing that, Dirk, does anyone ever do the thing where you load the model onto one GPU and, like, iterate through the batches to find the one that... what happens when it blows up, or like when a loss spike happens?
Dirk Groeneveld: I mean, to some degree we did this ourselves. Yeah. But it's something that people can do. It's not like we wrote a paper about that, but yeah, I would love to see a detailed write-up of, like, millisecond by millisecond, what happens in attention when a loss spike happens. How does it actually happen? These are the things that people can do.
Nathan Lambert: And it's like, you just have to keep zooming into a specific level of detail in what happens.
Dirk Groeneveld: Yeah. I mean, right now someone is using the various checkpoints to see how a certain metric that we're interested in develops throughout pre-training. And you can do that with fairly minimal compute. You don't have to be AI2. It's like one of my favorite weird language model papers, the Sander Land "Fishing for Magikarp" paper. And it's like, you can get much more actual feedback looking at weird tokenization. Yeah.
You can get much more actual feedback looking at weird tokenizer impacts and tokenizer-data interactions on OLMo than just poking at API models and trying to figure it out.
Kyle Lo [01:10:20]: There's also a lot of really cool work looking at the checkpoints that we have, with the data batches, and trying to do something like, okay, let's replay everything between these steps by injecting some different data or manipulating the data between these two checkpoints, just to see how it turns into something different. How big of a fork does it go through? Yeah.
Nathan Lambert [01:10:39]: Like, if you add the same intervention, how big does the fork get? Exactly. Just to see how it turns into something different.
Kyle Lo [01:10:43]: So, does it reconverge? Or early in pre-training versus later in pre-training, the same interventions, messing with the data. That stuff is really cool.
Dirk Groeneveld [01:10:49]: I mean, I've complained about this for a long time. Grad students, I think, are a little bit hesitant to go into pre-training stuff because they need to publish four papers a year, and it's pretty difficult to do that when your cycles are so long. But on the flip side, it's a bit less busy a field. Yeah. So you're less likely to get scooped, if the field doesn't change out from under you while you're in the middle of your project. Yeah. Post-training is not quite like that; so much happens on that side.
Nathan Lambert: It makes no sense. It's just like, pick something you want to do and people will probably do it. That's okay.
Dirk Groeneveld [01:11:31]: So I'm hoping that by publishing all of this stuff and making all the checkpoints available, and the data and so on, we can enable more people to work on that side as well.
Nathan Lambert: Yeah. Anything else you guys want to add?
Kyle Lo [01:11:49]: Like, comment, subscribe.
Kyle Lo [01:11:52]: Yeah, I think that's it.
Nathan Lambert [01:12:01]: Okay. Thanks for listening. If you have questions for any of us individually, the Bluesky and Twitter handles for everyone in this podcast are below. And you can reach out to the general OLMo contact at allenai.org; that's an email address. We're really happy to help, and we want to keep building this kind of open scientific ecosystem of language models. So, all the best. Bye bye. Bye. Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    1:12:43
  • DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
    Full post for links, images, etc: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1I have a few shows to share with you this week:* On The Retort a week or two ago, we discussed the nature of AI and if it is a science (in the Kuhn’ian sense)* I appeared on Dean W. Ball and Timothy B. Lee’s new podcast AI Summer to discuss “thinking models” and the border between post-training and reasoning methods. Listen here.* Finally, a talk I gave at NeurIPs on how I think about post-training for AI applications is now public.This post is likely getting cut off in email inboxes — I recommend reading online by clicking on the title!Yesterday, January 20th, China’s open-weights frontier AI laboratory, DeepSeek AI, released their first full fledged reasoning model. It came as:* A flagship reasoning language model, R1, trained via a 4-stage, RL heavy process. It is MIT-licensed which means companies and researchers can build upon and train on its outputs to accelerate the development and deployment of reasoning language models (RLMs).* An RL-only reasoning model trained directly from their V3 base model, R1-Zero (used to create training data for full R1).* A suite of open-weight models finetuned with supervised finetuning (SFT) data derived from R1 (similar data to one of their intermediate training steps).* A technical report detailing their RL training methods.* Models are available at chat.deepseek.com (via DeepThink) and in their new app.This post is less about the evaluation results (which, of course, are extremely good and shown below), but rather about how training is done and what it all means.This is a major transition point in the uncertainty in reasoning model research. Until now, reasoning models have been a major area of industrial research without a clear seminal paper. Before language models took off, we had the likes of the GPT-2 paper for pretraining or InstructGPT (and Anthropic’s whitepapers) for post-training. For reasoning, we were staring at potentially misleading blog posts. Reasoning research and progress is now locked in — expect huge amounts of progress in 2025 and more of it in the open.This again confirms that new technical recipes normally aren’t moats — the motivation of a proof of concept or leaks normally get the knowledge out.For one, look at the pricing of these reasoning models. OpenAI was likely charging more for its model due to the costs of long-context serving and being the only model in town, but now o1’s pricing at $15 per million input tokens / $60 output looks out of place relative to R1’s pricing at $0.55 per million input tokens / $2.19 output (yes, o1-mini is cheaper at $3/$12 per million, but still almost a 10x difference). The price war that is coming for reasoning models will look like the Mixtral inference price war from 2023.With o3, OpenAI is likely technically ahead, but it is not generally available nor will the weights be available anytime soon. This points to the first time since Stable Diffusion’s release that the most relevant and discussed AI model is released with a very friendly license. 
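As a quick back-of-the-envelope on the list prices quoted above (a sketch for illustration only; the 1M-input / 1M-output workload is a made-up example, not a benchmark from the post):

```python
# Rough cost comparison using the per-million-token prices quoted above.
prices = {
    "o1":      {"input": 15.00, "output": 60.00},
    "o1-mini": {"input": 3.00,  "output": 12.00},
    "R1":      {"input": 0.55,  "output": 2.19},
}

workload = {"input": 1_000_000, "output": 1_000_000}  # hypothetical usage

def cost(model: str) -> float:
    p = prices[model]
    return sum(p[kind] * workload[kind] / 1_000_000 for kind in ("input", "output"))

for model in prices:
    print(f"{model}: ${cost(model):.2f}")
# o1 lands around $75, o1-mini around $15, and R1 around $2.74 for this workload,
# which makes the price gap discussed above concrete.
```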
Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.We don’t entirely know how these models will be used in the future beyond code and math, but noises are constantly bubbling up that OpenAI’s o1-Pro is the best model for many more challenging tasks (I need to try it myself before making definitive recommendations).The most useful post to write now is one that establishes the research area, the do’s and don’ts, and the open questions. Let’s get into the details.The DeepSeek R1 training recipe for reasoningThe training of R1 comes in 4 stages:* “Cold-start” of supervised finetuning on synthetic reasoning data from the R1-Zero model.* Large-scale reinforcement learning training on reasoning problems “until convergence.”* Rejection sampling on 3/4 reasoning problems and 1/4 general queries to start the transition to a general-purpose model.* Reinforcement learning training mixing reasoning problems (verifiable rewards) with general preference tuning reward models to polish the model.Below, the post breaks down each training stage into its core components, insights, and open questions.The winds of o1 replication have been blowing strongly away from any sort explicit search (especially at inference time). It really was, and is, a language model with the new reasoning behaviors coming from a lot of RL training.Before we start, remember that to do this reasoning training well you need a very strong base model with long-context capabilities. Much like for standard post-training, we don’t really know what traits of a base model make for one that is more suited for direct RL training.Step 0. Training R1-Zero to initialize R1 with synthetic dataDeepSeek R1 Zero will be best known as the first open model trained with “large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step.” Rumors had mentioned this for o1, but understanding how it worked wasn’t clear. This is a funky model that DeepSeek reports will sometimes change languages in reasoning or show signs of other reliability issues.The minor usability issues with R1-Zero show why more than just large-scale RL is needed to train a fantastic reasoning model, but the RL part is the key to unlocking the reasoning behaviors we are searching for.They include the most interesting results for R1-Zero, including the plot I’ve been asking for of RL-training time scaling. Since o1’s release, everyone has been obsessed with the plots showing how inference time is correlated with evaluation performance. Inference time is far easier to elicit (or force by using a framework like Monte Carlo Tree Search), but showing training time improvements via RL is the real foundational result. This is the result I’m searching for in my research.And an unsurprising, yet very satisfying plot of length growing with training. This could be mixed with the above plot to make one of the “inference time scaling” plots we have seen many versions of with less clear methods.In both of these plots, it looks like the numbers could still be going up if they let the RL cook longer. With the pace of progress so high, these laboratories get more gains by ending the jobs near saturation and starting the next experiment instead of seeking that last 1%.Most, if not all, researchers will skip the step of training an R1-Zero style model because they don’t need to. 
DeepSeek made it clear that their “cold start” of SFT reasoning traces makes the final R1 model better — this is unsurprising, as they want R1 to be a certain type of instruction-tuned model. It’ll help avoid some of the “RL oddities” in R1-Zero that DeepSeek mentions like changing language mid-generation.Still, the area of RL-on-base-models should be studied further. The way that R1-Zero can be trained is quite clever as most base models without any instruction tuning have a major issues with rambling and never generating a stop token. R1-Zero avoids this with a system prompt telling the model to generate HTML tags. Additionally, I suspect this type of training wouldn’t work on older base models that don’t have some standard post-training style instruction data in the pretraining corpus. For example, in OLMo 2 we had some MATH instruction data in the annealing mix. Just a few instructions will let this system prompt work.In fact, the trend of increasing generation length via RL training could be even stronger when training directly from a base model rather than a standard post-trained model that doesn’t have a verbose chain of thought style. In order for RL to really start cranking up the response length in such an instruction-following model it will have to unlearn a certain response length that was baked in. For example, in Tülu 3’s final stage of RL finetuning, the phase where the response rate first goes down could be the barrier of misalignment between a larger round of SFT training before a smaller RL setup.Zooming in on the x-axes of these R1-Zero plots, you can see that they’re doing 1000s of “RL steps.” RL step in this case refers to the model update step, which comes after multiple generations are made for the prompts in the batch and then answers are verified. This is a large amount of RL training, especially with such a large model. For reference, in our Tülu 3 work, we finetuned our models for 100s of steps normally, and the biggest models we are releasing soon only trained for ~50 steps of RL.This is scaled-up RL relative to existing literature. R1 proper surely uses a similar setup, but DeepSeek did not include the same details, so the rest of this post relies more on explicit text in the paper.Step 1. Reasoning SFT “Cold Start”In order to improve the readability (i.e. help maintain formatting) and increase the final performance of the final reasoning model, DeepSeek performs a small amount of supervised finetuning on the original base model with “a few thousand” filtered completions from the R1-Zero model. This involves a few tricks (none of which seem essential, you just need some of this data), such as:Using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.For replication efforts, any of these can be done. In fact, using DeepSeek-R1 itself is likely the easiest way.This phase readies the loss landscape of the model to make the “emergent” behaviors like “wait, let me check my work” or “that was wrong” come forth more easily in RL training.Step 2. Large-scale RL for reasoningAs a reminder, RL for reasoning models is built on a simple idea where you should reward the model for getting correct answers to problems where you can check if it has a correct answer. 
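As a minimal illustration of what "checking" can mean here, below is a toy sketch of a verifiable-reward function for math-style prompts. This is my own example, not DeepSeek's grader; the tag format and the exact reward values are assumptions for illustration.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer out of an <answer>...</answer> block, if present."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Toy reward: 1.0 if the extracted answer matches the reference, else 0.0.

    Real graders normalize math expressions, run unit tests for code, and so on.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if answer == reference_answer.strip() else 0.0

# Example usage with a made-up completion:
completion = "<think>2 + 2 = 4</think><answer>4</answer>"
print(verifiable_reward(completion, "4"))  # 1.0
```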
A basic feedback loop of this (shown as a diagram in the original post) is: sample a completion, check the answer, and reward the model when the check passes. Exactly what the "reward" is here (the same question applies for R1-Zero) isn't detailed. DeepSeek mentions three reward components during the reasoning phase of RL:
* Accuracy rewards: These are score bonuses if the response to a prompt is correct. I've been referring to these as "verifiable" domains, and in OpenAI's Reinforcement Finetuning this is handled by their graders. TLDR: If the answer is correct, the reward is positive; if not, it is 0.
* Format rewards: These are rewards (or penalties if not satisfied) to check and make sure that the model follows the correct formatting of <think></think> and <answer></answer> tags for stable inference.
* Language consistency rewards: A reward is added to the model if the language of the answer is 100% matching the language of the question. DeepSeek writes that this additional reward shows a "slight degradation in the model's performance," but better human preferences. It's added to make the model nice to use, which is a wonderful reminder that evaluation scores are not all that matters.
The first reward here drives the majority of the learning and the other two are guardrails for creating a stable model (which is not to say they aren't important implementation details, but rather that the first one is necessary and the others may not be). To optimize this reward, DeepSeek uses the RL algorithm that they introduced, Group Relative Policy Optimization (GRPO), which is the PPO update rule with a different value approximation method based on Monte Carlo advantage estimates rather than holding a separate value model in memory. The most likely explanation for this choice (much like how OpenAI has always used PPO) is that it is the mature implementation in their infrastructure. The DeepSeekMath paper includes a fantastic side-by-side comparison of PPO and GRPO (fine to skip if you only care about the big-picture recipe).
The nature of the reward setup (and the data) is the key to this sort of reasoning training, and many of the small RL details can be substituted for each other.
Much like the DeepSeek V3 paper, the details of what data they used to train the model are not included here. This is absolutely crucial and almost certainly involves many, many verifiable prompts with answers. In order to study these models the community needs open versions of these datasets.
I would've loved to see details of their RL infrastructure (similar to the details in the DeepSeek V3 paper), as many people are looking to build on these models. RL training requires holding multiple models in memory and alternating between generating, verifying, and taking loss steps. As Sasha Rush says, "We need to code up verifiers ASAP," which is what we are trying to do at Ai2 building on Tülu 3, and we could use a lot of help with the open-source code. A good approach for entities interested here is to develop tooling and data for one domain at a time.
These first two steps are not new but rather scaled-up versions of ideas people have been discussing extensively. The final two steps DeepSeek details in the paper are new applications of known techniques to help take their raw reasoning performance and "train a user-friendly model."
Step 3. Rejection Sampling to introduce general abilities
Rejection sampling is a technique where you generate completions from a model, rank them via a reward model, and then finetune the original model (normally with the supervised finetuning loss) to improve performance on a variety of tasks.
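To make that loop concrete, here is a hedged sketch of basic rejection sampling as just described. This is my own illustration, not DeepSeek's pipeline; the generation and scoring callables and the samples-per-prompt value are placeholders.

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # returns N sampled completions for a prompt
    score: Callable[[str, str], float],         # reward-model or verifier score for (prompt, completion)
    samples_per_prompt: int = 16,
) -> list[dict]:
    """Keep the best-scoring completion per prompt to build an SFT dataset."""
    sft_examples = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        best = max(completions, key=lambda c: score(prompt, c))
        sft_examples.append({"prompt": prompt, "completion": best})
    return sft_examples

# The resulting (prompt, completion) pairs are then used for ordinary supervised
# finetuning of the original model, which is the "rank and finetune" step above.
```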
It’s one of the standard post-training tools used by Llama 3 and many others.DeepSeek uses rejection sampling to begin to introduce general capabilities back into the model. It is also the one stage where they include data numbers — 800K completions total, split as 600K for reasoning and 200K for general chat problems. The 800K number is not surprising to me given this is just a late-stage SFT training, but it is similar in size to the ~1M prompts we used in the Tülu 3 SFT mix which is the ballpark for leading post-training recipes.The details in the paper are largely around methods for generating responses to prompts and filtering to prioritize high-quality training data. In order to bring more domains into the scope of abilities for the model, DeepSeek has a variety of tricks, such as:* Using generative reward models (i.e. LLM-as-a-judge) to verify answers to questions that may not be explicitly verifiable,* Data from the DeepSeek-V3 standard post-training pipeline, and* Standard (nonverifiable) chat data augmented with extended chain of thought before answering to help the model generalize from reasoning training to broader use cases.All in, we currently have very few details here and there is a lot of open space to learn (and likely improve).Step 4. Final RL training for general useFinally, DeepSeek R1 goes back to reinforcement learning, which really seems to be how most finetuning is ending these days. The second RL stage is “aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities.”In order to do this, they do RL training that mixes prompts from the verifiable domains (as done for R1-Zero) and prompts for standard RLHF preference tuning. In order to do this they have multiple reward models and build upon their post-training recipe in DeepSeek V3.This is not easy to do and involves many questions: What is the right data balance? Can you use an off-the-shelf existing reward model or does it need to have seen long reasoning traces? Are there additional steps needed to not degrade performance? And so on.As this grows into a larger area of research and development these questions will slowly be answered.As this post has transitioned into the later stages of training, it is clear that many details are unknown. We have the general shape of how to sequence things and will fill in the details from here. I have a very long stack of reasoning-related research papers to poke through, and while they came before DeepSeek R1, they still will point toward answers.All of this is solvable, as proven by how quickly DeepSeek went from the o1 release to matching performance with an open weights model.Interconnects is a reader-supported publication. Consider becoming a subscriber.Discussions and next stepsThe DeepSeek R1 report has an entire other subsection dedicated to its distillation experiments, where it took completions from the R1 model and finetuned existing open-weight models with them to boost performance. 
This is a fantastic service for them to release this and provides a solid baseline for RL experiments on smaller models to try and match in the near future.The discussion in the paper on how large models are required to see the biggest reasoning gains (and generate effective synthetic data) is likely the biggest open question:First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.As smaller models continually improve over the years, it is likely that the same type of training could work on something like Llama 5 or 6 8B. It leaves us with the same, open question as to why different abilities “emerge” at larger models. Scaling laws are the reasons that each generation’s frontier models tend to be the largest models available. The exciting form of this question for 2025 is: How small will the slow progress of language modeling research drive advanced reasoning capabilities?Every so often a paper comes around that makes the path forward clear. The last time I felt this way was with the Llama 3 report for post-training, which solidified into the Tülu 3 paper.Soon, I’ll comment on…* Distillation of reasoning traces (as done in the R1 paper),* The demise of process reward models (PRMs) and Monte Carlo Tree Search (MCTS),* Some things in the DeepSeek paper, like the “Aha” moment and over-indexing on human priors, that annoy me,* The new reasoning research coming out from academia,* The other reasoning model that dropped yesterday — Kimi 1.5,* The biggest application of Tülu 3 RLVR yet, and* All the other ideas that are under debate in the reasoning model space.R1 is surely not the only way to train these models, but it is the recipe that people will build off immediately. Let’s get cranking on more datasets and infrastructure.For those new here, you can check out the Inference & Reasoning tag on Interconnects! Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    19:33
  • Let me use my local LMs on Meta Ray-Bans
    Full post for images, etc: https://www.interconnects.ai/p/to-meta-ray-ban-local-aiWith the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that already are public makes this obvious from a functional perspective rather than a marketing perspective.Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work, we just don’t know what that will feel like. AI should be fun, flexible, and available.Meta’s Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them — the form factor would have caught on eventually, but AI was the catalyst to accelerate adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.Using the AI in the Ray-Bans is much like using a protolithic chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these generally and contextualize the AI they’re delivering. The product excitement cumulatively feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.The form factor for the Meta Ray-Bans is fantastic and drives this connection. I’ve been legitimately excited to use them (albeit, much more during sunny Seattle summers relative to now), and it immediately made sense when taking them out of the packaging. My best use has been for outdoor activities, taking photos and videos without needing to fuss with a phone and communications. An example video is below -- like most things, it has a learning curve.Here’s a photo from that outing:Or a video:Clearly, they’re fine.What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would’ve been more reluctant to buy them if not for the excitement about AI.As of writing this, I would much rather have “Apple Ray-Bans” because of a seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (as I avoid an Apple Vision Pro Digression).This does not mean the long-term story of many new devices won’t be the AI.AI, in the recent past (and likely in the near future), left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god. 
They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill climbing the AI part for them.The AI of the Meta Ray-Bans (and the other devices I started with) being primarily in the cloud is a drag but is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.What’s the state of local LMs?The AI angle for this post is to prompt the question: What do people actually use local, or on-device, language models for? What are they driving innovation of?The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whom API models refuse their use cases. Most people doing this are not directly innovating on local models in a way that dictates meaningful improvements to underlying AI innovations. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.Phrasing the reasoning starting from the frontier, cloud models most people are used to, rather than what we want, it goes as: Local models shouldn’t try to be our general use case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will actually get better too — an actual potential feedback loop from open AI models.Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google’s new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?Hillclimbing with open, local language modelsGiving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try and get our multimodal LM Molmo on them. 
That attempt didn’t make it far.Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even only those that are approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.While we still don’t have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they only are useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the AI situation is on my version of the Ray-Bans by default and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model to just push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple’s product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple’s integration.On the modeling side, there’s no doubt we have step-change improvements coming to those used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where at 100X smaller model can match or surpass performance in a few years — will come for the Ray-Bans and all other sorts of AI applications.The big picture arc of technologyStarting in 2025, I’m excited about the breadth and quantity of profound, new technological experiences I’m having. Some of them, like ChatGPT Advanced Voice Mode, haven’t really landed for me (even though they’re extremely impressive to non-tech non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I’m sure I can add reusable rockets to the transformations I’ve embraced.The last technologies sparking these joys were the likes of the iPod and the iPad.Every person I take to ride a Waymo for the first time has a similar experience of joy.This year we may also have new models that solve arbitrary internet tasks for us in the background.The future is here and we’re living in a time where it’ll be more evenly distributed. Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    10:21
  • (Voiceover) DeepSeek V3 and the actual cost of training frontier AI models
    Original post: https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of
    Chapters:
    00:00 Opening
    03:15 DeepSeek’s learning efficiency
    06:49 DeepSeek’s compute transparency and reality
    Figures:
    Fig 1: Benchmark Results
    Fig 2: ChatBotArena Results
    Fig 3: Compute Usage Table
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    17:06
