Excited to share the latest Olmo model: Olmo Hybrid. This is a model with gated delta net (GDN) layers in a 3:1 ratio with full attention, following other recent developments like Qwen 3.5 and Kimi Linear. It's incredible timing to release a fully open model so people can study how these architecture changes impact the full stack.

Personally, I learned a lot in making the post-training work. Even with identical pretraining data, post-training behaves very differently! In particular, the OSS tooling for these new architectures is really limited, and they run much slower in practice than standard transformers or popular models like DeepSeek MoEs. This is work we can do together to keep pushing the frontier of efficient, open models.

This work was led by @lambdaviking @tyleraromero and others. I got to play a smaller part in making post-training work. Super fun project! I've written up a blog post that explains why this matters and why hybrid models didn't work a few years ago when Mamba was super popular. Plus, this paper is a great entry point for modern deep learning / language modeling scaling theory. Enjoy and send feedback!
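To make the 3:1 interleaving concrete, here's a minimal sketch of how that kind of hybrid stack can be assembled: every 4th layer is full attention, the rest are GDN layers. The `GDNBlock` and `AttentionBlock` internals below are hypothetical stand-ins, not the actual Olmo Hybrid code; only the layer-ratio pattern is what the post describes.

```python
import torch
import torch.nn as nn

class GDNBlock(nn.Module):
    """Stand-in for a gated delta net layer (linear-time sequence mixing)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, d_model)  # placeholder for the GDN recurrence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(self.norm(x))

class AttentionBlock(nn.Module):
    """Stand-in for a full softmax-attention layer."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

def build_hybrid_stack(n_layers: int, d_model: int) -> nn.ModuleList:
    """3 GDN layers for every 1 full-attention layer."""
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % 4 == 0:      # every 4th layer: full attention
            layers.append(AttentionBlock(d_model))
        else:                     # the other 3: GDN
            layers.append(GDNBlock(d_model))
    return layers

if __name__ == "__main__":
    stack = build_hybrid_stack(n_layers=8, d_model=64)
    x = torch.randn(2, 16, 64)    # (batch, seq_len, d_model)
    for layer in stack:
        x = layer(x)
    print([type(l).__name__ for l in stack])
    # ['GDNBlock', 'GDNBlock', 'GDNBlock', 'AttentionBlock', ...]
```

The appeal of this ratio is that most layers get linear-time mixing while the occasional full-attention layer preserves global token access; the exact placement and internals in Olmo Hybrid are in the paper.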
@interconnectsai Much of the compute for this project was provided by @LambdaAPI. Without it, Olmo Hybrid wouldn't exist. Thank you for supporting the open community.