Each time i look at some open pretrain codebase, and i look under the hood, it ends up being a fork of megatron-lm.
Question: Is there any serious open source pretraining codebase that is not a megatron-lm derivative?
Andrew Carr proposed PyTorch's torchtitan as a potential alternative
Users praise torchtitan and FSDP2 as strong alternatives to Megatron-LM for open-source LLM pretraining, citing precise control and a positive overall experience.
@giffmana Olmo Core is from scratch, @mechanicaldirk, @epwalsh, @tyleraromero & @AkshitaB93 main contributors.
Scaled up to dense 70B models + hybrid arch, MoE not in yet afaik.
https://github.com/allenai/OLMo-core
@giffmana In terms of use by people other than the org that created the library, I’m under the impression that the list is (high to low):
- Megatron(-lm/-DeepSpeed/NeMO)
- GPT-NeoX (Megatron-based but has moderate divergence by now)
- TorchTitan
- Lingua
@andrew_n_carr Yeah i liked torchtitan last time i looked at it! Afaik it wasn't used in any bigger open model training yet though, right?
@giffmana torchtitan doesn't build on megatron, I believe, but I'm not sure if it fits your definition of `serious` here
@giffmana torchtitan, olmo-core are great!
also worth noting that i think the nvidia team doesn't use megatron-lm to train models anymore, they use megatron bridge (which is based on megatron-core, a submodule of megatron-lm)
@BlancheMinerva nice list, yeah the answers seem to be converging to these four so far, plus marin/levanter for jax.
@soldni @mechanicaldirk @epwalsh @tyleraromero @AkshitaB93 Ah nice, thank you, i will have a look at it!

@giffmana https://github.com/google-research/big_vision

@giffmana @stochasticchasm open the goose

@jeandut14000 Good suggestion, i should read it! Has it been used to train an open weights model? (Serious question, idk much about it)
@giffmana @soldni has anyone outside of AI2 trained a 7B+ model on OLMo-core?

@giffmana I have something for you but it is (sadly) not open-source :/

@giffmana Lingua ?

@giffmana @jeandut14000 The models from the paper were trained on it because some of our collaborators wanted to. It ended up being like 10% slower than if we had used GPT-NeoX though.
https://arxiv.org/abs/2506.05209

@giffmana I know at least @ZeyuanAllenZhu is using it to showcase his infamous Canon layers, but we are in the realm of very small model sizes intended for research purposes (<=8B). I use it quite heavily myself; it is a good base to fork.
@BlancheMinerva @giffmana yes, lemme DM you

@giffmana Built my own for our purposes. Needed a more extensible harness for the weird architectures we've made.

@giffmana @andrew_n_carr Arcee's Trinity Large and Solar 102B do!

@giffmana torchtitan is great! I am not up to date regarding MoE performance compared to Megatron, but my experience is positive in general. There is also an AMD-optimized fork. https://github.com/pytorch/torchtitan

@TimDarcet Yeah was mostly looking at public things, but you got me curious anyways, ping me the link on corp :)