Published Mar 2, 2026
In a previous blog post, we presented the architecture design and practical pretraining experience behind WeLM-V4-80B-A3B. This post presents a different path to improving model capability: without increasing the parameters of the Transformer backbone, we replicate the Vocab Embedding $n$ times to expand the sequence length by a factor of $n$, granting each token $n$ times the effective computation in a single forward pass.
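The teaser only sketches the mechanism at a high level, so the snippet below is a minimal PyTorch sketch of what such embedding replication could look like, not the WeLM implementation. The per-replica embedding tables and the mean-pooling step that collapses the $n$ positions back to one vector per token are illustrative assumptions; the post itself does not specify how the replicas are initialized or merged.

```python
# Minimal sketch (assumptions, not the WeLM-V4 code): replicate each token's
# embedding n times along the sequence axis so an unchanged backbone spends
# n forward positions of compute per token, then pool the replicas back.
import torch
import torch.nn as nn


class ReplicatedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_replicas: int):
        super().__init__()
        self.n = n_replicas
        # one table per replica is an assumption; sharing a single table is equally plausible
        self.embeds = nn.ModuleList(
            [nn.Embedding(vocab_size, d_model) for _ in range(n_replicas)]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len * n, d_model),
        # with the n copies of each token placed at adjacent positions
        reps = torch.stack([e(token_ids) for e in self.embeds], dim=2)
        b, s, n, d = reps.shape
        return reps.reshape(b, s * n, d)


def pool_replicas(hidden: torch.Tensor, n_replicas: int) -> torch.Tensor:
    # collapse the n positions belonging to each token back to one vector (assumed: mean)
    b, sn, d = hidden.shape
    return hidden.reshape(b, sn // n_replicas, n_replicas, d).mean(dim=2)


if __name__ == "__main__":
    emb = ReplicatedEmbedding(vocab_size=32000, d_model=64, n_replicas=2)
    ids = torch.randint(0, 32000, (1, 8))
    x = emb(ids)             # (1, 16, 64): sequence length doubled before the backbone
    y = pool_replicas(x, 2)  # (1, 8, 64): one vector per original token again
    print(x.shape, y.shape)
```

The expanded sequence would be fed through the backbone in place of the original one; only the embedding and pooling stages change, which is why the Transformer parameter count stays fixed while per-token compute grows with $n$.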
Read more
Published Jan 31, 2026
Taking the previous-generation model WeLM-V3-258B-A22B as an example, this post shares our team's key lessons from the post-training stage.
Read more
Published Jan 21, 2026
In this post, we share our findings on building effective pre-trained LLMs with moderate resources. We show that an 80B-A3B MoE model trained on fewer than 14 trillion tokens performs competitively against similarly sized and larger systems, and that a depth-upscaled 130B variant, given a modest amount of continued training at a small learning rate, delivers a clear performance lift over the 80B baseline.
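The summary mentions depth upscaling only in passing, so the sketch below illustrates the general idea under stated assumptions rather than the procedure used for the 130B model: grow a trained stack by copying existing Transformer blocks in order, then continue training the deeper model at a small learning rate. The growth factor, the duplication pattern, and the toy blocks are all placeholders.

```python
# Minimal sketch of depth upscaling (assumed pattern, not the WeLM recipe):
# build a deeper stack by deep-copying trained blocks, then fine-tune it further.
import copy
import torch.nn as nn


def depth_upscale(blocks: nn.ModuleList, growth: float) -> nn.ModuleList:
    """Return a deeper stack whose blocks are copies of the existing ones, kept in order."""
    target = int(round(len(blocks) * growth))
    new_blocks = []
    for i in range(target):
        # map each new position back onto a source block; some blocks appear twice
        src = blocks[int(i * len(blocks) / target)]
        new_blocks.append(copy.deepcopy(src))
    return nn.ModuleList(new_blocks)


if __name__ == "__main__":
    # toy stand-ins for Transformer blocks; growth=1.5 is purely illustrative
    shallow = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
    deep = depth_upscale(shallow, growth=1.5)
    print(len(shallow), "->", len(deep))  # 4 -> 6
```

Because every new block starts from trained weights, the upscaled model only needs a short continued-training run at a small learning rate to adapt, which is consistent with the "modest amount of training" described above.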
Read more