Published Mar 2, 2026
In a previous blog post, we presented the architecture design and practical pretraining experience behind WeLM-V4-80B-A3B. This post presents a different path to improving model capability: without increasing the parameters of the Transformer backbone, we replicate the Vocab Embedding $n$ times to expand the sequence length by a factor of $n$, granting each token $n$ times the effective computation in a single forward pass.
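The teaser only sketches the mechanism at a high level, so the snippet below is a minimal PyTorch sketch of what such embedding replication could look like, not the WeLM implementation. The per-replica embedding tables and the mean-pooling step that collapses the $n$ positions back to one vector per token are illustrative assumptions; the post itself does not specify how the replicas are initialized or merged.

```python
# Minimal sketch (assumptions, not the WeLM-V4 code): replicate each token's
# embedding n times along the sequence axis so an unchanged backbone spends
# n forward positions of compute per token, then pool the replicas back.
import torch
import torch.nn as nn


class ReplicatedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_replicas: int):
        super().__init__()
        self.n = n_replicas
        # one table per replica is an assumption; sharing a single table is equally plausible
        self.embeds = nn.ModuleList(
            [nn.Embedding(vocab_size, d_model) for _ in range(n_replicas)]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len * n, d_model),
        # with the n copies of each token placed at adjacent positions
        reps = torch.stack([e(token_ids) for e in self.embeds], dim=2)
        b, s, n, d = reps.shape
        return reps.reshape(b, s * n, d)


def pool_replicas(hidden: torch.Tensor, n_replicas: int) -> torch.Tensor:
    # collapse the n positions belonging to each token back to one vector (assumed: mean)
    b, sn, d = hidden.shape
    return hidden.reshape(b, sn // n_replicas, n_replicas, d).mean(dim=2)


if __name__ == "__main__":
    emb = ReplicatedEmbedding(vocab_size=32000, d_model=64, n_replicas=2)
    ids = torch.randint(0, 32000, (1, 8))
    x = emb(ids)             # (1, 16, 64): sequence length doubled before the backbone
    y = pool_replicas(x, 2)  # (1, 8, 64): one vector per original token again
    print(x.shape, y.shape)
```

The expanded sequence would be fed through the backbone in place of the original one; only the embedding and pooling stages change, which is why the Transformer parameter count stays fixed while per-token compute grows with $n$.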
Read more
Published Jan 31, 2026
Taking the previous-generation model WeLM-V3-258B-A22B as an example, this post shares our team's key lessons from the post-training stage.
Read more
Published Jan 21, 2026
In this post, we share our findings on building effective pre-trained LLMs with moderate resources. We show that an 80B-A3B MoE model trained on fewer than 14 trillion tokens performs competitively against similarly sized and larger systems, and that a depth-upscaled 130B variant, given a modest amount of continued training at a small learning rate, delivers a clear performance lift over the 80B baseline.
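The summary mentions depth upscaling only in passing, so the sketch below illustrates the general idea under stated assumptions rather than the procedure used for the 130B model: grow a trained stack by copying existing Transformer blocks in order, then continue training the deeper model at a small learning rate. The growth factor, the duplication pattern, and the toy blocks are all placeholders.

```python
# Minimal sketch of depth upscaling (assumed pattern, not the WeLM recipe):
# build a deeper stack by deep-copying trained blocks, then fine-tune it further.
import copy
import torch.nn as nn


def depth_upscale(blocks: nn.ModuleList, growth: float) -> nn.ModuleList:
    """Return a deeper stack whose blocks are copies of the existing ones, kept in order."""
    target = int(round(len(blocks) * growth))
    new_blocks = []
    for i in range(target):
        # map each new position back onto a source block; some blocks appear twice
        src = blocks[int(i * len(blocks) / target)]
        new_blocks.append(copy.deepcopy(src))
    return nn.ModuleList(new_blocks)


if __name__ == "__main__":
    # toy stand-ins for Transformer blocks; growth=1.5 is purely illustrative
    shallow = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
    deep = depth_upscale(shallow, growth=1.5)
    print(len(shallow), "->", len(deep))  # 4 -> 6
```

Because every new block starts from trained weights, the upscaled model only needs a short continued-training run at a small learning rate to adapt, which is consistent with the "modest amount of training" described above.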
Read more