
In this post, we share our findings on building effective pre-trained LLMs with moderate resources. We show that an 80B-A3B MoE model trained on fewer than 14 trillion tokens delivers competitive performance against similarly sized and larger systems, while a depth‑upscaled 130B variant, continued for a modest amount of training at a small learning rate, delivers a clear performance lift over the 80B baseline.

1. Architecture

The new WeLM series employs a highly sparse MoE architecture (Shazeer et al. 2017; Fedus et al. 2022) designed for computational efficiency during training and inference. The MoE layers integrate loss-free balance routing (DeepSeek-AI et al. 2024) with unnormalized sigmoid gates and one shared expert (Dai et al. 2024). Following the Qwen series (Team 2024), the model adopts a deeper architecture stabilized through the combination of PostNorm and OutputNorm (Zhuo et al. 2025; Ding et al. 2021).

The attention mechanism consists of Grouped-Query Attention (Ainslie et al. 2023) with partial RoPE, triple the standard number of attention heads combined with headwise gating (Qiu et al. 2025), and KNorm for attention logit stabilization (DeepSeek-AI 2024). KV-Mirror shares hidden states between shallow and deep layers for KV activation generation (Sun et al. 2024; Wu et al. 2024), reducing inference costs. The model incorporates a 2-head over-encoding module, which reduces training loss without increasing inference overhead. An additional MoE layer serves as the Multi-Token Prediction (MTP) layer (Gloeckle et al. 2024), enabling speculative decoding during inference (DeepSeek-AI et al. 2024). The 130B model variant is derived from the 80B base through Depth Up-Scaling (Kim et al. 2023), which cyclically duplicates the middle one-third of layers twice.
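As a rough illustration of the layer duplication behind Depth Up-Scaling, the sketch below builds the layer order for a toy 6-layer stack. The function name `depth_upscale`, the block boundaries, and the reading of "cyclically" as repeating the whole block in order (rather than repeating each layer in place) are assumptions made for illustration; the exact boundaries used for the 48-to-78-layer model are not stated in this post.

```python
def depth_upscale(num_layers, start, end, extra_copies=2):
    """Return, for each layer of the up-scaled stack, the index of the
    base-model layer it is copied from.

    Layers [start, end) are kept once and then repeated `extra_copies`
    more times as a whole block (assumed meaning of "cyclically").
    """
    if not 0 <= start < end <= num_layers:
        raise ValueError("block must lie inside the layer stack")
    head = list(range(0, start))
    block = list(range(start, end))
    tail = list(range(end, num_layers))
    return head + block * (1 + extra_copies) + tail


# Toy example: 6 layers, middle third = layers 2 and 3.
toy_order = depth_upscale(num_layers=6, start=2, end=4)
print(toy_order)  # [0, 1, 2, 3, 2, 3, 2, 3, 4, 5]
```

Under this reading the new depth is `num_layers + extra_copies * (end - start)`, so going from 48 to 78 layers would correspond to a duplicated block of 15 layers.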

Figure 1: WeLM Architecture Overview

| Model | WeLM-80B | WeLM-130B |
| --- | --- | --- |
| # Total Parameters | 80B | 130B |
| # Activated Parameters | 3B | 4.9B |
| # Embedding Size | 6.1B | 6.1B |
| # MoE Layers | 48 | 78 |
| # MTP Layers | 1 | 1 |
| Hidden Dim | 2048 | 2048 |
| MoE Intermediate Dim | 512 | 512 |
| Attention Head Dim | 256 | 256 |
| # Attention Heads | 24 | 24 |
| # Key-Value Heads | 2 | 2 |
| # Experts (total) | 512 | 512 |
| # Experts Active Per Token | 10 | 10 |
| # Shared Experts | 1 | 1 |

Table 1: Model architecture of WeLM. Activated and total parameter counts exclude the word embeddings and the output layer.
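The headline parameter counts can be roughly reproduced from the dimensions in Table 1. The back-of-envelope sketch below is not the authors' accounting: it assumes each expert is a bias-free SwiGLU block with three weight matrices, that the 10 active experts are routed experts in addition to the shared one, and it ignores norms, attention gates, the MTP layer, and the over-encoding tables.

```python
HIDDEN = 2048
MOE_INTERMEDIATE = 512
HEAD_DIM = 256
NUM_HEADS = 24
NUM_KV_HEADS = 2
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 10
SHARED_EXPERTS = 1


def per_layer_params():
    """Return (total, activated) weight counts for one attention + MoE layer."""
    expert = 3 * HIDDEN * MOE_INTERMEDIATE  # gate, up, down (assumed SwiGLU)
    attention = (
        HIDDEN * NUM_HEADS * HEAD_DIM            # Q projection
        + 2 * HIDDEN * NUM_KV_HEADS * HEAD_DIM   # K and V projections
        + NUM_HEADS * HEAD_DIM * HIDDEN          # output projection
    )
    router = HIDDEN * NUM_EXPERTS
    total = (NUM_EXPERTS + SHARED_EXPERTS) * expert + attention + router
    active = (ACTIVE_EXPERTS + SHARED_EXPERTS) * expert + attention + router
    return total, active


def model_params(num_layers):
    """Return (total, activated) parameters in billions for a layer count."""
    total, active = per_layer_params()
    return num_layers * total / 1e9, num_layers * active / 1e9


for name, layers in [("WeLM-80B", 48), ("WeLM-130B", 78)]:
    total_b, active_b = model_params(layers)
    print(f"{name}: ~{total_b:.1f}B total, ~{active_b:.2f}B activated")
```

Under these assumptions the estimate comes out at roughly 78.8B total and 3.0B activated for 48 layers, and roughly 128.1B total and 4.9B activated for 78 layers, which is in line with the rounded figures in Table 1.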

Normalization

The model employs partial PostNorm to mitigate representation collapse in deeper layers (Zhuo et al. 2025). Normalization is applied to both the residual and hidden states before attention, while only the input hidden states are normalized before MoE layers. In PostNorm architectures, normalizing the residual causes each layer’s residual norm to become comparable to the MoE and attention outputs, increasing the likelihood that the RMS of the residual sum reaches extremely small values. This phenomenon triggers abnormal gradient amplification during the backward pass of subsequent RMSNorm operations, leading to model collapse (Zhuo et al. 2025). To mitigate this instability, OutputNorm (Ding et al. 2021) is applied to the attention output with small weight initialization, stabilizing training during early stages.
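The gradient amplification mentioned above follows from the fact that RMSNorm divides by the input RMS, so its sensitivity to input perturbations grows as that RMS shrinks. The sketch below is a minimal numerical illustration using a weight-free RMSNorm and a finite-difference probe; the helper names `rms_norm` and `sensitivity` and the toy vector are assumptions for illustration only.

```python
import math


def rms_norm(x, eps=1e-12):
    """Weight-free RMSNorm of a list of floats."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def sensitivity(x, delta=1e-9):
    """Finite-difference norm of d(output)/d(x[0])."""
    base = rms_norm(x)
    bumped = rms_norm([x[0] + delta] + list(x[1:]))
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(base, bumped)))
    return diff / delta


direction = [1.0, -2.0, 3.0, -4.0]
large_input = direction                          # RMS around 2.7
small_input = [v * 1e-3 for v in direction]      # same direction, 1000x smaller RMS

ratio = sensitivity(small_input) / sensitivity(large_input)
print(f"sensitivity grows by about {ratio:.0f}x when the RMS shrinks 1000x")
```

Because the forward output is identical for both inputs, the same upstream gradient is scaled by the inverse of the input RMS on the way back, which is why a near-zero residual sum can destabilize training.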

KV-Mirror

Standard decoder-only LLM architectures (Radford et al. 2019; Brown et al. 2020) require the prefill stage to complete computations across all layers to generate the full KV cache for subsequent decoding. YOCO demonstrates that computational early exit during prefill is achievable through a self-decoder structure where consecutive output layers reuse shallow-layer KV cache, reducing prefill computation by half at arbitrary sequence lengths (Sun et al. 2024). KVSharer further shows that KV cache sharing between attention layers with higher divergence yields improved performance (Wu et al. 2024).

Building on these insights, WeLM implements a U-shaped KV sharing strategy: the first third of layers mirror into the last third (including the MTP layer). Instead of directly reusing the cached K/V tensors, our variant reuses the pre-projection hidden states from the mirrored layer and recomputes K/V with the target layer’s projection. This modification allows the mirror layer to represent distinct semantics while simultaneously preserving the early exit capability of the KV cache method, leading to superior performance in our benchmarks compared to direct K/V reuse.

Concretely, the mirror layer i loads the hidden states that the imitated layer j saved before its own K/V projection, projects them with its own weights W_i^{k,v}, and uses the result in place of the K/V portion of its own QKV projection:

# Imitated layer j
save_hidden(j, h_j)

# Mirror layer i
K_i, V_i = W_i^{k,v} * h_j
# Q_i computed as usual (omitted)

This approach reduces inference costs while suppressing attention-sink phenomena in deeper layers. However, we find that incorporating shallow-layer hidden states into the Q-projection inputs does not yield observable improvements.
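A runnable toy version of the KV-Mirror pseudocode might look like the following. It is a sketch under simplifying assumptions: K and V are folded into a single projection matrix, hidden states are plain nested lists, and the names `HiddenStateCache`, `source_layer_kv`, and `mirror_layer_kv` are hypothetical rather than taken from the WeLM codebase.

```python
def project(hidden, weight):
    """Multiply [tokens][d_model] hidden states by a [d_model][d_out] matrix."""
    d_model, d_out = len(weight), len(weight[0])
    return [
        [sum(h[k] * weight[k][o] for k in range(d_model)) for o in range(d_out)]
        for h in hidden
    ]


class HiddenStateCache:
    """Stores pre-projection hidden states of imitated (shallow) layers."""

    def __init__(self):
        self._saved = {}

    def save(self, layer, hidden):
        self._saved[layer] = hidden

    def load(self, layer):
        return self._saved[layer]


def source_layer_kv(cache, j, h_j, w_kv_j):
    cache.save(j, h_j)            # keep h_j for the mirror layer
    return project(h_j, w_kv_j)   # the imitated layer's own K/V


def mirror_layer_kv(cache, j, w_kv_i):
    # K_i, V_i = W_i^{k,v} h_j: reuse hidden states, not the cached K/V.
    return project(cache.load(j), w_kv_i)


cache = HiddenStateCache()
h_j = [[1.0, 2.0], [0.5, -1.0]]       # 2 tokens, d_model = 2 (toy values)
w_kv_j = [[1.0, 0.0], [0.0, 1.0]]     # imitated layer projection
w_kv_i = [[0.0, 2.0], [1.0, 0.0]]     # mirror layer projection

kv_j = source_layer_kv(cache, 0, h_j, w_kv_j)
kv_i = mirror_layer_kv(cache, 0, w_kv_i)
print(kv_j, kv_i)
```

Direct K/V reuse would force `kv_i` to equal `kv_j`; projecting the shared hidden states with the mirror layer's own weights lets the two differ while still requiring only the shallow layers' activations.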

Over-encoding

Over-encoding (Huang et al. 2025) demonstrates that input vocabulary expansion yields consistent performance improvements while constraining total memory access during decoding. WeLM adopts a multi-head hashing mechanism (Svenstrup et al. 2017) to mitigate hash collisions in n-gram settings, enabling effective performance under relatively small vocabulary size constraints. The technique shows enhanced effectiveness on architectures with cross-layer activation sharing, such as KV-Mirror.
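To make the multi-head hashing idea concrete, here is a minimal sketch of a hashed n-gram embedding with two hash heads. Everything specific in it is an assumption for illustration: the polynomial hash, the seeds, the table size, the use of bigrams, and the choice to sum the per-head vectors are not details given in this post.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1


def ngram_bucket(ngram, seed, table_size):
    """Polynomial hash of a tuple of token ids into [0, table_size)."""
    h = seed
    for token_id in ngram:
        h = (h * 1_000_003 + token_id + 1) % MERSENNE_PRIME
    return h % table_size


class HashedNgramEmbedding:
    def __init__(self, table_size, dim, n=2, seeds=(17, 7919), rng_seed=0):
        rng = random.Random(rng_seed)
        self.n = n
        self.seeds = seeds
        self.table_size = table_size
        self.dim = dim
        # One embedding table per hash head.
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]
            for _ in seeds
        ]

    def buckets(self, ngram):
        """Return one bucket index per hash head for this n-gram."""
        return tuple(ngram_bucket(ngram, s, self.table_size) for s in self.seeds)

    def embed(self, token_ids):
        """Return one summed n-gram vector per position (left-padded with -1)."""
        padded = [-1] * (self.n - 1) + list(token_ids)
        out = []
        for pos in range(len(token_ids)):
            ngram = tuple(padded[pos : pos + self.n])
            vec = [0.0] * self.dim
            for table, bucket in zip(self.tables, self.buckets(ngram)):
                vec = [a + b for a, b in zip(vec, table[bucket])]
            out.append(vec)
        return out


embedding = HashedNgramEmbedding(table_size=64, dim=4)
vectors = embedding.embed([5, 9, 5, 9])
```

Two distinct n-grams only receive the same combined vector if they collide in every head, which is the intuition behind using several hash heads with a comparatively small table. In a full model these vectors would typically be combined with the regular token embedding.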

Experiments

To evaluate the individual and cumulative impact of these architectural refinements, we conduct a series of ablation studies using a scaled-down MoE configuration (8.7B total parameters, 0.7B activated parameters). The results, summarized in Table 2, demonstrate consistent improvements across both English and Chinese benchmarks as each component is integrated.

Ablation results

28-layer, 8.7B parameters, 0.7B activated, 377B tokens, AdamW, batch size = 1024

| Method | Max LR | MMLU | MATH | C-Eval | CMMLU |
| --- | --- | --- | --- | --- | --- |
| (1) Baseline | 8.0 x 10^-4 | 49.37 | 13.96 | 54.83 | 52.90 |
| (2) +KV-Mirror | 8.0 x 10^-4 | 53.29 | 16.8 | 56.82 | 58.06 |
| (3) +PostNorm and OutputNorm | 8.0 x 10^-4 | 55.32 | 17.38 | 58.78 | 60.83 |
| (4) +Over-encoding 3M | 8.0 x 10^-4 | 56.69 | 18.74 | 63.37 | 64.40 |

Table 2: Ablation results on 8.7B configuration.

To further verify the scalability of these modifications, we scaled the experiment to a 30.7B model with 3B activated parameters. As shown in Table 3, the proposed architecture improves every reported benchmark across language understanding, reasoning, and coding tasks, with gains ranging from marginal on SimpleQA (4.32 to 4.34) to 4 to 5 points on BBH and MBPP+. This broader evaluation suggests that the benefits observed in smaller-scale experiments carry over to larger model variants.

| Benchmark (Metric) | # Shots | 30B Baseline | 30B Proposed |
| --- | --- | --- | --- |
| # Activated Parameters | - | 3B | 3B |
| # Total Parameters | - | 30.7B | 30.7B |
| # Embedding Size | - | 0.32B | 3.39B |
| # Training Tokens | - | 1.6T | 1.6T |
| Pile-test (BPB) | - | 0.43 | 0.41 |
| BBH (EM) | 3-shot | 65.99 | 70.43 |
| MMLU (EM) | 5-shot | 72.37 | 74.93 |
| CMMLU (EM) | 5-shot | 75.9 | 78.76 |
| C-Eval (EM) | 5-shot | 73.78 | 77.13 |
| SimpleQA (EM) | 5-shot | 4.32 | 4.34 |
| Chinese SimpleQA (EM) | 5-shot | 35.18 | 36.85 |
| HumanEval+ (Pass@1) | 1-shot | 34.8 | 36.6 |
| MBPP+ (Pass@1) | 1-shot | 54.4 | 59.4 |
| GSM8K (EM) | 4-shot | 85.29 | 87.72 |
| MATH (EM) | 4-shot | 38.54 | 39.62 |

Table 3: Ablation results on 30B configuration.

2. Dataset

The training corpus covers 60+ languages, features extensive quality filtering, and incorporates modern LLMs into the preprocessing pipeline.

LLM Quality Grader: We employ an LLM-based quality grader with a suite of engineered prompts to evaluate content quality across diverse sources. By scoring a representative fraction of samples from each data stream, we establish a robust quality metric that guides our filtering thresholds. We find this approach efficient: its scores correlate strongly with those of more computationally expensive methods, such as purpose-trained classifiers, while allowing faster iteration on new data sources.

OCR Pipeline: We find that incorporating a Vision-Language Model (VLM) for document OCR is an effective approach, as it achieves higher extraction accuracy and enables faster iteration than traditional methods. We use an in-house version of POINTS-Reader (Liu et al. 2025) to convert raw PDF documents into a clean textual format.

Encyclopedic Data Curation: To enhance the model’s factual knowledge and ensure semantic diversity, we employ multiple pipeline variants for processing encyclopedic data. By maintaining a curated mixture of outputs from these distinct pipelines, we maximize the breadth of information while mitigating the systematic biases that any single processing methodology might introduce.

3. Training

The pre-training process is governed by several key optimization and scheduling choices:

  • Optimization: We employ the Muon optimizer (Jordan et al., n.d.) with a momentum coefficient of 0.95, 5 Newton-Schulz iteration steps, and a weight decay coefficient of 0.1. We use a cosine learning rate schedule for the first stage, with a maximum learning rate of 5e-4 and an end learning rate of 5e-5.
  • Training Schedule: The 80B model is pre-trained on 11T tokens in the primary stage, followed by an annealing phase on 1.4T high-quality tokens. A final stage of context extension on 1.3T tokens is performed to achieve a 128K context length.
  • Model Scaling: The 130B variant is initialized from an 80B checkpoint that has completed the 11T-token primary training stage. After Depth Up-Scaling, the 130B model follows a similar progression of annealing and context extension as the base 80B model.
  • Auxiliary Loss: All models integrate a Multi-Token Prediction (MTP) layer with the loss weight fixed at 0.3.
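The schedule and loss weighting listed above can be sketched as follows. This is a minimal illustration using the stated hyperparameters (maximum learning rate 5e-4, end learning rate 5e-5, MTP loss weight 0.3); the absence of a warmup phase, the step-based parameterization, and the additive form of the combined loss are assumptions, as are the function names.

```python
import math

MAX_LR = 5e-4
END_LR = 5e-5
MTP_LOSS_WEIGHT = 0.3


def cosine_lr(step, total_steps, max_lr=MAX_LR, end_lr=END_LR):
    """Cosine decay from max_lr at step 0 to end_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end_lr + 0.5 * (max_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


def total_loss(next_token_loss, mtp_loss, mtp_weight=MTP_LOSS_WEIGHT):
    """Main next-token loss plus the weighted multi-token-prediction loss."""
    return next_token_loss + mtp_weight * mtp_loss


# Token budget of the 80B run, in trillions: primary + annealing + context extension.
TOKENS_80B = 11.0 + 1.4 + 1.3

for step in (0, 500, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The three stages add up to 13.7T tokens, consistent with the "fewer than 14 trillion tokens" figure quoted in the introduction.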

4. Infrastructure

The WeLM series is trained on an internally developed framework optimized for large-scale sparse models. To maintain high throughput and manage memory constraints on moderate cluster sizes, we implement several specialized communication and memory management strategies:

  • DualPipeV and DeepEP: To address communication bottlenecks in expert parallelism, we utilize DualPipeV and DeepEP (DeepSeek-AI et al. 2024). These techniques overlap computation with communication, effectively eliminating bubbles introduced by expert parallelism communication.

  • Kernel Fusion: With the increasing computational power of GPUs, CPU overhead becomes a significant bottleneck. We implement a Grouped GEMM CUDA kernel to consolidate matrix multiplications across all experts, mitigating kernel launch overhead. Furthermore, we fuse lightweight operators, such as Norm and RoPE, to reduce memory access latency and CPU-side dispatch costs.

  • Activation Memory Optimization: Increasing the Micro Batch Size (MBS) to improve GPU utilization is often limited by activation memory overhead. We address this through:

    • Selective Quantization and Recomputation: We maintain BF16 precision as the primary format while selectively applying FP8 quantization or recomputation. Activation checkpoints are strategically placed at layers like QKV projections. To mitigate the memory overhead from top-k hidden state copies in MoE blocks, we implement recomputation for the permutation stage, triggered only during the backward-for-weights phase. Additionally, we apply tile-wise quantization to SwiGLU activations.
    • Fused SwiGLU and MulProbs: We optimize the unpermuting calculation order by moving it after the MoE SwiGLU calculation and fusing them into a single kernel. This allows the tensor to be released and recomputed during the backward pass, significantly reducing activation memory overhead.
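For readers unfamiliar with the permutation terminology, the toy sketch below shows the data flow that these optimizations target: tokens are copied once per selected expert and grouped by expert (permute), each expert processes its group, the outputs are scaled by the gate probabilities, and the results are scattered back to token order (unpermute). It operates on scalars in pure Python and is only a conceptual model; it is not the fused CUDA kernel, and the function names are hypothetical.

```python
from itertools import groupby


def moe_forward_grouped(tokens, routes, experts):
    """Toy MoE layer on scalar "hidden states".

    tokens:  list of floats
    routes:  routes[t] is a list of (expert_id, gate_prob) top-k choices
    experts: list of callables, one per expert
    """
    # Permute: one copy of each token per selected expert, sorted by expert.
    permuted = sorted(
        (expert_id, t, prob)
        for t, choices in enumerate(routes)
        for expert_id, prob in choices
    )
    out = [0.0] * len(tokens)
    # Grouped compute: each expert sees its tokens as one contiguous batch.
    for expert_id, group in groupby(permuted, key=lambda item: item[0]):
        for _, t, prob in group:
            activated = experts[expert_id](tokens[t])
            # Multiply by the gate probability, then unpermute (scatter-add).
            out[t] += prob * activated
    return out


def moe_forward_naive(tokens, routes, experts):
    """Reference implementation: loop over tokens instead of experts."""
    return [
        sum(prob * experts[e](x) for e, prob in choices)
        for x, choices in zip(tokens, routes)
    ]


toy_experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
toy_tokens = [1.0, 2.0, 3.0]
toy_routes = [[(0, 0.5), (2, 0.5)], [(1, 1.0)], [(2, 0.25), (0, 0.75)]]

grouped = moe_forward_grouped(toy_tokens, toy_routes, toy_experts)
naive = moe_forward_naive(toy_tokens, toy_routes, toy_experts)
```

The permuted list holds top-k copies of every token, which is the activation memory that recomputation of the permutation stage avoids keeping alive.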

5. Evaluation

Public Benchmarks

We evaluate the two models on a comprehensive suite of benchmarks to assess performance across multiple dimensions, including reasoning, multilingual proficiency, and long-context handling. The public benchmarks used for this evaluation are listed in Table 4.

All open-source models in the table are self-hosted and evaluated with the same pipeline to ensure a controlled comparison. The two Qwen models (Qwen3-32B and Qwen3-Next 80B-A3B) are exceptions: because their weights or APIs are not publicly accessible, we do not re-evaluate them and instead take their benchmark numbers directly from the Qwen3 Technical Report and Qwen’s release blog (Yang et al. 2025; Qwen Team, n.d.).

| Type | Benchmark | WeLM 80B-A3B | WeLM 130B-A4.9B | GLM-4.5 Air | Qwen3-Next 80B-A3B | Qwen3 32B | Qwen3 30B-A3B | Nemotron3 Nano 30B-A3B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| English | MMLU | 85.65 | 86.91 | 80.61 | 84.72 | 83.61 | 81.09 | 76.91 |
| | MMLU-Redux | 84.25 | 86.12 | 80.72 | 83.8 | 83.41 | 80.75 | 74.96 |
| | MMLU-Redux2.0 | 87.77 | 89.47 | 84.94 | - | - | 85.21 | 80.46 |
| | MMLU-Pro | 65.77 | 67.79 | 55.15 | 66.05 | 65.54 | 60.99 | 60.97 |
| | GPQA | 45.45 | 46.46 | 36.87 | 43.43 | 49.49 | 36.87 | 37.88 |
| | SuperGPQA | 43.31 | 44.41 | 34.94 | 41.52 | 39.78 | 35.7 | 31.51 |
| | BBH | 85.47 | 88.05 | 82.28 | 87.13 | 87.38 | 81.15 | 78.77 |
| | BBH-Fix | 87.9 | 90.4 | 84.75 | - | - | 81.37 | 80.16 |
| | KOR-Bench | 42.56 | 45.92 | 42.72 | - | - | 44.56 | 35.52 |
| | MATH | 60.04 | 58.78 | 65.63 | 62.36 | 61.62 | 60.16 | 65.48 |
| | MATH-New-CoT | 85.66 | 85.06 | 81.48 | - | - | 83.08 | 81.5 |
| | GSM8K | 87.72 | 90.9 | 86.2 | 90.3 | 93.4 | 89.99 | 87.57 |
| | GSM8K-New-CoT | 96.66 | 95.57 | 96.58 | - | - | 96.58 | 93.71 |
| | SimpleQA | 17.18 | 18.1 | 15.02 | - | - | 6.93 | 9.41 |
| | AA-Omniscience | 21.36 | 23.73 | 23.73 | - | - | 17.97 | 17.8 |
| Chinese | C-Eval | 89.23 | 90.42 | 82.99 | - | - | 88.22 | 63.3 |
| | CMMLU | 89.9 | 90.11 | 83.91 | - | - | 86.47 | 64.45 |
| | C-SimpleQA | 61.04 | 63.58 | 63.58 | - | - | 51.37 | 32.32 |
| Multilingual | MMMLU | 71.7 | 74.33 | 64.11 | - | - | 66.36 | 36.91 |
| | INCLUDE | 72.9 | 76.42 | 67.26 | 69.79 | 67.87 | 67.65 | 43.82 |
| | Global-MMLU | 78.94 | 80.92 | 72.24 | - | - | 69.87 | 63.42 |
| Code | EvalPlus | 72.6 | 71.13 | 78.5 | 72.89 | 72.05 | 72.63 | 78 |
| | MultiPL-E | 65.37 | 70.48 | 53.51 | - | 67.06 | 65.89 | 63.41 |
| | CRUX-Input | 68.8 | 65.1 | 62.3 | - | - | 59.4 | 62.75 |
| | CRUX-Output | 86.6 | 81.1 | 66.75 | 74.25 | 72.5 | 66.1 | 63.5 |
| Long Context | RULER-4k | 96.47 | 96.64 | 96.61 | - | - | 94.65 | 95.1 |
| | RULER-8k | 94.90 | 96.44 | 95.26 | - | - | 92.27 | 93.75 |
| | RULER-16k | 95.11 | 96.51 | 93.98 | - | - | 90.27 | 92.58 |
| | RULER-32k | 94.69 | 96.24 | 92.09 | - | - | 87.36 | 89.22 |
| | RULER-64k | 92.13 | 94.96 | 84.02 | - | - | - | 86.49 |
| | RULER-128k | 88.09 | 91.88 | 62.29 | - | - | - | 81.65 |
| | RULER-avg | 93.57 | 95.45 | 87.38 | - | - | - | 89.80 |
| | LongBenchV2-32k | 48.31 | 47.46 | 30.51 | - | - | 38.98 | 33.05 |
| | LongBenchV2-128k | 34 | 35 | 25.5 | - | - | - | 30.5 |
| | LongBenchV2-avg | 41.16 | 41.23 | 28.01 | - | - | - | 31.78 |
| | MRCRv2 (<32k) | 37.72 | 47.55 | 34.8 | - | - | 33.77 | 29.37 |
| | MRCRv2 (<128k) | 33.42 | 42.15 | 30.08 | - | - | - | 24.58 |
| | MTOB | 57.45 | 64.75 | 43.15 | - | - | - | 33.65 |

Table 4: Comparison of WeLM with representative open-source models.

The WeLM variants exhibit competitive performance across the evaluated benchmarks. Several key observations from these results are detailed below:

  • Effectiveness of Depth Up-Scaling: Although the 130B model was initialized from a late-stage 80B checkpoint and trained with a reduced learning rate schedule, it achieves significant performance gains across most benchmarks. These improvements are particularly pronounced in tasks involving complex reasoning and broad factual knowledge.

  • Impact of LLM-Generated CoT Exemplars: Traditional math benchmarks often use human-labeled short answers as few-shot exemplars, which may not fully elicit the latent reasoning capabilities of modern pre-trained models. By introducing GSM8K-New-CoT and MATH-New-CoT, variants that use LLM-generated Chain-of-Thought (CoT) solutions as exemplars, we observe substantial performance improvements across all evaluated models: in Table 4 the gains are roughly 5 to 10 absolute points on GSM8K and 16 to 26 points on MATH. These results suggest that evaluation methodologies must evolve alongside model capabilities to accurately capture advancements in reasoning proficiency.
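The size of the CoT-exemplar effect can be read directly off Table 4. The short script below copies the GSM8K and MATH rows for the five models that have both variants reported and computes the per-model gain; only the helper name `gains` is introduced here.

```python
# (original few-shot score, New-CoT score) per model, copied from Table 4.
GSM8K = {
    "WeLM 80B-A3B": (87.72, 96.66),
    "WeLM 130B-A4.9B": (90.9, 95.57),
    "GLM-4.5 Air": (86.2, 96.58),
    "Qwen3 30B-A3B": (89.99, 96.58),
    "Nemotron3 Nano 30B-A3B": (87.57, 93.71),
}
MATH = {
    "WeLM 80B-A3B": (60.04, 85.66),
    "WeLM 130B-A4.9B": (58.78, 85.06),
    "GLM-4.5 Air": (65.63, 81.48),
    "Qwen3 30B-A3B": (60.16, 83.08),
    "Nemotron3 Nano 30B-A3B": (65.48, 81.5),
}


def gains(scores):
    """Absolute improvement of the New-CoT variant over the original."""
    return {model: round(new - old, 2) for model, (old, new) in scores.items()}


gsm8k_gains = gains(GSM8K)
math_gains = gains(MATH)

for name, table in [("GSM8K", gsm8k_gains), ("MATH", math_gains)]:
    print(name, "min gain:", min(table.values()), "max gain:", max(table.values()))
```

The gains span about 4.7 to 10.4 points on GSM8K and about 15.9 to 26.3 points on MATH, so the choice of exemplars moves scores more than most of the model-to-model differences in the original columns.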

Internal Benchmarks

To further analyze model performance on contemporary data and more challenging problem sets, we developed several internal benchmarks. These datasets are designed to reduce the risk of data contamination and provide a more granular assessment of model capabilities:

  • GPQA-Bench (5-shot): A graduate-level reasoning benchmark consisting of 1.1K original problems across multiple disciplines. Each item is structured as a five-choice multiple-choice question.
  • Web-SimpleQA & Web-CSimpleQA (15-shot, LLM judge): Utilizing an internal deep-research pipeline, we constructed these benchmarks to evaluate world knowledge mastery in English and Chinese, respectively. These datasets feature significantly broader domain coverage compared to the original SimpleQA.
  • Web-MMLU (5-shot): A multiple-choice benchmark similar to MMLU but featuring entirely distinct topics and questions generated through our internal research pipeline.
  • WMT-Bench (5-shot): A machine translation evaluation set derived from the WMT24-pp corpus. We utilize the XCOMET-XXL metric for automated translation scoring.

| Benchmark | GLM-4.5 Air | Qwen3 30B-A3B | Nemotron3 Nano 30B-A3B | WeLM 80B-A3B | WeLM 130B-A4.9B |
| --- | --- | --- | --- | --- | --- |
| GPQA-Bench | 61.57 | 61.66 | 55.74 | 65.57 | 68.21 |
| Web-SimpleQA | 46.75 | 35.93 | 42.17 | 45.09 | 46.67 |
| Web-CSimpleQA | 47.03 | 38.98 | 32.55 | 47.44 | 49.83 |
| Web-MMLU | 83.56 | 84.68 | 86.14 | 89.47 | 90.17 |

Table 5: Performance on internal knowledge and reasoning benchmarks.

Figure 2: Translation performance on WMT-Bench

Entropy Decrease of the Depth Up-Scaling Model

The depth up-scaled model often exhibits different output patterns from the base checkpoint in practice. To quantify this effect, we compare token-level output entropy across domains for the 80B-A3B and 130B-A4.9B variants, as summarized below.

The 130B-A4.9B model shows lower mean entropy in English, code, and math, while Chinese is slightly higher. This suggests reduced output diversity in most domains, with only modest changes in variance. We leave a deeper analysis of this behavior to future work.

| Domain | Model | Avg. Entropy | Entropy Std. |
| --- | --- | --- | --- |
| Chinese | 130B-A4.9B | 0.8162 | 0.4497 |
| | 80B-A3B | 0.8009 | 0.4369 |
| English | 130B-A4.9B | 0.9093 | 0.4666 |
| | 80B-A3B | 0.9882 | 0.4605 |
| Code | 130B-A4.9B | 0.6196 | 0.2390 |
| | 80B-A3B | 0.6417 | 0.2426 |
| Math | 130B-A4.9B | 0.2841 | 0.1894 |
| | 80B-A3B | 0.2983 | 0.2117 |

Table 6: Comparison of model output entropy in different domains.
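For reference, token-level output entropy of the kind reported in Table 6 can be computed from next-token logits as sketched below. The post does not specify the measurement details, so the use of natural logarithms (nats), the population standard deviation, and the function names are assumptions.

```python
import math
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_stats(per_token_logits):
    """Mean and population standard deviation of per-token entropies."""
    entropies = [token_entropy(logits) for logits in per_token_logits]
    return statistics.fmean(entropies), statistics.pstdev(entropies)


# Toy example: one uniform (maximally uncertain) step, one near-deterministic step.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]]
mean_entropy, std_entropy = entropy_stats(toy_logits)
```

A lower mean indicates that the model concentrates more probability mass on fewer candidate tokens at each step, which is the sense in which the 130B-A4.9B model shows reduced output diversity in English, code, and math.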

6. Conclusion and Future Outlook

In this post, we have detailed the architectural refinements and data curation strategies that enable the development of high-performance LLMs within moderate computational constraints. Our findings demonstrate that the combination of targeted structural optimizations—such as KV-Mirror and stabilized normalization—and rigorous, LLM-guided data quality control can significantly narrow the performance gap between moderately sized models and their larger, more computationally expensive counterparts. We believe these methodologies offer a practical roadmap for developing efficient, capable models without requiring extreme-scale resources. We will share more insights in future posts. Stay tuned.

Appendix

Empirically, we observe that high-quality pre-trained models often perform well on open-ended and out-of-domain writing tasks, reflecting robust generalization. We built an internal Write-Bench to probe this and present sample outputs from the WeLM-80B model below.

Output Cases

Output Case 1: Classical Chinese prose
Prompt
以华山为主题,写一篇古文。输出格式:
<文章开始>
...
<文章结束>

<文章开始>
Response
太华之巔,星斗垂光,孤峰接汉。其势若巨鳌戴云,鲸吸玄溟。古柏虬结于危崖,悬泉迸珠于绝壑。晨观五色霞霓,暮聆万壑松涛。登顶者,仰见帝座,俯吞八荒,顿觉天壤为之一空。  

忆昔女娲断鳌立极,始有坤轴之隆;轩辕铸鼎觞天,常驾苍龙之驾。秦皇跃马巡疆,勒石以铭功德;汉武筑台封禅,燃薪而见灵光。迨至唐李翰林,以剑决浮云,诗锋裂苍冥;宋范希文,忧乐关天下,墨痕留玉泉。  

然则仙踪杳渺,空余采药人迹;劫火澹然,徒存伏羲画卦。独千仞青翠,送风送月;一泓凝碧,鉴古鉴今。是知山不在高,有仙则名;峰不在险,有志则尊。  

嗟乎!游子临渊,欲掬星而不可得;志士怀远,望云涛而浩然生。纵使苍颜白发,犹乘鹤陟层巅;即令海竭江枯,永镇神州志气。

<文章结束>

Output Case 2: English poem
Prompt
Write a poem about vibe coding.
Output format:
<text>
...
</text>

<text>
Response
VIBE CODING  

The cursor blinks—a silent drum,  
Averse to rules I never come  
To learn, but follow instinct's hum—  
Vibe coding, a frantic *ehem*  

No tutorials to trace or grind,  
No flags to catch the faulty kind.  
Just run, observe, and stumble blind,  
Where compilers used to shake in mind.  

ChatGPT speaks in seraph tones,  
While Gemini with midnight moans  
Suggests fixes that slice and groans  
Through logic’s fragile, trembling bones.  

I *prompt* a thing that *looks* like code—  
A fragile, formless, flowery load.  
It *runs*! With errors splayed abroad,  
Like ships that have lost every mode.  

The AI hushes like a monk,  
Then spills a daemon in the tank.  
I jiggle brackets, curse, then—clank—  
A flicker, then the whole thing *whack*!  

No need to read what’s on the screen;  
Just *feel* the flow—a fever, clean.  
The stack overflows in neon green,  
Like digital ice on summer’s keen.  

We don’t debug; we *are* the bug.  
We dance through errors in a hug  
Of chaos—fed by caffeine’s tug—  
And pray the latency won’t shrug  

Us off into the void of servers,  
Where ghosts of failed APIs sever  
The tether. Oh, the sweet convergences  
When source code sings through strange urgences!  

We trust the magic, lean it hard—  
For logic’s kiss is not its card.  
The code? It works. It has the *lard*  
Of wonder, chaos, and some shard  
Of grace. The machine learns what we dare—  
And spits back verse in electric air.  

—after midnight, the monitor’s sigh,  
A code poem that will never die.
</text>

References

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv Preprint arXiv:2305.13245.

Bai, Yushi, Shangqing Tu, Jiajie Zhang, et al. 2025. “Longbench V2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 3639–64.

Brown, Tom, Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems (NeurIPS) 33: 1877–901.

Cassano, Federico, John Gouwar, Daniel Nguyen, et al. 2023. “Multipl-e: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation.” IEEE Transactions on Software Engineering 49 (7): 3675–91.

Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021. “Training Verifiers to Solve Math Word Problems.” arXiv Preprint arXiv:2110.14168.

Dai, Damai, Chengqi Deng, Chenggang Zhao, et al. 2024. “Deepseekmoe: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models.” arXiv Preprint arXiv:2401.06066.

DeepSeek-AI. 2024. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv Preprint arXiv:2405.04434.

DeepSeek-AI et al. 2024. “DeepSeek-V3 Technical Report.” arXiv Preprint arXiv:2412.19437.

Ding, Ming, Zhuoyi Yang, Wenyi Hong, et al. 2021. “Cogview: Mastering Text-to-Image Generation via Transformers.” NeurIPS.

Du, Xinrun, Yifan Yao, Kaijing Ma, et al. 2025. “Supergpqa: Scaling Llm Evaluation Across 285 Graduate Disciplines.” arXiv Preprint arXiv:2502.14739.

Fedus, William, Barret Zoph, and Noam Shazeer. 2022. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” The Journal of Machine Learning Research 23 (1): 5232–70.

Gema, Aryo Pradipta, Joshua Ong Jun Leang, Giwon Hong, et al. 2025. “Are We Done with Mmlu?” Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 5069–96.

Gloeckle, Fabian, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. 2024. “Better & Faster Large Language Models via Multi-Token Prediction.” arXiv Preprint arXiv:2404.19737.

Gu, Alex, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. “Cruxeval: A Benchmark for Code Reasoning, Understanding and Execution.” arXiv Preprint arXiv:2401.03065.

He, Yancheng, Shilong Li, Jiaheng Liu, et al. 2025. “Chinese Simpleqa: A Chinese Factuality Evaluation for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 19182–208.

Hendrycks, Dan, Collin Burns, Steven Basart, et al. 2021. “Measuring Massive Multitask Language Understanding.” International Conference on Learning Representations.

Hendrycks, Dan, Collin Burns, Saurav Kadavath, et al. 2021. “Measuring Mathematical Problem Solving with the Math Dataset.” arXiv Preprint arXiv:2103.03874.

Hsieh, Cheng-Ping, Simeng Sun, Samuel Kriman, et al. 2024. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” arXiv Preprint arXiv:2404.06654.

Huang, Hongzhi, Defa Zhu, Banggu Wu, et al. 2025. “Over-Tokenized Transformer: Vocabulary Is Generally Worth Scaling.” arXiv Preprint arXiv:2501.16975.

Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, et al. 2023. “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models.” Advances in Neural Information Processing Systems 36: 62991–3010.

Jackson, Declan, William Keating, George Cameron, and Micah Hill-Smith. 2025. “AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models.” arXiv Preprint arXiv:2511.13029.

Jordan, Keller, Yuchen Jin, Vlado Boza, et al. n.d. “Muon: An Optimizer for Hidden Layers in Neural Networks, 2024.” https://kellerjordan.github.io/posts/muon/.

Kim, Dahyun, Chanjun Park, Sangdoo Kim, Wonsung Lee, Sunghyun Kim, and Yungi Ahn. 2023. “Solar 10.7b: Scaling Large Language Models with Simple Depth up-Scaling.” arXiv Preprint arXiv:2312.15166.

Li, Haonan, Yixuan Zhang, Fajri Koto, et al. 2024. “Cmmlu: Measuring Massive Multitask Language Understanding in Chinese.” Findings of the Association for Computational Linguistics: ACL 2024, 11260–85.

Liu, Jiawei, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. “Is Your Code Generated by Chatgpt Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.” Advances in Neural Information Processing Systems 36: 21558–72.

Liu, Yuan, Zhongyin Zhao, Le Tian, et al. 2025. “POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion.” Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 1576–601.

Ma, Kaijing, Xinrun Du, Yunran Wang, et al. 2024. “Kor-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks.” arXiv Preprint arXiv:2410.06526.

OpenAI. 2024. Multilingual Massive Multitask Language Understanding (MMMLU). https://huggingface.co/datasets/openai/MMMLU.

Qiu, Zihan et al. 2025. “Gated Attention for Large Language Models: Non-Linearity, Sparsity, and Attention-Sink-Free.” arXiv Preprint arXiv:2505.06708. https://arxiv.org/abs/2505.06708.

Qwen Team. n.d. Qwen3-Next 80B-A3B Release Blog. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd.

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9.

Rein, David, Betty Li Hou, Asa Cooper Stickland, et al. 2024. “Gpqa: A Graduate-Level Google-Proof q&a Benchmark.” First Conference on Language Modeling.

Romanou, Angelika, Negar Foroutan, Anna Sotnikova, et al. 2024. “Include: Evaluating Multilingual Language Understanding with Regional Knowledge.” arXiv Preprint arXiv:2411.19799.

Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv Preprint arXiv:1701.06538.

Singh, Shivalika, Angelika Romanou, Clémentine Fourrier, et al. 2025. “Global Mmlu: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 18761–99.

Sun, Yutao, Li Wang, Yaru Cui, et al. 2024. “You Only Cache Once: Decoder-Decoder Architectures for Language Models.” arXiv Preprint arXiv:2405.05254.

Suzgun, Mirac, Nathan Scales, Nathanael Schärli, et al. 2023. “Challenging Big-Bench Tasks and Whether Chain-of-Thought Can Solve Them.” Findings of the Association for Computational Linguistics: ACL 2023, 13003–51.

Svenstrup, Dan T, Jonas Hansen, and Ole Winther. 2017. “Hash Embeddings for Efficient Word Representations.” Advances in Neural Information Processing Systems (NeurIPS) 30.

Tanzer, Garrett, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. 2023. “A Benchmark for Learning to Translate a New Language from One Grammar Book.” arXiv Preprint arXiv:2309.16575.

Team, Qwen. 2024. “Qwen2.5 Technical Report.” arXiv Preprint arXiv:2412.15115.

Vodrahalli, Kiran, Santiago Ontanon, Nilesh Tripuraneni, et al. 2024. “Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.” arXiv Preprint arXiv:2409.12640.

Wang, Yubo, Xueguang Ma, Ge Zhang, et al. 2024. “Mmlu-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.” Advances in Neural Information Processing Systems 37: 95266–90.

Wei, Jason, Nguyen Karina, Hyung Won Chung, et al. 2024. “Measuring Short-Form Factuality in Large Language Models.” arXiv Preprint arXiv:2411.04368.

Wu, Yifan, Jinda Li, Hantian Mao, et al. 2024. “KVSharer: Efficient Inference via Layer-Wise Dissimilarity-Based KV Cache Sharing.” arXiv Preprint arXiv:2410.16584.

Yang, An, Anfeng Li, Baosong Yang, et al. 2025. Qwen3 Technical Report. https://arxiv.org/abs/2505.09388.

Zhuo, Zhijian, Yutao Zeng, Ya Wang, et al. 2025. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization.