In the previous blog post, we used WeLM-V4-80-A3B as an example to introduce our team’s recent experience in the pre-training stage.

This post uses an earlier-generation base model, WeLM-V3-258B-A22B MoE, as an example to share our experience in the post-training stage.

The post-training of WeLM-V3-258B-A22B MoE consists of a cold-start SFT stage followed by an RL training stage. In the cold-start SFT stage, the model is trained on diverse, high-quality instruction data to establish an initial ability to follow human instructions and generate coherent responses. In the RL training stage, we combine RL verifiers, reward models, and policy optimization algorithms to further improve the model’s reasoning and helpfulness. The entire post-training process emphasizes data quality, training stability, and multi-objective optimization, and the resulting WeLM-V3-258B-A22B MoE model is competitive across scenarios such as mathematics, logical reasoning, knowledge, Q&A, instruction following, multi-turn dialogue, and role-playing.

1. Pre-training

Compared to the latest WeLM-V4 models, WeLM-V3-258B-A22B retains a more traditional architecture and integrates only a few newer features, such as Key-Norm (normalization applied after the attention key projection) and half RoPE (relative position encoding applied to only part of the dimensions), so its architectural efficiency is slightly below that of the latest architecture.

WeLM-V3-258B-A22B was trained on a cluster of up to 1536 NVIDIA H800 GPUs. Evaluated in an equivalent self-deployed environment, the WeLM-V3-258B-A22B Base model performs comparably to the DeepSeek-V3 series of pre-trained models.

| Category | Remark | Benchmark | DeepSeek-V3 | DeepSeek-V3.1 | WeLM-258B |
| --- | --- | --- | --- | --- | --- |
| English | | MMLU | 87.4 | 88.0 | 87.3 |
| | | MMLU-Redux | 87.1 | 88.0 | 86.4 |
| | | MMLU-Redux2.0 | 89.9 | 90.4 | 90.2 |
| | | MMLU-Pro | 64.3 | 65.5 | 71.8 |
| | | SuperGPQA | 44.2 | 45.1 | 51.7 |
| | | SimpleQA | 27.4 | 27.3 | 29.8 |
| | | BBH-Fix | 90.8 | 91.5 | 90.8 |
| Chinese | | C-Eval | 90.0 | 90.4 | 90.3 |
| | | CMMLU | 88.8 | 89.0 | 91.0 |
| | | C-SimpleQA | 72.6 | 72.1 | 73.4 |
| Math | | MATH | 54.2 | 57.0 | 61.2 |
| | | GSM8K | 95.2 | 94.8 | 94.5 |
| Code | Evaluated with zero-shot and 1-shot approaches respectively; the higher value is reported. | EvalPlus | 70.8 | 69.9 | 73.9 |
| | | MultiPL-E | 62.9 | 64.9 | 73.4 |
| | | CRUX-Input | 63.6 | 61.5 | 67.3 |
| | | CRUX-Output | 76.4 | 74.5 | 84.5 |
| | | BigCode-full | 51.8 | 53.2 | 48.9 |
| | | BigCode-hard | 22.3 | 23.0 | 27.7 |
| | | LiveCodeBench-v6 | 24.2 | 26.4 | 37.4 |
| Long Sequence | | RULER-4k | 97.7 | 97.8 | 97.3 |
| | | RULER-8k | 96.8 | 97.2 | 97.0 |
| | | RULER-16k | 96.4 | 96.7 | 96.8 |
| | | RULER-32k | 95.4 | 96.2 | 96.1 |
| | | RULER-64k | 93.4 | 94.8 | 95.0 |
| | | RULER-128k | 90.9 | 92.9 | 92.2 |

Best performance on each benchmark is in bold and the second best is underlined

2. Cold Start Training

Post-training practice for LLMs generally rests on a basic assumption: a model that achieves better metrics in the SFT stage will also attain better final results in the subsequent reinforcement learning (RL) stage. Recently, Kang et al. (2025) found that excessive SFT, despite yielding “better” evaluation metrics, may compress the model’s policy exploration space, whereas the effectiveness of RL training relies heavily on the model exploring new policies and discovering better ones. LLMs that are over-trained during SFT and have “solidified” thinking patterns therefore struggle to achieve breakthroughs in the RL stage. Based on this, we set the goal of the SFT stage as constructing a cold-start model with strong exploration potential. This goal imposes a core requirement on the SFT training data: diversity of both instructions and CoT (chain-of-thought) responses.

Diversity of Instructions

To ensure instruction diversity, we build a standardized labeling and filtering pipeline that produces an instruction set centered on high quality and wide coverage. First, we apply quality control to the instruction samples, filtering out low-quality samples with ambiguous semantics or shallow instructions to ensure basic data reliability. Then, we assign open-domain labels to the quality-assured instructions and, using pre-defined thresholds, select instructions so that the label distribution is balanced and sufficiently diverse. Finally, to further reduce semantic redundancy, we apply semantic clustering and bucketing to all instructions and draw a fixed number of instructions from each cluster to form the final instruction set. Moreover, to avoid data leakage, we run comprehensive leakage detection at both document and clause granularity, using MinHash-LSH (Broder 1997) and semantic vector-based similarity filtering separately, which prevents data contamination and keeps the instruction set independent and valid.
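The MinHash side of the leakage check can be illustrated with a minimal sketch. It is not our production pipeline: it compares signatures pairwise and omits the LSH banding step. The function names, the character 5-gram shingling, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the lower-cased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One minimum per salted hash; each salt stands in for a permutation."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_leaked(instruction: str, benchmark_items: list[str], threshold: float = 0.8) -> bool:
    """Flag an instruction that is a near-duplicate of any benchmark item."""
    sig = minhash_signature(instruction)
    return any(
        estimated_jaccard(sig, minhash_signature(item)) >= threshold
        for item in benchmark_items
    )
```

At scale, the signatures would be split into bands and bucketed so that only candidates sharing a bucket are compared, which is the role of the LSH step.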

Diversity of CoT Responses

To let the model acquire problem-solving ability from multiple perspectives during the cold-start SFT stage, we also apply quality filtering and bucketing to the CoT responses. In addition, we propose the ConCISE technique (Qiao et al. 2025) to trim redundant thinking and verification steps in the CoT responses that arise from inference uncertainty. In this way, the instruction-tuned model moves beyond the traditional “calculation, verification, confirmation, output” inference paradigm and can employ more versatile strategies for different types of instructions.

Based on the above methods, we construct an SFT dataset on the scale of hundreds of thousands of samples, with its task and length distributions illustrated below:

Figure 1: Distribution of SFT data tasks and length.

Additionally, we randomly draw 10K samples and quantify their thinking patterns (see Figure 2). The results show that, with full-path coverage of 14 logical dimensions, the thinking patterns form a non-linear, complex network with highly divergent paths. In terms of topology, the graph density reaches 0.85 and the average out-degree is close to 12, indicating deep interaction among reasoning operators and reflecting the rich diversity of CoT responses and the adaptable logical decision-making captured in the dataset.

Figure 2: Topology map of the sampled SFT data’s CoT.
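The two topology figures can be related with a short calculation. The sketch below assumes a directed graph over the 14 logical dimensions in which self-transitions count as edges, so that density = E / n² and average out-degree = E / n = density × n. Under that assumed convention, a density of 0.85 gives 0.85 × 14 ≈ 11.9, in line with an out-degree close to 12. The function name and the toy graph are illustrative and are not taken from the analysis itself.

```python
def graph_stats(num_nodes: int, edges: set[tuple[int, int]]) -> tuple[float, float]:
    """Return (density, average out-degree) of a directed graph.

    Assumes self-loops are allowed, so there are num_nodes * num_nodes
    candidate edges.
    """
    density = len(edges) / (num_nodes * num_nodes)
    avg_out_degree = len(edges) / num_nodes
    return density, avg_out_degree


# Toy graph: each of 14 nodes links to 12 targets (itself included).
toy_edges = {(src, (src + step) % 14) for src in range(14) for step in range(12)}
density, avg_out = graph_stats(14, toy_edges)
```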

3. RL Training

In the RL stage, we use the GRPO (Shao et al. 2024) optimization algorithm and adopt a two-stage training process: mathematical tasks first, then mixed tasks. Below, we share our attempts and findings in data optimization and in strategies for stable RL training.
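For readers less familiar with GRPO, its core idea is that advantages are computed relative to a group of responses sampled for the same prompt, without a learned value model. A minimal sketch of that normalization follows; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward by the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, with binary verifier rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason verifier accuracy and data difficulty matter in the sections that follow.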

Mathematical Tasks

The core of building a strong general-purpose LLM is to break through task specificity and transfer reasoning and decision-making capabilities across domains. Mathematical tasks are highly structured, logically clear, and come with complete chains of thought, so RL on mathematics can effectively strengthen the LLM’s reasoning ability and rigor in decision-making, which in turn helps the model generalize to other tasks. For this reason, the first RL stage concentrates on mathematical tasks.

Data

We divide the RL training data for mathematical tasks into four categories by answer format: multiple-choice questions, true/false questions, Q&A questions, and proof questions. Multiple-choice and true/false questions can be answered correctly by random guessing during RL training, which easily distorts the training feedback. We therefore rewrite them into Q&A form to avoid the training bias caused by lucky guesses. Training on proof questions requires precise verification of every deduction step, and the verification logic is highly complex, so proof questions are filtered out for now and not included in this training stage.

Verifier

Mathematical answers come in diverse forms, including scientific notation, LaTeX, and Markdown, which makes accurate and stable verification challenging. Some existing work (Yu et al. 2025) simplifies verification by converting answers into easy-to-verify formats (such as pure numbers or single variables). However, this substantially lowers the difficulty of the questions and discards high-quality, effective training data. To preserve the original complexity of the questions, make full use of high-quality mathematical data, and improve the model’s ability to handle diverse mathematical answers in practice, we choose not to simplify the answers. Instead, we propose a more robust mixed verification approach, a framework that integrates rule-based and model-based verification. Because model-based verification can be inconsistent, in practice we employ a hierarchical strategy: only samples judged incorrect by rule-based verification are further verified by a model, which improves overall verification accuracy. Experiments on the HardVerify-Math test set (Xu et al. 2025) show that this mixed verification framework reaches 94.4% accuracy, clearly above either single verification method (76.8% for rule-based and 78.4% for model-based verification).

| Method | Rule Verification | Model Verification | Mixed Verification |
| --- | --- | --- | --- |
| Accuracy (%) | 76.8 | 78.4 | 94.4 |

Table 1: Accuracy comparison of different mathematical accuracy verification modes on HardVerify-Math test set.
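The hierarchical strategy can be sketched as a rule check followed by a model fallback. This is an illustrative sketch, not our verifier: `rule_verify` is a deliberately small stand-in for a rule-based checker, and `model_verify` is a hypothetical callable that would wrap a call to a verifier model.

```python
from fractions import Fraction
from typing import Callable


def rule_verify(pred: str, gold: str) -> bool:
    """Exact match after light normalisation, or numeric equality if both parse."""
    def norm(s: str) -> str:
        return s.strip().strip("$").replace(" ", "").lower()

    p, g = norm(pred), norm(gold)
    if p == g:
        return True
    try:
        return Fraction(p) == Fraction(g)  # handles "0.5", "1/2", "5e-1"
    except (ValueError, ZeroDivisionError):
        return False  # not parseable as a plain number, e.g. LaTeX


def hybrid_verify(pred: str, gold: str,
                  model_verify: Callable[[str, str], bool]) -> bool:
    """Accept rule-verified answers; consult the model only on rule failures."""
    if rule_verify(pred, gold):
        return True
    return model_verify(pred, gold)
```

Because the model is consulted only after a rule failure, it can rescue answers the rules wrongly reject (for example, an equivalent LaTeX fraction) but never overturns an answer the rules already accept.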

Mixed Tasks

In the mixed-task RL stage, our data include both reasoning tasks (e.g., mathematics, programming, and logical reasoning) and general domain tasks (e.g., STEM, creative writing, and document-based knowledge Q&A).

For reasoning tasks, final correctness (used as the main reward) is determined by both rule-based and model-based verification, complemented by format rewards to improve output quality. For general-domain tasks, answers can take various forms and are often not unique; we therefore additionally introduce a generative reward model (GRM) as a verifier to provide more accurate feedback signals during RL. Given the broad coverage of the RL training data, the challenge is to achieve strong performance on both reasoning-intensive and general-domain tasks while avoiding the loss of specific skills and promoting broader generalization. To address this, we regulate the proportion of data for each task to balance reasoning and general-domain tasks. This strategy ensures that the model continuously consolidates verifiable skills (such as mathematics, programming, and logical reasoning) while steadily improving on general tasks, ranging from complex instruction following to open-ended chain-of-thought reasoning.

STEM Tasks

Taking STEM tasks as an example: to strengthen reasoning in difficult subjects such as physics and chemistry, we build a knowledge-point system from textbooks, professional literature, and open-source Q&A data, with extensive coverage and considerable depth and difficulty, and use it to guide data synthesis. We also apply a voting-based filter with multiple strong LLMs to remove samples with logical discrepancies. Experiments show that models trained on a combination of high-difficulty synthetic data and selected natural data significantly outperform those trained on either data source alone, improving the model’s capacity for complex reasoning.

To minimize reward hacking, we mainly use data in Q&A form and iteratively improve both the GenRM and the policy model by defining refined rubrics and expanding the rollout sampling space. This scoring mechanism, which evaluates both the thought process and the final answer, significantly improves the stability and final performance of RL training. On our private high-difficulty test set, the trained WeChat-GRM significantly outperforms plain LLM-as-judge baselines.

| Model | Accuracy (%) |
| --- | --- |
| DeepSeek-V3-0324 | 57.9 |
| QwQ-32B | 58.3 |
| DeepSeek-R1-0528 | 73.6 |
| WeChat-GRM-32B | 78.5 |

Table 2: Performance of different Verifiers on our high difficulty test set.

Stability Training Strategy

Next, we share our experience with strategies for stable training.

Double-precision Expert Router

In RL training of MoE models, the router is responsible for distributing tokens to different experts. As the number of experts increases, the routing scores of different experts are often very close. In this situation, truncation and rounding errors in FP32 can lead to incorrect Top-K expert selection and thereby trigger completely different computation paths. To address this, we compute routing in FP64, using the 52-bit mantissa of double-precision floats (compared with FP32’s 23 bits) to distinguish expert scores accurately and avoid expert misselection caused by precision drift.
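A toy example makes the failure mode concrete. The sketch below rounds a set of hypothetical routing scores to FP32 with the standard `struct` module and shows that two scores that are distinct in FP64 collapse to a tie, which changes the Top-1 expert. The score values and helper names are made up for illustration; in real training the discrepancy comes from accumulated rounding inside the router computation, not from a single cast.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k largest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


# Experts 0 and 1 differ by 1e-9, below FP32 resolution near 0.25 (about 3e-8).
scores_fp64 = [0.25, 0.25 + 1e-9, 0.10, 0.05]
scores_fp32 = [to_fp32(s) for s in scores_fp64]

chosen_fp64 = top_k(scores_fp64, 1)  # expert 1 has the larger score
chosen_fp32 = top_k(scores_fp32, 1)  # tie after rounding, so expert 0 is picked
```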

Abnormal Sequence Mask

Inconsistency between training and inference is a major challenge in RL training of LLMs, particularly for highly sparse MoE models. Inference frameworks such as SGLang and vLLM often heavily optimize operator scheduling, precision strategy, and other aspects for inference efficiency, so aligning their implementation details with the training engine is difficult. Such gaps produce rollouts that deviate significantly from the current policy distribution. We therefore propose an anomalous-sequence detection mechanism that combines text characteristics with KL (Kullback-Leibler) divergence and adaptively masks rollouts that deviate significantly from the current policy (Liu et al. 2025). This effectively reduces the interference of abnormal samples with policy updates and improves training stability and convergence efficiency.
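The KL part of such a filter can be sketched as follows. This is a simplified illustration that covers only the divergence signal, not the text-characteristic signals. It assumes per-token log-probabilities of the sampled tokens are available from both engines, uses the non-negative estimator (r − 1) − log r with r = p_train / p_infer, and applies a fixed threshold instead of an adaptive one. The function names and the threshold value are hypothetical.

```python
import math


def sequence_kl_estimate(logp_infer: list[float], logp_train: list[float]) -> float:
    """Per-token average of (r - 1) - log r, with r = p_train / p_infer.

    For tokens sampled from the inference engine, this estimates
    KL(inference || training) and is never negative.
    """
    total = 0.0
    for li, lt in zip(logp_infer, logp_train):
        log_ratio = lt - li
        total += math.exp(log_ratio) - 1.0 - log_ratio
    return total / len(logp_infer)


def keep_mask(batch: list[tuple[list[float], list[float]]],
              kl_threshold: float = 0.05) -> list[bool]:
    """True for sequences that stay in the policy update, False for masked ones."""
    return [sequence_kl_estimate(li, lt) <= kl_threshold for li, lt in batch]
```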

Truncated Importance Sampling

As inference length and the number of training steps increase, the probability distribution mismatch between the RL training engine and the inference engine accumulates, eventually leading to RL training collapse (Yao et al. 2025). To alleviate this, we use the IcePop (Zhao et al. 2025) correction technique, which rectifies the mismatch between training and inference through bidirectional truncation and dynamic masking. Bidirectional truncation handles tokens whose training probability is either significantly higher or significantly lower than their inference probability; dynamic masking excludes tokens whose probability ratio exceeds the threshold from gradient computation.
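A minimal sketch of the two-sided masking idea follows. It computes the per-token ratio p_train / p_infer and zeroes the weight of tokens whose ratio falls outside a band. The bounds `low` and `high` are placeholder values, and keeping the ratio itself as the weight for in-band tokens is an assumption about how the correction is applied, not a statement of the exact IcePop formulation.

```python
import math


def icepop_style_weights(logp_train: list[float], logp_infer: list[float],
                         low: float = 0.5, high: float = 5.0) -> list[float]:
    """Per-token loss weights: the probability ratio inside [low, high], else 0."""
    weights = []
    for lt, li in zip(logp_train, logp_infer):
        ratio = math.exp(lt - li)  # p_train / p_infer for the sampled token
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights


# Token 0 agrees across engines; token 1 is far more likely under training;
# token 2 is far less likely under training. Only token 0 keeps a gradient.
weights = icepop_style_weights(
    logp_train=[math.log(0.5), math.log(0.9), math.log(0.01)],
    logp_infer=[math.log(0.5), math.log(0.1), math.log(0.5)],
)
```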

Entropy Mechanism

In RL training, LLMs tend to sacrifice uncertainty (entropy) for short-term rewards, so policy entropy declines sharply early in training and the model becomes trapped in an “overconfident” state. We use the Clip-Cov (Cui et al. 2025) technique, which avoids rapid collapse of policy entropy by limiting gradient updates on tokens with high covariance. This mechanism pushes the model out of its comfort zone and preserves its capacity for continued exploration, helping the policy break through the entropy bottleneck and significantly improving reasoning performance.
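The idea of limiting updates on high-covariance tokens can be sketched as below. Each token gets a covariance term, the product of its centered log-probability and its centered advantage, and a small fraction of tokens with the largest terms is excluded from the gradient. The deterministic top-fraction cut and the `clip_fraction` value are simplifications assumed here for brevity; they are not the exact selection rule of Cui et al.

```python
def clip_cov_mask(logps: list[float], advantages: list[float],
                  clip_fraction: float = 0.1) -> list[bool]:
    """True for tokens that keep their gradient, False for clipped tokens."""
    n = len(logps)
    mean_lp = sum(logps) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the covariance between log-prob and advantage.
    cov = [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logps, advantages)]
    n_clip = int(n * clip_fraction)
    clipped = set(sorted(range(n), key=lambda i: cov[i], reverse=True)[:n_clip])
    return [i not in clipped for i in range(n)]


# Token 0 is both confident and highly rewarded, so it has the largest term.
mask = clip_cov_mask(
    logps=[-0.1, -2.0, -1.0, -1.1, -0.9],
    advantages=[2.0, -1.0, 0.0, 0.1, -0.1],
    clip_fraction=0.2,
)
```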

Early Truncation Strategy Based on Repeat Detection

In RL training, we observe that before training collapses, the model often falls into long loops of repetitive generation during inference, and the resulting excessive gradients can disrupt training stability. We therefore detect and terminate such generations early, before the repetitive segments fully form, rather than penalizing repetitive text after it has been generated. Because simple string matching struggles to cover diverse repetition patterns, we use a heuristic based on token prediction probability: once a model falls into a repetitive loop, the prediction probability of the repeated tokens increases significantly. Following MiniMax (2025), we set an early truncation rule: stop generation immediately if the prediction probability of N consecutive tokens (N is a pre-defined threshold) exceeds 0.99. In the mixed-task RL stage, we further find that the appropriate value of N differs significantly across tasks, so we adopt a task-specific early truncation strategy with differentiated thresholds. This method improves training stability and training efficiency by eliminating long-tail cases of repetitive generation.
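The truncation rule itself is simple enough to state in a few lines. The sketch below scans token probabilities as they are generated and reports the position at which N consecutive tokens have exceeded 0.99. The per-task values of N in `TASK_THRESHOLDS` are hypothetical placeholders, not the values used in our training.

```python
from typing import Optional

# Hypothetical per-task run lengths; the actual thresholds are tuned per task.
TASK_THRESHOLDS = {"math": 8, "code": 16, "writing": 4}


def truncation_index(token_probs: list[float], n_consecutive: int,
                     prob_threshold: float = 0.99) -> Optional[int]:
    """Index of the token at which generation should stop, or None."""
    run = 0
    for i, p in enumerate(token_probs):
        run = run + 1 if p > prob_threshold else 0
        if run >= n_consecutive:
            return i
    return None


def should_stop(task: str, token_probs: list[float]) -> bool:
    """Apply the task-specific run length to a sequence of token probabilities."""
    return truncation_index(token_probs, TASK_THRESHOLDS[task]) is not None
```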

4. Experiments

Benchmarks

We conduct a systematic evaluation of the post-trained LLMs on a range of public benchmarks covering knowledge, alignment, math and code, reasoning, role play, and tool use. The individual benchmarks are listed in the result tables below.

In addition, we construct several private test sets targeting specific business scenarios and common failure modes of large models:

  • Text Rewriting: Rewriting text based on a user’s vague instructions.
  • Time Reasoning: Scheduling future time points from given time intervals to meet the user’s timing needs.
  • Count: Counting occurrences of a given character in sentences.

All models are evaluated with the same settings: temperature 0.6, top_p 0.95, and a maximum generation length of 32K. For mathematical, code, and scientific question-answering tasks, we report the average over multiple samples: AIME25 is sampled 64 times, IMO-AnswerBench and HMMT Nov 2025 4 times each, LiveCodeBench v6 8 times, and GPQA-Diamond 8 times.
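The multi-sample protocol amounts to averaging per-question accuracy over k samples. A minimal sketch follows, with the sampling settings collected in plain dictionaries; the variable and function names are illustrative and do not come from our evaluation harness.

```python
DECODING = {"temperature": 0.6, "top_p": 0.95}

SAMPLES_PER_BENCHMARK = {
    "AIME25": 64,
    "IMO-AnswerBench": 4,
    "HMMT Nov 2025": 4,
    "LiveCodeBench v6": 8,
    "GPQA-Diamond": 8,
}


def avg_at_k(correct: list[list[bool]]) -> float:
    """Mean accuracy in percent; correct[q][s] marks sample s of question q."""
    per_question = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_question) / len(per_question)


# Two questions with two samples each: one answered half the time, one always.
score = avg_at_k([[True, False], [True, True]])
```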

Experimental Results

Based on the WeLM-V3-258B-A22B base model, we train a version of the Thinking model and a version of the Instruct model using 1024 H20 GPUs. We present the experimental results of these two models as follows:

Thinking Model Comparison

To verify the method efficiently with moderate resources, we limit the maximum inference length to 32K tokens during the RL training stage, and we apply the same 32K limit when testing the WeLM model. The baseline LLMs may have used longer inference lengths during their RL training for better performance, so we report their results with maximum inference lengths of both 32K tokens (DeepSeek-R1-0528-32K and Qwen3-235B-A22B-Thinking-2507-32K) and 128K tokens (DeepSeek-R1-0528-128K and Qwen3-235B-A22B-Thinking-2507-128K). Bold text indicates that the WeLM model outperforms both DeepSeek-R1-0528-32K and Qwen3-235B-A22B-Thinking-2507-32K, and underlined text indicates that it outperforms only DeepSeek-R1-0528-32K.

| Category | Benchmark | WeLM-258B-A22B-Thinking-32K | DeepSeek-R1-0528-32K | Qwen3-235B-A22B-Thinking-2507-32K | DeepSeek-R1-0528-128K | Qwen3-235B-A22B-Thinking-2507-128K |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge | MMLU-Pro | 84.1 | 83.8 | 83.3 | 84.1 | 83.8 |
| | Chinese-SimpleQA | 74.3 | 69.7 | 73.5 | 68.7 | 79.6 |
| | SimpleQA-Verified | 34.4 | 27.4 | 51.8 | 29.3 | 52.1 |
| | GPQA-Diamond | 79.3 | 78.9 | 80.6 | 79.4 | 81.1 |
| Alignment | LiveBench-20241125 | 77.0 | 78.2 | 81.1 | 79.0 | 81.7 |
| | IFEval-Pstrict | 90.9 | 80.5 | 90.3 | 83.5 | 90.7 |
| | Multi-Challenge | 55.9 | 52.3 | 45.9 | 51.9 | 57.9 |
| | COLLIE | 93.0 | 74.7 | 78.6 | 77.1 | 80.6 |
| | Writing Bench | 79.0 | 83.8 | 87.7 | 83.7 | 87.7 |
| | Inhouse-Text Rewriting | 75.0 | 75.0 | 81.0 | 80.0 | 79.0 |
| Math & Code | AIME25 | 82.2 | 83.3 | 77.3 | 87.8 | 92.2 |
| | HMMT Nov 2025 | 80.8 | 66.7 | 70.8 | 85.0 | 83.9 |
| | IMO-AnswerBench | 59.6 | 50.4 | 57.8 | 69.4 | 74.5 |
| | LiveCodeBench v6-python | 67.0 | 62.0 | 66.2 | 75.7 | 69.3 |
| Reasoning | DROP | 87.8 | 87.1 | 88.6 | 85.7 | 88.6 |
| | ZebraLogic | 96.3 | 94.4 | 96.8 | 97.0 | 97.8 |
| | Inhouse-Time Reasoning | 67.0 | 69.4 | 68.9 | 71.4 | 68.5 |
| | Inhouse-Count | 73.8 | 69.9 | 68.7 | 73.3 | 68.3 |
| Role Play | SocialBench | 82.4 | 82.4 | 80.4 | 82.3 | 80.6 |
| | RoleBench | 20.5 | 13.9 | 14.2 | 13.9 | 14.4 |
| | RoleMRC | 65.7 | 68.9 | 71.6 | 80.8 | 85.4 |
| Tool Use | AgentIF-CSR | 62.7 | 64.4 | 63.7 | 63.0 | 61.8 |
| | BFCL-v4-singleturn-live | 84.7 | 78.4 | 83.7 | 77.3 | 82.5 |
| | BFCL-v4-singleturn-nonlive | 90.2 | 85.3 | 87.4 | 86.3 | 88.0 |
| | BFCL-v4-multiturn | 45.1 | 34.9 | 51.5 | 37.2 | 52.8 |
| | BFCL-v4-memory | 31.0 | 36.1 | 28.6 | 32.9 | 28.4 |
| | Tau-2 bench (telecom) | 58.8 | 36.8 | 46.5 | 35.1 | 43.9 |
| | Tau-2 bench (airline) | 58.0 | 64.0 | 56.0 | 60.0 | 60.0 |
| | Tau-2 bench (retail) | 64.0 | 66.7 | 69.3 | 62.3 | 73.7 |

Instruct Model Comparison

For the Instruct model, we restrict the maximum inference length of both the WeLM model and the baseline models to 32K tokens during evaluation. Bold text indicates that the WeLM model outperforms both DeepSeek-V3.2-Instruct and Qwen3-235B-A22B-Instruct-2507, and underlined text indicates that it outperforms only DeepSeek-V3.2-Instruct.

| Category | Benchmark | WeLM-258B-A22B-Instruct | DeepSeek-V3.2-Instruct | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- | --- | --- | --- |
| Knowledge | MMLU-Pro | 84.1 | 84.0 | 79.0 |
| | Chinese-SimpleQA | 72.2 | 70.3 | 84.8 |
| | Chinese-SimpleQA-RAG | 97.0 | 97.3 | 97.0 |
| | SimpleQA-Verified | 24.4 | 26.3 | 53.8 |
| | SimpleQA-Verified-RAG | 92.5 | 94.2 | 94.0 |
| | GPQA-Diamond | 77.6 | 76.5 | 70.0 |
| Alignment | LiveBench-20241125 | 69.8 | 73.9 | 76.4 |
| | IFEval-Pstrict | 87.8 | 89.8 | 89.4 |
| | Multi-Challenge | 44.8 | 49.9 | 51.4 |
| | COLLIE | 61.4 | 61.2 | 57.1 |
| | Writing Bench | 84.2 | 81.5 | 84.6 |
| | Inhouse-Text Rewriting | 74.0 | 67.0 | 70.0 |
| Math & Code | AIME25 | 78.3 | 56.1 | 68.6 |
| | HMMT Nov 2025 | 70.0 | 56.7 | 65.0 |
| | IMO-AnswerBench | 60.3 | 46.9 | 59.1 |
| | LiveCodeBench-v6 | 53.7 | 53.2 | 46.7 |
| Reasoning | DROP | 86.0 | 86.0 | 87.2 |
| | ZebraLogic | 88.3 | 84.5 | 94.1 |
| | Inhouse-Time Reasoning | 49.4 | 47.1 | 49.9 |
| | Inhouse-Count | 52.1 | 61.4 | 63.1 |
| Role Play | SocialBench | 86.1 | 84.9 | 84.0 |
| | RoleBench | 23.6 | 21.1 | 20.8 |
| | RoleMRC | 80.7 | 78.7 | 77.4 |
| Tool Use | AgentIF-CSR | 56.1 | 64.5 | 63.3 |
| | BFCL-v4-singleturn-live | 81.7 | 54.1 | 83.3 |
| | BFCL-v4-singleturn-nonlive | 89.3 | 35.0 | 89.5 |
| | BFCL-v4-multiturn | 43.8 | 38.0 | 41.5 |
| | BFCL-v4-memory | 23.4 | 60.2 | 28.2 |
| | Tau-2 bench (telecom) | 57.9 | 72.8 | 32.5 |
| | Tau-2 bench (airline) | 60.0 | 56.0 | 50.0 |
| | Tau-2 bench (retail) | 71.9 | 77.2 | 74.6 |

The above experimental results indicate:

  • With a 32K inference length limit, the WeLM-258B-A22B-Thinking model achieves competitive results on a variety of benchmarks covering mathematics, reasoning, knowledge, and alignment tasks. With longer inference lengths (such as 128K), the baseline models achieve significantly better results on some difficult tasks (such as IMO-AnswerBench and LiveCodeBench v6). We observe similar trends in our ongoing long-inference RL training experiments.
  • The WeLM-258B-A22B-Instruct model is competitive on several benchmark suites, including mathematics, reasoning, knowledge, and role playing.

5. Conclusion and Future Work

This blog post introduces the methods and data strategies from our post-training experience with the WeLM-258B-A22B MoE model. We completed this first round of post-training exploration with moderate resources:

  • Our Instruct model achieves relatively competitive results in mathematics, reasoning, knowledge, role playing, and basic tool calling.
  • Our Thinking model limits the maximum inference length to 32K tokens during the RL training stage. This constraint lets us explore successful methodologies in a relatively short time, although it also constrains the model’s performance on harder reasoning tasks.

We are still exploring RL methods to improve LLMs’ reasoning and tool-use capabilities, and we will continue to explore ways to increase the “intelligence density” of thought chains.

References

AIME. 2025. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.

Balunović, Mislav, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. MathArena: Evaluating LLMs on Uncontaminated Math Competitions. SRI Lab, ETH Zurich. https://matharena.ai/.

Barres, Victor, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. 2025. arXiv Preprint arXiv:2506.07982.

Broder, Andrei Z. 1997. “On the Resemblance and Containment of Documents.” Compression and Complexity of Sequences 1997. Proceedings, 21–29.

Chen, Hongzhan, Hehong Chen, Ming Yan, et al. 2024. “Socialbench: Sociality Evaluation of Role-Playing Conversational Agents.” Findings of the Association for Computational Linguistics: ACL 2024, 2108–26.

Cui, Ganqu, Yuchen Zhang, Jiacheng Chen, et al. 2025. “The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.” arXiv Preprint arXiv:2505.22617.

Deshpande, Kaustubh, Ved Sirdeshmukh, Johannes Baptist Mols, et al. 2025. “Multichallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier Llms.” Findings of the Association for Computational Linguistics: ACL 2025, 18632–702.

Dua, Dheeru, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. “DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning over Paragraphs.” arXiv Preprint arXiv:1903.00161.

Haas, Lukas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. 2025. “Simpleqa Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge.” arXiv Preprint arXiv:2509.07968.

He, Yancheng, Shilong Li, Jiaheng Liu, et al. 2025. “Chinese Simpleqa: A Chinese Factuality Evaluation for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 19182–208.

Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, et al. 2023. “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models.” Advances in Neural Information Processing Systems 36: 62991–3010.

Jain, Naman, King Han, Alex Gu, et al. 2024. “Livecodebench: Holistic and Contamination Free Evaluation of Large Language Models for Code.” arXiv Preprint arXiv:2403.07974.

Kang, Feiyang, Michael Kuchnik, Karthik Padthe, et al. 2025. Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead. https://arxiv.org/abs/2510.01624.

Lin, Bill Yuchen, Ronan Le Bras, Kyle Richardson, et al. 2025. “Zebralogic: On the Scaling Limits of Llms for Logical Reasoning.” arXiv Preprint arXiv:2502.01100.

Liu, Aixin, Aoxue Mei, Bangcai Lin, et al. 2025. “DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models.” arXiv Preprint arXiv:2512.02556.

Lu, Junru, Jiazheng Li, Guodong Shen, et al. 2025. “Rolemrc: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following.” arXiv Preprint arXiv:2502.11387.

Luong, Thang, Dawsen Hwang, Hoang H. Nguyen, et al. 2025. “Towards Robust Mathematical Reasoning.” Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/2025.emnlp-main.1794/.

MiniMax. 2025. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. https://arxiv.org/abs/2506.13585.

Patil, Shishir G, Huanzhi Mao, Fanjia Yan, et al. n.d. “The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models.” Forty-Second International Conference on Machine Learning.

Qi, Yunjia, Hao Peng, Xiaozhi Wang, et al. 2025. “Agentif: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios.” arXiv Preprint arXiv:2505.16944.

Qiao, Ziqing, Yongheng Deng, Jiali Zeng, et al. 2025. “ConCISE: Confidence-Guided Compression in Step-by-Step Efficient Reasoning.” Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (Suzhou, China), November, 8010–29. https://doi.org/10.18653/v1/2025.emnlp-main.405.

Rein, David, Betty Li Hou, Asa Cooper Stickland, et al. 2024. “Gpqa: A Graduate-Level Google-Proof q&a Benchmark.” First Conference on Language Modeling.

Shao, Zhihong, Peiyi Wang, Qihao Zhu, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300.

Wang, Noah, Zy Peng, Haoran Que, et al. 2024. “Rolellm: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models.” Findings of the Association for Computational Linguistics: ACL 2024, 14743–77.

Wang, Yubo, Xueguang Ma, Ge Zhang, et al. 2024. “Mmlu-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.” Advances in Neural Information Processing Systems 37: 95266–90.

White, Colin, Samuel Dooley, Manley Roberts, et al. 2024. “Livebench: A Challenging, Contamination-Free Llm Benchmark.” arXiv Preprint arXiv:2406.19314.

Wu, Yuning, Jiahao Mei, Ming Yan, et al. 2025. “Writingbench: A Comprehensive Benchmark for Generative Writing.” arXiv Preprint arXiv:2503.05244.

Xu, Zhangchen, Yuetai Li, Fengqing Jiang, et al. 2025. TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning. https://arxiv.org/abs/2505.14625.

Yao, Feng, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. 2025. “Your Efficient RL Framework Secretly Brings You Off-Policy RL Training.” In Feng Yao’s Notion. https://fengyao.notion.site/off-policy-rl.

Yao, Shunyu, Howard Chen, Austin W Hanjie, Runzhe Yang, and Karthik Narasimhan. 2023. “Collie: Systematic Construction of Constrained Text Generation Tasks.” arXiv Preprint arXiv:2307.08689.

Yu, Qiying, Zheng Zhang, Ruofei Zhu, et al. 2025. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. https://arxiv.org/abs/2503.14476.

Zhao, Xin, Yongkang Liu, Kuan Xu, et al. 2025. Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop! https://ringtech.notion.site/icepop.