
In a previous blog post, we introduced the architecture design and practical experience of pretraining using WeLM-V4-80B-A3B as an example. This post presents a different path to improving model capability — without increasing the parameters of the Transformer backbone, we replicate the Vocab Embedding $n$ times to expand the sequence length by a factor of $n$, granting each token $n$ times the effective computation in a single forward pass. Across experiments at multiple model scales, this method yields consistent loss reductions and benchmark improvements. On larger-scale models, significant gains are observed after only a modest number of continued training steps.

We have open-sourced the Qwen3-8B-based Dense models on Hugging Face. Please refer to the GitHub repository for local deployment with SGLang.

Motivation

In sequence modeling, the representational capacity of a model’s hidden states determines its ultimate performance. Given a fixed model size, enriching the hidden states generally leads to better outcomes — for example, by adopting a narrower-and-deeper architecture (Kim et al. 2023) or by introducing Fine-grained MoE with a larger top-k (Dai et al. 2024).

With the advent of RLVR (DeepSeek-AI 2025) and various Thinking models, extending Test-Time Compute (Snell et al. 2024) through longer Chains of Thought (CoT) (Wei et al. 2022) has been shown to substantially improve the quality of model outputs. However, such explicit reasoning relies heavily on sequential, auto-regressive generation. Forcing models to “think” in the discrete natural-language space introduces significant problem-solving latency, and natural language itself is often an inefficient carrier for the intricate internal planning of neural networks.

To break free from the constraints of discrete token generation, a line of research (e.g., Pause Tokens (Goyal et al. 2024), Looped Transformers (Giannou et al. 2023), and Chain of Continuous Thought (Hao et al. 2024)) has begun exploring how to let models reason within a continuous latent space. Increasing the computational depth and activation volume of hidden states lets models explore reasoning paths more flexibly (e.g., implicit breadth-first search) without emitting actual tokens. However, existing approaches are often limited by sequential computation or incur prohibitive memory overhead.

Against this backdrop, we propose a novel scaling paradigm — Hidden Decoding. Rather than simply deepening or widening the model (which brings heavy activation storage costs), we introduce parallel expansion along the sequence-length dimension, allowing the model to accumulate reasoning depth in the latent space in a parallelized manner. This preserves the loss reduction and accuracy gains afforded by longer contexts while eliminating the latency penalty of traditional CoT serial decoding, achieving a comprehensive improvement in both inference efficiency and model capability through high arithmetic intensity.

Method

Modeling Process

In common thinking models, to pursue higher answer accuracy, the output side typically generates an unbounded number of reasoning tokens (rtokens), and the model relies solely on increasing context length to improve its capability. The core idea of Hidden Decoding is to break the limitation of serial generation: we use the Transformer’s input layer to model reasoning embeddings for reasoning tokens within the continuous latent space. We assume that reasoning representations are uniformly related to the prefix tokens, and thus distribute the reasoning depth uniformly across every token, “folding” the reasoning process and re-arranging it into the main sequence. During inference, thanks to the low memory-access cost of embedding parameters, we scale them at minimal expense.

Figure 1: From CoT to Hidden Decoding

In one sentence: prepare $n$ distinct Embedding matrices to encode the same token sequence $n$ times, interleave the results, and feed them as a sequence of $n\times$ length into the same Transformer. When we simplify the tag inputs during pretraining and compute the loss over all (target) tokens, we arrive at the final form of Hidden Decoding. In this process, the model “thinks $n$ times” about each token in a single forward pass — each pass can attend to the intermediate results of all preceding passes via causal attention, and the final prediction is taken from the last pass ($E_n$), which has the richest context.

Example with n=2

Given an original token sequence $X = (x_1, x_2, \dots, x_L)$, we introduce two independent Embedding matrices $E_1$ and $E_2$. The sequence is interleaved by position to form a new sequence $S$ of length $2L$: $$ S = (E_1(x_1), E_2(x_1), E_1(x_2), E_2(x_2), \dots, E_1(x_L), E_2(x_L)) $$

Taking the sequence [A, B, C] as an example, the input to the Transformer is structured as follows:

Position:       0       1       2       3       4       5
Input seq S:   E₁(A)   E₂(A)   E₁(B)   E₂(B)   E₁(C)   E₂(C)
RoPE pos:       0       1       2       3       4       5
Target:         -      → B      -      → C      -      → D

The entire process uses standard Causal Attention and contiguous Rotary Position Embeddings (RoPE) (Su et al. 2024). The model computes the next-token prediction loss only at the last Embedding of each token (i.e., $E_2$), while $E_1$ participates only in the forward pass.
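The layout above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the function name `interleaved_layout` and the `next_token` argument (standing in for the token that follows the sequence, "D" in the table) are assumptions made here, not part of any released code.

```python
def interleaved_layout(tokens, next_token, n=2):
    """Return (position, input label, target) for every physical position."""
    rows = []
    for i, tok in enumerate(tokens):          # i is 0-based in this sketch
        for k in range(1, n + 1):             # k-th embedding of token i
            position = i * n + (k - 1)
            if k == n:
                # Only the last embedding of a token carries a target.
                target = tokens[i + 1] if i + 1 < len(tokens) else next_token
            else:
                target = None                 # forward pass only, no loss
            rows.append((position, f"E{k}({tok})", target))
    return rows


rows = interleaved_layout(["A", "B", "C"], next_token="D", n=2)
for position, label, target in rows:
    print(position, label, "-" if target is None else f"-> {target}")
```

Because positions are simply counted along the interleaved sequence, the RoPE position of each row equals its physical position.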

Under this mechanism, the context available at different positions naturally forms a progressive hierarchy:

Position 0 $E_1(A)$: attends to $\lbrace E_1(A)\rbrace$, performing preliminary “thinking.”
Position 1 $E_2(A)$: attends to $\lbrace E_1(A), E_2(A)\rbrace$, predicting B with a richer context.
Position 2 $E_1(B)$: attends to $\lbrace E_1(A), E_2(A), E_1(B)\rbrace$, continuing to “think.”
Position 3 $E_2(B)$: attends to $\lbrace E_1(A), E_2(A), E_1(B), E_2(B)\rbrace$, predicting C.

The key property is that later Embeddings always have access to richer context. $E_1$ is responsible for preliminary feature extraction and implicit reasoning, while $E_2$ absorbs the state from $E_1$ and is responsible for the final prediction.
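As a small check of this hierarchy, the sketch below lists what a position can see under a plain causal mask. The helper names are hypothetical; the only assumption is standard causal attention over the interleaved sequence, as described above.

```python
def visible_inputs(tokens, n, position):
    """Inputs that a given physical position can attend to under a causal mask."""
    labels = [f"E{k}({tok})" for tok in tokens for k in range(1, n + 1)]
    return labels[: position + 1]


def context_at_prediction(i, n):
    """Number of positions visible to E_n(x_i), with i counted from 1."""
    return i * n


print(visible_inputs(["A", "B", "C"], n=2, position=3))
print(context_at_prediction(i=2, n=2))
```

In other words, the position that predicts the token after $x_i$ sees $n \cdot i$ inputs, compared with $i$ inputs in an unexpanded model.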

Generalization

Extending the above logic to an arbitrary expansion factor $n$:

For an input sequence $X$, prepare $n$ independent Embedding matrices $E_1, \dots, E_n$. In the new sequence $S$, the input at physical position $t = (i-1)n + (k-1)$ is the $k$-th Embedding representation of the $i$-th token: $$ S_t = E_k(x_i) \quad (1 \le k \le n) $$
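The index mapping can be sketched in plain Python as follows. Embedding tables are represented as nested lists with toy sizes; all names (`to_physical`, `from_physical`, `hidden_decoding_inputs`) are illustrative assumptions rather than an actual API.

```python
import random


def to_physical(i, k, n):
    """1-based token index i and embedding index k -> 0-based position t."""
    return (i - 1) * n + (k - 1)


def from_physical(t, n):
    """Inverse mapping: 0-based position t -> 1-based (i, k)."""
    return t // n + 1, t % n + 1


def hidden_decoding_inputs(token_ids, tables):
    """Interleave the n embedding lookups of each token: S_t = E_k(x_i)."""
    n = len(tables)
    return [tables[k][tok] for tok in token_ids for k in range(n)]


# Toy sizes chosen only for illustration.
rng = random.Random(0)
vocab, dim, n = 5, 3, 4
tables = [[[rng.random() for _ in range(dim)] for _ in range(vocab)]
          for _ in range(n)]
token_ids = [2, 0, 4]
S = hidden_decoding_inputs(token_ids, tables)
print(len(S))  # n times the original length
```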

In a single forward pass with standard full causal attention, the model computes the cross-entropy loss only at the last position $E_n(x_i)$ of each token, predicting $x_{i+1}$ through the shared LM Head. The first $n-1$ Embeddings serve as intermediate steps for implicit reasoning, granting the model additional “thinking” depth before each prediction — realizing a “deliberate before predict” modeling paradigm.
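A minimal sketch of this loss masking, in plain Python, is shown below. It assumes per-position logits are already available as lists, and it skips the final token because its target lies outside the sequence; the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def hidden_decoding_loss(logits_per_position, token_ids, n):
    """Mean cross-entropy over the last-embedding positions only."""
    losses = []
    for i in range(len(token_ids) - 1):       # 0-based token index
        t = i * n + (n - 1)                   # position of E_n for token i
        losses.append(cross_entropy(logits_per_position[t], token_ids[i + 1]))
    return sum(losses) / len(losses)


vocab = 4
token_ids = [1, 2, 3]
logits = [[0.0] * vocab for _ in range(len(token_ids) * 2)]
loss = hidden_decoding_loss(logits, token_ids, n=2)
print(loss)  # uniform logits give log(vocab)
```

Logits at the first $n-1$ positions of each token never enter the loss, so those positions are shaped only through the attention they provide to later positions.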

Progressive Expansion

In practice, high-factor models do not need to be trained from scratch. We adopt a Cyclic Replication Initialization strategy that allows the model to progressively expand from a converged lower-factor checkpoint to a higher factor.

Specifically, when a model already has $n$ Embedding matrices $(E_1, E_2, \dots, E_n)$ and we wish to expand it to $2n$, the $n$ newly added Embeddings directly reuse the weights of the existing matrices, i.e., $E_{n+k} \leftarrow E_k \ (1 \le k \le n)$: $$ (E_1, E_2, \dots, E_n) \xrightarrow{\text{expand to } 2n} (E_1, \dots, E_n, E_1, \dots, E_n) $$

Common expansion paths include:

$n=2 \to 4$: weights initialized as $(E_1, E_2, E_1, E_2)$
$n=4 \to 8$: weights initialized as $(E_1, E_2, E_3, E_4, E_1, E_2, E_3, E_4)$

This strategy ensures that the newly added Embeddings behave consistently with the existing ones at the beginning of training, enabling the model to smoothly transition and quickly adapt to the longer sequences introduced by a higher expansion factor while inheriting its existing capabilities. In addition, when expanding the sequence length, it is only necessary to adjust the RoPE base accordingly to accommodate the enlarged positional encoding range.
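The replication rule $E_{n+k} \leftarrow E_k$ can be sketched as below. The helper `expand_cyclic` is hypothetical; it also accepts targets other than $2n$ by wrapping around cyclically, which is an assumption beyond the doubling case described here.

```python
import copy


def expand_cyclic(tables, new_n):
    """Grow a list of embedding tables to new_n entries by cyclic copying."""
    n = len(tables)
    if new_n < n:
        raise ValueError("new_n must be at least the current factor")
    expanded = []
    for k in range(new_n):
        if k < n:
            expanded.append(tables[k])                     # keep existing table
        else:
            expanded.append(copy.deepcopy(tables[k % n]))  # independent copy
    return expanded


# Tables are stand-in lists labelled by name for readability.
two = [["E1"], ["E2"]]
four = expand_cyclic(two, 4)
print([t[0] for t in four])
```

Copies are deep so that each new table can diverge from its source once training resumes.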

Experiments

We compare different mixing strategies against the MoE train-from-scratch baseline, and validate the effectiveness of progressive expansion on both Dense and larger-scale MoE models.

Ablation Results

Preserving the generative freedom of reasoning embeddings is critical to the model’s final performance. We compared two variants that both impose additional constraints on the “thinking” process in the latent space. Both still reach a lower training loss than the baseline, but at the same $n=2$ they trail Hidden Decoding in training loss and on every benchmark in Table 1:

All Token Loss: The cross-entropy loss for predicting the next token is computed at every position $S_t = E_k(x_i) \ (1 \le k \le n)$ in the sequence. This forces all intermediate computation steps (e.g., $E_1, \dots, E_{n-1}$) to prematurely fit the final output, stripping them of their freedom to serve as pure implicit-reasoning states.
Sum: Instead of relying solely on the output of the last pass, the output states at all Embedding positions of the same token are summed before computing the loss. This forcibly couples the outputs across stages, erasing the progressive relationship between different depths of context and weakening the ability of $E_n$ to aggregate the final reasoning result.

Figure 2: Illustration of All Token Loss and Sum variants
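The difference between the three settings comes down to which output states are scored. The sketch below is one reading of the descriptions above, with output states as plain lists; whether the Sum variant adds hidden states or logits is not specified in the text, so summing state vectors here is an assumption, as is the function name.

```python
def prediction_states(hidden, n, variant):
    """Select or combine per-position output states that receive a loss."""
    num_tokens = len(hidden) // n
    if variant == "hidden_decoding":
        # Only the last embedding position of each token is scored.
        return [hidden[i * n + n - 1] for i in range(num_tokens)]
    if variant == "all_token_loss":
        # Every position is scored against the next token.
        return list(hidden)
    if variant == "sum":
        # The n output states of one token are added element-wise.
        return [
            [sum(vals) for vals in zip(*hidden[i * n:(i + 1) * n])]
            for i in range(num_tokens)
        ]
    raise ValueError(f"unknown variant: {variant}")


hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
for name in ("hidden_decoding", "all_token_loss", "sum"):
    print(name, prediction_states(hidden, n=2, variant=name))
```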

28 layers, 6B parameters, 0.6B activated parameters, 209B tokens, AdamW, batch size = 1024, max_lr=8e-4

Method	Loss	MMLU	ARC-C	C-Eval	CMMLU
(1) Baseline	1.908	53.4	68.7	58.4	61.2
(2) All Token Loss, n=2	1.880	55.5	65.8	59.1	63.2
(3) Sum, n=2	1.877	55.6	71.9	61.0	63.3
(4) Hidden Decoding, n=2	1.874	56.5	74.3	61.2	63.7
(5) Hidden Decoding, n=3	1.857	58.5	75.7	62.5	65.9

Table 1: Ablation results on the 6B configuration.

Progressive Expansion on Larger Models

We introduce progressive expansion during the 1.4T high-quality-token annealing phase of WeLM-V4-80B-A3B, with training data and hyperparameters fully aligned. Since no context extension was performed, the context length remains at 4k; during the SFT phase, it is extended to 16k via methods such as YaRN (Peng et al. 2024).

In terms of the training schedule, we perform 2×, 4×, and 8× progressive expansions at 0B, 503B, and 906B tokens, respectively. For the 4× and 8× expansions, the RoPE base is extended to 100k and 500k, respectively. The training loss curve shows rapid convergence, and pretraining benchmarks demonstrate significant improvements.

Figure 3: WeLM-V4-80B-A3B train/loss
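The schedule can be written down as a small lookup, sketched below. Two points are assumptions: the RoPE base for the 2× stage is not stated and is left as `None`, and the 4k context is read as 4096 original tokens, so that the physical sequence length is that number times the expansion factor.

```python
SCHEDULE = [
    # (start in billions of tokens, expansion factor, RoPE base)
    (0, 2, None),        # RoPE base for the 2x stage is not stated
    (503, 4, 100_000),
    (906, 8, 500_000),
]


def stage_at(tokens_b):
    """Return (expansion factor, RoPE base) in effect after tokens_b billion tokens."""
    current = None
    for start, factor, rope_base in SCHEDULE:
        if tokens_b >= start:
            current = (factor, rope_base)
    return current


def physical_length(context_tokens, factor):
    """Sequence length seen by the Transformer after interleaving."""
    return context_tokens * factor


for tokens_b in (0, 600, 1000):
    factor, rope_base = stage_at(tokens_b)
    print(tokens_b, factor, rope_base, physical_length(4096, factor))
```

Under these assumptions the physical sequence grows from 8192 to 32768 positions across the three stages, which is consistent with enlarging the RoPE base at the 4× and 8× steps.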

Benchmark	# Shots	80B Baseline	80B scale n=2	80B scale n=4	80B scale n=8
Activated params per token	-	3B	3B	3B	3B
Total params	-	80B	80B	80B	80B
Embedding params	-	6.1B	12.1B	24.2B	48.4B
Training tokens	-	1.37T	503B	906B	1.01T
Pile-test (BPB)	-	0.386	0.387	0.382	0.378
BBH (EM)	3-shot	87.5	88.3	90.0	90.6
MMLU (EM)	5-shot	85.1	85.0	86.7	87.5
C-Eval (EM)	5-shot	88.8	88.9	89.9	89.5
SimpleQA (LLM-judge)	5-shot	16.7	15.1	17.2	18.4
Chinese SimpleQA (LLM-judge)	5-shot	60.7	62.2	63.8	64.7
HumanEval+ (Pass@1)	1-shot	61.0	61.0	59.1	62.2
MBPP+ (Pass@1)	1-shot	67.7	64.4	70.2	71.2
MATH (LLM-judge)	4-shot	60.4	58.4	71.4	71.7

Table 2: Progressive expansion results on the 80B configuration.

Similar to depth expansion (Kim et al. 2023) of models, the Hidden Decoding model exhibits output patterns that differ from the base checkpoint. We compared the token-level output entropy of the Baseline and scale n=8 variants across different domains, with results summarized below:

Domain	Model	Avg. Entropy	Entropy Std.
Chinese	80B-A3B	0.8009	0.4369
Chinese	80B-A3B-Scale8	0.5674	0.4139
English	80B-A3B	0.9882	0.4605
English	80B-A3B-Scale8	0.7638	0.5629
Code	80B-A3B	0.6417	0.2426
Code	80B-A3B-Scale8	0.3389	0.3351
Math	80B-A3B	0.2983	0.2117
Math	80B-A3B-Scale8	0.2544	0.1898

Table 3: Comparison of model output entropy in different domains.
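For reference, token-level entropy statistics of this kind can be computed as sketched below. The unit (nats, via the natural logarithm) and the use of the population standard deviation are assumptions, since the measurement details are not given; the function names are illustrative.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_stats(distributions):
    """Mean and population standard deviation of per-token entropies."""
    values = [token_entropy(p) for p in distributions]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


# Two toy distributions: one fully confident, one maximally uncertain.
mean, std = entropy_stats([[1.0, 0.0], [0.5, 0.5]])
print(mean, std)
```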

Across all four domains, the Hidden Decoding model exhibits lower average entropy, indicating more concentrated (less diverse) output distributions, while the standard deviation shows no uniform trend: it falls for Chinese and Math and rises for English and Code. In past practice, converting output diversity into pass@1 improvements has typically required intensive RL; the progressive expansion of Hidden Decoding offers a more direct path for this conversion. In practice, after simple SFT, the Hidden Decoding model demonstrates remarkable literary creativity, as shown in the examples below:

Output Case 1: Classical Chinese prose

Prompt

以汾河为题赋一篇陆机风格的骈文 (English: Compose a piece of parallel prose in the style of Lu Ji on the theme of the Fen River.)

Response


# 汾 河 赋
---
惟大晋之御宇，迈三五而齐声。眷中条之奥壤，览汾水之英精。尔其发源管涔，导脉云中；北临燕代之塞，南注孟门之洪。经平陶而迤逦，历晋阳而溶溶。控汾浍之流派，带霍太之崇峰。斯固唐叔之旧壤，有周之所经营者也。

若乃春阳载候，青阳启节。冰泮霞舒，川容镜澈。鸿雁嗷以遵渚，鲔鲂跃而吹沫。原隰骧首以怀新，畎浍含润而流泽。及其朱明戒序，祝融司方；绿杨荫渚，翠荇分塘。渔讴互起于烟浦，菱歌间发于回塘。虽河渭之广博，曾何足以方。

若夫金商应律，玉琯飞灰。霜凄蔓草，风劲高槐。天吴怵以屏迹，阳侯骇而潜隈。于是旅人怨别，思妇含悲；望归云之悠悠，听朔吹之凄其。感三闾之遗叹，愍中林之变衰。彼川流之不辍，岂人事之足追。

尔其控引群渎，朝宗于河。溉汾阴之沃野，浸涑水之嘉禾。盐池映其左纡，首阳峙其右阿。瞻姑射之神岫，想唐尧之逝波。美哉洋洋，实晋国之巨浸也。
---
* 仿陆机《文赋》《豪士赋》骈俪之体，辞尚敷陈，义取比兴。

Hidden Decoding Scale N=4, non-thinking

On traditional Dense models, Hidden Decoding also delivers substantial improvements. In practice, we used Qwen3-8B-Base (Yang et al. 2025) as the baseline and similarly performed progressive expansion. The results are summarized below:

Benchmark	# Shots	8B Baseline	8B scale n=2	8B scale n=4	8B scale n=8
Total params	-	8B	8B	8B	8B
Embedding params	-	1.2B	1.9B	3.1B	5.6B
Training tokens	-	180B	75B	150B	187B
BBH (EM)	3-shot	78.8	81.3	83.0	83.9
MMLU (EM)	5-shot	79.8	80.9	81.9	82.2
MBPP+ (Pass@1)	1-shot	66.7	69.4	68.7	69.4
MATH (LLM-judge)	4-shot	56.0	58.2	60.0	61.1
ARC-C	25-shot	93.9	94.3	94.4	94.7
Hellaswag	10-shot	79.7	83.1	85.0	85.3
GSM8K	4-shot	92.5	93.3	93.9	94.6
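The Embedding params row can be checked with simple arithmetic. The sketch below uses a vocabulary size of 151,936 and a hidden size of 4,096 from the public Qwen3-8B configuration, and assumes the row counts the $n$ input embedding matrices plus one untied LM head; none of these figures is stated in this post, so treat this as one accounting that happens to match.

```python
VOCAB_SIZE = 151_936   # public Qwen3-8B config value, not stated in this post
HIDDEN_SIZE = 4_096    # likewise an outside assumption


def embedding_params(n, untied_lm_head=True):
    """Parameters in n input embedding matrices plus an optional LM head."""
    per_matrix = VOCAB_SIZE * HIDDEN_SIZE
    matrices = n + (1 if untied_lm_head else 0)
    return per_matrix * matrices


for n in (1, 2, 4, 8):
    print(n, round(embedding_params(n) / 1e9, 1))
```

Each added embedding matrix costs roughly 0.62B parameters under these assumptions, which gives 1.2B, 1.9B, 3.1B, and 5.6B for the baseline and for $n = 2, 4, 8$.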

Conclusion and Future Work

This post introduces a novel scaling paradigm — Hidden Decoding. By introducing parallel expansion along the sequence-length dimension, the model achieves a multiplicative increase in effective computation per forward pass without adding parameters to the Transformer backbone, all at high arithmetic intensity. Combined with the Cyclic Replication Initialization strategy, we demonstrate that models can smoothly transition from a converged baseline and achieve consistent loss reductions along with significant improvements on core downstream benchmarks across multiple parameter scales (spanning both Dense and MoE architectures).

Looking ahead, we plan to explore Hidden Decoding further along the following directions:

Integration with explicit reasoning (CoT): Implicit latent-space reasoning and explicit natural-language reasoning are highly complementary in terms of model planning capability. We plan to investigate how to effectively combine these two mechanisms to reach higher capability ceilings.
Inference acceleration and engineering optimization: Although Hidden Decoding can stack reasoning depth at extremely low cost during training and controls the increase in inference latency through parallel computation, the multiplicative growth in sequence length inevitably increases inference latency for long-context scenarios. Future work may leverage low-level kernel optimization, dynamic token management, KV cache sharing (Sun et al. 2024; Wu et al. 2024), and compressed attention to further improve online deployment throughput.
Multimodal and broader application scenarios: Further explore the potential and boundaries of leveraging multiple independent Embeddings for cross-modal feature alignment (e.g., implicit alignment of visual and audio features) and in ultra-long-context settings.

References

Dai, Damai, Chengqi Deng, Chenggang Zhao, et al. 2024. “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models.” arXiv Preprint arXiv:2401.06066.

DeepSeek-AI. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv Preprint arXiv:2501.12948.

Giannou, Angeliki, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. 2023. “Looped Transformers as Programmable Computers.” arXiv Preprint arXiv:2301.13196.

Goyal, Sachin, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. “Think Before You Speak: Training Language Models with Pause Tokens.” The Twelfth International Conference on Learning Representations.

Hao, Shibo, Sainbayar Sukhbaatar, DiJia Su, et al. 2024. “Training Large Language Models to Reason in a Continuous Latent Space.” arXiv Preprint arXiv:2412.06769.

Kim, Dahyun, Chanjun Park, Sanghoon Kim, et al. 2023. “SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling.” arXiv Preprint arXiv:2312.15166.

Peng, Bowen, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. “YaRN: Efficient Context Window Extension of Large Language Models.” The Twelfth International Conference on Learning Representations.

Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.” arXiv Preprint arXiv:2408.03314.

Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” Neurocomputing 568: 127063.

Sun, Yutao, Li Dong, Yi Zhu, et al. 2024. “You Only Cache Once: Decoder-Decoder Architectures for Language Models.” arXiv Preprint arXiv:2405.05254.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35: 24824–37.

Wu, Yifan, Jinda Li, Hantian Mao, et al. 2024. “KVSharer: Efficient Inference via Layer-Wise Dissimilarity-Based KV Cache Sharing.” arXiv Preprint arXiv:2410.16584.

Yang, An, Anfeng Li, Baosong Yang, et al. 2025. Qwen3 Technical Report. https://arxiv.org/abs/2505.09388.