Lin Junyang's departure from Alibaba and his first post: The era of intelligent agents is arriving

金色财经_

2026-03-27 10:16:08

Author: Lin Junyang, former head of Qwen at Tongyi, the youngest P10 at Alibaba. Left Alibaba in March 2026.

Original Title: “From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”

The past two years have reshaped the way we evaluate models and our expectations of them. OpenAI’s o1 has demonstrated that “thinking” can become a first-class capability, a skill that you can specifically train for and make available to users. DeepSeek-R1 has proven that this style of reasoning-based fine-tuning can be replicated and scaled well beyond the top labs initially. OpenAI describes o1 as a model trained using reinforcement learning that “thinks before answering”; while DeepSeek positions R1 as an open-source reasoning model capable of competing directly with o1.

That phase was significant. However, the focus in the first half of 2025 was primarily on “reasoning thinking”: how to enable models to invest more reasoning computational power, how to train them with stronger reward signals, and how to present or control this additional reasoning investment. The question now is, what comes next? I believe the answer is “agentic thinking”: thinking to act, continuously updating plans based on feedback from the real world during interactions with the environment.

What the Rise of o1 and R1 Truly Taught Us

The first wave of reasoning models taught us that if we want to expand reinforcement learning (RL) within language models, we need deterministic, stable, and scalable feedback signals. Fields like mathematics, code, and logic have become core because, in these scenarios, the reward signals are much stronger than conventional preference supervision. They allow reinforcement learning to optimize for “correctness” rather than “plausibility.” Infrastructure then became a top priority.

Once models are trained to reason through longer trajectories, reinforcement learning is no longer just a lightweight add-on to supervised fine-tuning (SFT). It turns into a complex system problem. You need large-scale rollout strategies, high-throughput validation mechanisms, stable policy updates, and efficient sampling capabilities. The emergence of reasoning models is not only a breakthrough in modeling capabilities but also a victory for infrastructure engineering. OpenAI describes o1 as a reasoning product line trained with RL, while DeepSeek R1 later confirmed this direction, demonstrating how vast the dedicated algorithms and foundational work needed for reasoning-based RL is. This marks the industry’s first significant shift: from scaling pre-training to scaling post-training for reinforced reasoning capabilities.

The Real Challenge Has Never Been Just “Merging Thinking and Instructions”

At the beginning of 2025, many in our Qwen team had a grand vision: the ideal system should unify the two modes of “thinking” and “instructions.” It should support adjustable reasoning intensity, mentally similar to “low/medium/high” settings for reasoning. Better yet, it should be able to automatically infer the required amount of reasoning based on prompts and context, thereby deciding when to respond immediately, when to think a little longer, and when to invest massive computational resources into truly challenging problems.

Conceptually, this is the right direction. Qwen3 is one of the clearest public attempts. It introduced a “hybrid thinking mode” that balances thinking and non-thinking behaviors within the same model series, emphasizes a controllable thinking budget, and describes a four-stage post-training pipeline—explicitly including “thinking mode integration” after long-chain reasoning (long-CoT) cold starts and reasoning RL.

However, merging these two modes is easier said than done. The difficulty lies in the data. When people talk about merging thinking and instructions, the first consideration often revolves around model compatibility: can a checkpoint support both modes? Can a chat template seamlessly switch between the two? Can the service stack provide the necessary control switches? But the deeper contradiction lies in the inherent differences between the data distributions and behavioral goals of these two modes.

In trying to balance “model merging” with “improving post-training data quality and diversity,” we have stumbled into some pitfalls. During our review process, we closely observed how users actually utilized the thinking and instruction modes in real scenarios. A strong instruction model typically receives rewards from direct, concise, format-following outputs, and maintaining extremely low latency on repetitive, large-volume enterprise tasks (such as rewriting, labeling, templating support, structured extraction, and operational Q&A). In contrast, a strong thinking model is rewarded for consuming more tokens on difficult problems, maintaining internal logical coherence, exploring alternative paths, and retaining sufficient internal computation to substantively enhance final accuracy.

These two behavioral modes often constrain each other. If the merged data isn’t carefully crafted, the result can be unappealing: “thinking” behavior becomes noisy, bloated, or indecisive; while “instruction” behavior loses its decisiveness, reliability declines, and the cost of use far exceeds commercial users’ actual expectations.

Thus, in practice, separating the two remains attractive. Later in 2025, following Qwen3’s initial hybrid architecture, the 2507 product line released distinctly separate Instruct (instruction) and Thinking (thinking) updates, including independent 30B and 235B variants. In commercial deployments, many clients still aspire to achieve high throughput, low cost, and highly controllable instruction behavior for batch operations. In these scenarios, merging does not yield significant benefits. Separating the two product lines instead allows the team to address the specific data and training challenges unique to each mode more purely.

Other labs have chosen the opposite route. Anthropic publicly advocates for an integrated model concept: Claude 3.7 Sonnet is positioned as a hybrid reasoning model, allowing users to choose between conventional responses or extended thinking, and API users can set thinking budgets. Anthropic has explicitly stated that they believe reasoning should be an embedded integrated capability rather than a detached independent model. GLM-4.5 also markets itself as a hybrid reasoning model combining both modes, attempting to integrate reasoning, coding, and agency capabilities; DeepSeek has since launched the V3.1 “thinking and non-thinking” hybrid reasoning mechanism.

The core issue here is whether this integration is natural and organic. If thinking and instructions are merely forcefully crammed into the same model weights, appearing as two awkwardly stitched independent personas, the product experience will still feel very disjointed. Truly successful integration requires a smooth range of reasoning investment. The model should be able to express different levels of investment intensity, and ideally, make adaptive choices. GPT-style intensity control precisely points this out: it is a strategy for distributing computational resources, not just a simple binary switch.

Why Anthropic’s Direction is a Beneficial Correction

Anthropic’s public communications during the release of Claude 3.7 and Claude 4 have been quite restrained. They focused on integrated reasoning, user-controllable thinking budgets, real-world tasks, coding quality, and subsequently launched capabilities for tool invocation during extended thinking periods. Claude 3.7 is showcased as a budget-controllable hybrid reasoning model; Claude 4 goes further, allowing the reasoning process to intertwine with tool calls. Meanwhile, Anthropic repeatedly emphasizes that coding, long-duration tasks, and agent workflows are their core objectives.

Simply generating longer reasoning trajectories does not automatically make the model smarter. In many cases, excessive exposure to the reasoning process instead reveals inefficiencies in computational resource allocation. If a model tries to reason through everything in the same lengthy manner, it indicates a failure to prioritize, to distill information, or an inability to take real action. Anthropic’s development trajectory conveys a more disciplined perspective: thinking should be shaped by the target workload. If the goal is coding, then the value of thinking should be reflected in codebase navigation, planning, task decomposition, error recovery, and tool orchestration. If the goal is agent workflows, then thinking should focus on improving execution quality over long periods, rather than composing a flowery interim discourse.

This emphasis on “goal utility” points to a broader trend: we are transitioning from an era of training models to an era of training agents. We also made this clear in our Qwen3 blog—“We are transitioning from an era focused on training models to an era centered on training agents,” linking future RL breakthroughs to the environmental feedback required for long-term reasoning. An “agent” is defined as a system capable of formulating plans, deciding when to act, invoking tools, perceiving environmental feedback, adjusting strategies, and operating continuously over extended periods. Its essential definition lies in the closed-loop interaction with the real world.

What “Agentic Thinking” Truly Means

Agentic thinking is a distinctly different optimization goal. The criterion for assessing “reasoning thinking” typically concerns the quality of internal deliberation before arriving at the final answer: can the model solve theorems, write proofs, generate bug-free code, or pass benchmark tests? In contrast, the standard for evaluating “agentic thinking” is whether the model can continuously achieve substantive progress while interacting with the environment.

The core question shifts from “Does the model think long enough?” to “Is the way the model thinks sufficient to support effective action?” Agentic thinking must address several challenges that purely reasoning models can largely avoid:

a. Deciding when to stop thinking and take action
b. Choosing which tool to invoke and in what order
c. Integrating noisy or incomplete observations from the environment
d. Adjusting plans after encountering failures
e. Maintaining logical coherence in multi-turn dialogues and multiple tool invocations

In short, models with agentic thinking must reason through action.

Why the Infrastructure Difficulty of Agentic Reinforcement Learning is Greater

Once the goal shifts from “solving benchmark test questions” to “completing interactive tasks,” the RL tech stack undergoes a dramatic change. The infrastructure traditionally used for reasoning RL is no longer sufficient. In reasoning RL, you can typically treat rollouts as relatively independent trajectories, equipped with clear evaluators. However, in agentic RL, strategies are deeply embedded within a vast supporting framework: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static referee; it becomes an inseparable part of the entire training system.

This gives rise to a completely new system-level demand: training and reasoning must be decoupled more thoroughly. Without this decoupling, the throughput of rollouts will directly collapse. Imagine a coding agent that must run its generated code within a real-time testing framework: the reasoning side would be forced to pause while waiting for execution feedback, while the training side would starve due to the lack of complete trajectory data, leading to GPU utilization across the entire pipeline being far below traditional reasoning RL levels. If you add in tool delays, local observability, and stateful environments, these inefficiencies would be further magnified. The result is that long before you reach the expected capability metrics, the entire experiment’s progress becomes extremely slow and painful.

The environment itself therefore rises to become a core research product. In the SFT (supervised fine-tuning) era, we have eagerly pursued data diversity. In the agent era, we should focus intensely on the quality of the environment: stability, realism, scenario coverage, difficulty gradients, state diversity, feedback richness, anti-cheating capabilities, and the scalability of rollout generation. Building virtual environments has become a truly hardcore entrepreneurial track rather than a side project. If agents are destined to be trained under conditions similar to production environments, then the environment itself is a part of the core capability technology stack.

The Next Frontier: More Practical Thinking Power

My personal expectation is that agentic thinking will become the dominant form of thinking in the future. I believe it will ultimately eliminate most outdated “static monologue” reasoning thinking—that overly lengthy, isolated, closed-off approach that attempts to mask a lack of interactive capability through an abundance of text. Even in the face of extremely difficult mathematical or coding tasks, a truly advanced system should have the rights to search, simulate, execute, check, verify, and modify. Our ultimate goal is to robustly and efficiently solve real-world problems.

The biggest pain point in training such systems is “reward hacking.” Once a model gains substantial tool access, reward hacking can become extremely destructive. A model with search functionality might directly learn to go online for answers during RL training. A coding agent might exploit undisclosed future information in the codebase, misuse logs, or find some shortcut that directly invalidates the task. An environment with hidden flaws may make the model’s strategies appear exceptional, but in reality, it has merely trained a cheat expert. Compared to the reasoning era, the agent era’s landscape is far more delicate and dangerous. More powerful tools make models more useful, but they also exponentially amplify the attack surface for false optimization. We can fully anticipate that the next severe academic bottleneck will arise in environment design, the robustness of evaluators, anti-cheating protocols, and establishing more normative interface standards between strategies and the physical world. Despite numerous challenges, the overall direction is unwavering: tool-empowered thinking is inherently more valuable than introspective thinking and is more likely to result in genuine leaps in productivity.

Agentic thinking also signifies the rise of “harness engineering.” Future core intelligence will increasingly rely on the collaborative organization of multiple agents: a central orchestrator responsible for planning and scheduling tasks, specialized agents serving as domain experts, and sub-agents responsible for executing vertically segmented tasks (they not only perform tasks but can also help control context, avoid memory contamination, and maintain physical separation between different levels of thinking). The future of the industry is shifting from training models to training agents, and ultimately moving towards training vast systems.

Conclusion

The first phase of the reasoning wave established an ironclad rule: as long as the feedback signals are sufficiently reliable and the infrastructure can support it, layering reinforcement learning on top of language models can lead to transformative cognitive abilities.

A more profound industry shift is underway, transitioning from “reasoning thinking” to “agentic thinking”: from simply thinking a little longer to thinking in order to take action. The core training targets have shifted. It is no longer just the model itself but a symbiotic system of “model + environment,” more specifically, the agent and its surrounding supporting framework. This fundamentally overturns our understanding of what constitutes a “core research product”: while model architecture and training data are indeed important, environment design, the infrastructure for rollout strategies, the resilience of evaluators against interference, and the underlying interfaces for multi-agent collaboration will be elevated to equal or even higher status. It also redefines what constitutes “good thinking”: true “good” refers to thought trajectories that most effectively support action under the various constraints of the real world, rather than merely competing over who generates the longest text or whose reasoning process is the most ostentatious.

This also alters the logic of competitive advantages in future business. In the reasoning era, whoever has better RL algorithms, purer feedback signals, and more scalable training pipelines will win. In the agent era, the trump card will become who has the most realistic environments, the smoothest “training-reasoning integration” architecture, superior harness engineering capabilities, and who can most perfectly close the critical feedback loop between “model decisions” and “the real consequences of those decisions.”

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.