Inception Labs Launches Mercury 2, Diffusion-Based Reasoning Model Achieving Over 1,000 Tokens Per Second

2026-02-26 09:42:03

In Brief

Inception Labs has launched Mercury 2, a diffusion-based reasoning model capable of generating over 1,000 tokens per second, three times faster than comparable models.

Inception Labs, an AI startup, has launched Mercury 2, a diffusion-based Large Language Model (LLM) designed to significantly accelerate reasoning tasks in production AI applications

Unlike traditional autoregressive models that generate text sequentially, Mercury 2 uses a parallel refinement process, producing multiple tokens simultaneously and converging over a small number of steps, enabling speeds of over 1,000 tokens per second on NVIDIA Blackwell GPUs—approximately three times faster than competing models in the same price range.

The model is optimized for real-time responsiveness in complex AI workflows, where latency compounds across multiple inference calls, retrieval pipelines, and agentic loops. Mercury 2 maintains high reasoning quality while reducing latency, allowing developers, voice AI systems, search engines, and other interactive applications to operate at reasoning-grade performance without the delays associated with sequential generation. It supports features such as tunable reasoning, 128K token context windows, schema-aligned JSON output, and native tool integration, providing flexibility for a range of production deployments.

Mercury 2 Enables Low-Latency AI Across Coding, Voice, And Search Workflows

The report highlights several use cases where low-latency reasoning is critical. In coding and editing workflows, Mercury 2 delivers rapid autocomplete and next-edit suggestions that integrate seamlessly with developers’ thought processes. In agentic workflows, the model allows for more inference steps without exceeding latency budgets, improving the quality and depth of automated decision-making. Voice-based AI and interactive applications benefit from its ability to generate reasoning-quality responses within natural speech cadences, enhancing user experiences in real-time conversation scenarios. Additionally, Mercury 2 supports multi-hop search and retrieval pipelines, enabling rapid summarization, reranking, and reasoning without compromising response times.

Early adopters have noted significant improvements in throughput and user experience. Mercury 2 has been described as at least twice as fast as GPT-5.2 while maintaining competitive quality, with applications spanning real-time transcript cleanup, interactive human-computer interfaces, autonomous advertising optimization, and voice-enabled AI avatars.

The model is compatible with the OpenAI API, allowing integration into existing stacks without extensive modification, and Inception Labs offers support for enterprise evaluations, performance validation, and workload-specific deployment guidance. Mercury 2 represents a step forward in diffusion-based LLMs, redefining the balance between reasoning quality and latency in production AI environments.

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.