Jensen Huang's Token Economics
TechObserver Reporter Zheng Chenye
Known as the annual industry benchmark for AI, NVIDIA’s GTC Conference was held from March 16 to 19 in San Jose, California.
On the morning of March 16, local time—2 a.m. Beijing time—NVIDIA CEO Jensen Huang delivered a keynote speech at the San Jose SAP Center that lasted over two hours.
During his speech, Huang predicted that global demand for AI infrastructure will reach $1 trillion by 2027. He also mentioned that actual demand could be much higher, and NVIDIA’s products might even be in short supply.
After this announcement, NVIDIA’s US stock price surged by over 4% instantly. However, a few hours later, when the A-share market opened, stocks in the computing industry chain collectively declined. Tianfu Communication (300394.SZ) closed down over 10%, and Changguang Huaxin (688048.SH) also fell by 9.72%, erasing nearly five days of gains for most leading stocks.
On one side is a trillion-dollar expectation; on the other, plummeting industry-chain stocks. The disparity stems from the two operating on different time scales.
Huang was discussing future demand expectations, but his next-generation Feynman chip architecture, scheduled for release in 2028, is still years away. Additionally, Wanjia Securities released a research report on March 16 indicating that the average P/E ratio of the A-share electronics sector was about 82 times as of March 15, suggesting market concerns about “high valuations.”
However, what’s more noteworthy about Huang’s speech isn’t the $1 trillion figure itself but the new business logic he presented over two hours: data centers are shifting from training models to becoming factories that produce tokens.
Tokens are the fundamental units of information processed by large language models; they can be loosely understood as the chunks of text an AI reads or generates. One Chinese character corresponds to roughly one to two tokens.
In the past two years, token consumption has experienced several significant jumps.
Huang traced this trend to three key moments: at the end of 2022, ChatGPT launched, and AI that could generate content sharply increased token consumption; with the release of OpenAI's reasoning models, AI learned to reason and reflect, generating large volumes of tokens internally for self-assessment; and following the release of Claude Code (an AI programming tool developed by Anthropic), AI could read files, write code, and run tests, with each task consuming several orders of magnitude more tokens than a simple conversation.
Huang mentioned that all NVIDIA software engineers are using AI to assist with programming.
AI work involves two stages: training, which makes the model smarter and requires a large upfront investment; and inference, which is the model performing tasks daily, with increasing demand. Historically, global GPU (graphics processing units, the core hardware for AI computation) purchases were mainly for training, but now the focus is shifting toward inference.
Huang said that the business scale of inference service providers has grown 100 times in the past year. IDC China analyst Du Yunlong also told TechObserver that domestic inference servers are now growing faster than training servers, with inference accounting for nearly 60% of server shipments by revenue.
While inference demand is exploding, token pricing has yet to be established in the market.
Huang outlined five future pricing tiers: a free tier with high token output but slow response; a mid-tier at about $3 per million tokens; a premium tier at about $6 per million tokens; a high-speed tier at around $45 per million tokens; and a top-tier at approximately $150 per million tokens. Larger models, longer contexts, and faster responses make tokens more expensive.
He used the top tier as an example: a research team consuming 50 million tokens daily would spend only about $7,500 at $150 per million tokens, a modest sum for an enterprise. And once the context window expands from 32K to 400K tokens, AI can read an entire contract or codebase in one pass, enabling tasks that previously could not be done at any price.
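The tier prices and the $7,500 worked example above can be sketched as a small calculation; the tier names below are placeholders, and only the dollar figures come from the speech.

```python
# Tier prices quoted in the keynote, in US dollars per million tokens.
# Tier names are illustrative labels, not NVIDIA's terminology.
PRICE_PER_MILLION = {
    "free": 0.0,        # high token output, slow response
    "mid": 3.0,
    "premium": 6.0,
    "high_speed": 45.0,
    "top": 150.0,
}

def daily_cost(tokens: int, tier: str) -> float:
    """Dollar cost of consuming `tokens` tokens in one day at `tier`."""
    return tokens / 1_000_000 * PRICE_PER_MILLION[tier]

# Huang's example: 50 million tokens per day at the top tier.
print(daily_cost(50_000_000, "top"))  # 7500.0
```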
With tiered pricing, the economics of data centers change.
Huang explained that every data center is limited by power. A 1GW (gigawatt) data center will never become 2GW due to power and land constraints. Under fixed power, the data center that consumes the most tokens per watt has the lowest production cost. In other words, the more tokens produced with the same electricity, the more profit.
He presented figures showing that a 1GW data center, allocating compute power across different price tiers, could generate annual revenues of about $30 billion with NVIDIA’s current Blackwell architecture, approximately $150 billion with the new Vera Rubin architecture, and up to $300 billion with Groq’s inference accelerators. The same data center, with different equipment, could see revenue differences of up to tenfold.
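The fixed-power logic can be made concrete: with power capped, annual revenue scales with how many tokens each watt produces. The revenue figures are the ones Huang quoted for a 1GW data center; the per-watt multipliers derived from them are an illustrative back-calculation, not NVIDIA's numbers.

```python
# Fixed-power economics sketch: power is the constraint, so revenue
# scales with token output per watt. Baseline is the quoted ~$30B/year
# for a 1 GW Blackwell data center; multipliers are illustrative.
BLACKWELL_BASELINE_REVENUE_B = 30  # billions of USD per year

def annual_revenue_b(tokens_per_watt_multiplier: float) -> float:
    """Annual revenue (billions USD) if token output per watt scales
    by the given factor over Blackwell, with power held fixed."""
    return BLACKWELL_BASELINE_REVENUE_B * tokens_per_watt_multiplier

print(annual_revenue_b(5))   # ~150, the quoted Vera Rubin figure
print(annual_revenue_b(10))  # ~300, the quoted Groq-accelerator figure
```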
NVIDIA’s full-year revenue for fiscal 2026 is $215.9 billion, with data center contributing $193.7 billion.
According to Huang’s logic, existing data centers are underutilized; upgrading to new-generation equipment under the same power conditions could multiply revenue several times. The trillion-dollar expectation isn’t due to chip price increases but because the same electricity can produce more and higher-value tokens.
Huang said that in the future, every CEO will focus on the efficiency of their token factories, as that directly correlates with revenue.
He also described a trend emerging in Silicon Valley: more engineers are using AI daily for coding, research, and document processing—all of which consume tokens. Companies will need to budget for these AI-related expenses.
Huang predicted that this expense will become so significant that it will require dedicated budgets, similar to how companies allocate funds for computers and software for employees.
He further stated that each engineer will receive an annual token budget upon hiring, roughly equivalent to half their base salary.
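As a back-of-the-envelope check on that claim: at half of base salary, a token budget goes a long way even at the top tier. The salary figure below is a hypothetical example, not from the speech; $150 per million tokens is the quoted top-tier price.

```python
# Sketch of Huang's "token budget" idea: budget = half of base salary.
# The $200,000 salary is a hypothetical example for illustration.
def annual_token_budget(base_salary: float,
                        price_per_million: float = 150.0):
    """Return (budget in dollars, tokens purchasable at that budget)."""
    budget_dollars = base_salary / 2
    tokens = budget_dollars / price_per_million * 1_000_000
    return budget_dollars, tokens

dollars, tokens = annual_token_budget(200_000)
print(dollars, tokens)  # $100,000 buys roughly 667 million top-tier tokens
```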
Two Types of Chips
The hardware corresponding to this token economy is the Vera Rubin platform, officially announced at GTC.
Huang said that in the past, when discussing the Hopper architecture, he would hold up a chip. But Vera Rubin isn’t just a chip; it’s an entire system. This system achieves 100% liquid cooling, reducing installation time from two days to just two hours.
Vera Rubin consists of seven chips. The core rack, NVL72, integrates 72 Rubin GPUs and 36 Vera CPUs, connected via NVLink 6 (NVIDIA’s high-speed interconnect technology). Compared to the previous Blackwell generation, it offers up to 10 times higher inference throughput per watt, and the cost per token has been reduced to one-tenth.
NVIDIA also announced a new 88-core Vera CPU, optimized for AI agent scenarios involving tool invocation and data processing.
Huang said that Microsoft CEO Satya Nadella has confirmed that the first Vera Rubin racks are already running on Azure.
However, Vera Rubin has a limitation: when each user needs to generate more than 400 tokens per second, the NVL72's bandwidth becomes insufficient. To address this, NVIDIA struck a deal with Groq, a US AI-acceleration chip company founded in 2016, licensing its technology and bringing on its core team.
Groq’s LPU (Language Processing Unit) and GPU are entirely different chips. GPUs have large memory and high computing power—each Rubin GPU has 288GB of memory, suitable for complex calculations. LPUs have small but extremely fast memory—only 500MB—unable to hold full model parameters but capable of generating tokens at much higher speed and lower latency than GPUs.
NVIDIA uses a software called Dynamo to split inference into two steps: context understanding, which requires substantial compute and memory and is handled by Vera Rubin; and token generation, which is latency-sensitive and handled by Groq’s LPU. These components are connected via high-speed Ethernet, reducing latency by about half.
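The two-stage split described above can be sketched as a simple routing rule. This is an illustrative model of the idea, not NVIDIA's Dynamo API; the backend names are placeholders, and the 400 tokens-per-second threshold is the NVL72 bandwidth limit mentioned earlier.

```python
# Illustrative sketch of "decoupled inference": the compute- and
# memory-heavy context (prefill) stage runs on GPU racks, while the
# latency-sensitive token-generation (decode) stage can run on LPUs.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int            # size of the context to ingest
    tokens_per_second_needed: int # per-user generation-speed target

def route(req: Request) -> dict:
    # Prefill needs large memory and raw compute -> Vera Rubin GPUs.
    prefill_backend = "vera_rubin_gpu"
    # Decode: above ~400 tokens/s per user the NVL72 runs out of
    # bandwidth, so generation is handed off to the LPU pool.
    decode_backend = ("groq_lpu"
                      if req.tokens_per_second_needed > 400
                      else "vera_rubin_gpu")
    return {"prefill": prefill_backend, "decode": decode_backend}

print(route(Request(prompt_tokens=400_000, tokens_per_second_needed=600)))
# {'prefill': 'vera_rubin_gpu', 'decode': 'groq_lpu'}
```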
Huang calls this approach “decoupled inference,” acknowledging that high throughput and low latency are inherently conflicting. It’s better to let each chip do what it’s best at.
He said this combination delivers 35 times the performance of the previous generation at the $45 and $150 price tiers.
Over a longer timeframe, a 1GW data center could increase token generation from 22 million per second to 700 million in two years.
Huang advised clients that if their workload mainly involves high-throughput batch inference, they should fully adopt Vera Rubin; for tasks requiring extensive programming or real-time interaction, they can allocate about 25% of their data center capacity to Groq LPU.
He stated that Groq's third-generation LPU chips are manufactured by Samsung and are expected to ship in the third quarter of this year.
On the software side, NVIDIA launched the enterprise AI platform NemoClaw, supporting the popular open-source project OpenClaw. OpenClaw became the fastest-growing open-source project on GitHub within weeks, and Huang compared its importance to Linux, calling it the operating system for intelligent agent computing.
However, deploying open-source OpenClaw directly in enterprise environments poses security risks, as intelligent agents can access sensitive data, execute code, and communicate externally. NemoClaw adds an enterprise security layer to OpenClaw. Companies like Adobe, Salesforce, and SAP have announced adoption of NVIDIA’s Agent Toolkit for developing intelligent agents.
Regarding future plans, NVIDIA previewed the next-generation Feynman architecture, scheduled for release in 2028, which will support both copper cabling and CPO (co-packaged optics) interconnects.
This year also marks the 20th anniversary of CUDA, NVIDIA’s GPU computing platform, which is considered the foundation of NVIDIA’s software ecosystem. Huang mentioned that currently, 60% of NVIDIA’s business comes from the top five global cloud providers, with the remaining 40% spread across sovereign AI, enterprise, industrial, and robotics sectors.
At this GTC, NVIDIA also announced collaborations with Uber, BYD (002594), Geely, Hyundai, Nissan, and Isuzu in autonomous driving. Driven by this news, Hong Kong’s auto stocks rallied on the 17th, with Geely Auto (00175.HK) surging over 5% intraday and closing up 4.55%.