Nvidia unveils Rubin CPX GPU for massive-context inference for 2026

Nvidia announced the Nvidia Rubin CPX, a new class of graphics processing unit (GPU) purpose-built for massive-context processing.

This enables AI systems to handle million-token software coding and generative video with groundbreaking speed and efficiency. The new chip is coming at the end of 2026.

Rubin CPX works hand in hand with Nvidia Vera CPUs and Rubin GPUs inside the new Nvidia Vera Rubin NVL144 CPX platform. This integrated Nvidia MGX system packs 8 exaflops of AI compute to provide 7.5x more AI performance than Nvidia GB300 NVL72 systems, as well as 100TB of fast memory and 1.7 petabytes per second of memory bandwidth in a single rack. Rubin CPX is offered in other flexible configurations for customers looking to reuse existing infrastructure.

“The Vera Rubin platform will mark another leap in the frontier of AI computing, introducing both the next-generation Rubin GPU and a new category of processors called CPX,” said Jensen Huang, founder and CEO of Nvidia, in a statement. “Just as RTX revolutionized graphics and physical AI, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, where models reason across millions of tokens of knowledge at once.”

Nvidia Rubin CPX enables the highest performance and token revenue for long-context processing — far beyond what today’s systems were designed to handle. This transforms AI coding assistants from simple code-generation tools into sophisticated systems that can comprehend and optimize large-scale software projects.

“To help support this accelerated roadmap, we need to collaborate closely with our data center partners so the ecosystem can release new products at a similarly rapid pace,” said Shar Narasimha, director of product for data center at Nvidia. “We’re expanding beyond our rack scale and superpod reference design. We’re going to announce that Nvidia will release an AI factory data scale reference design, which is purpose-built to design, simulate, optimize and operate biggest scale AI factories and smaller designs.”

To process video, AI models can take up to a million tokens for an hour of content, pushing the limits of traditional GPU compute. Rubin CPX integrates video decoders and encoders, as well as long-context inference processing, in a single chip for unprecedented capabilities in long-format applications such as video search and high-quality generative video.

Built on the Nvidia Rubin architecture, the Rubin CPX GPU uses a cost‐efficient, monolithic die design packed with powerful NVFP4 computing resources and is optimized to deliver extremely high performance and energy efficiency for AI inference tasks.

Nvidia said back at the Computex event in Taiwan that it plans to launch new GPUs every year, faster than it has in past years. And so the Vera Rubin architecture, announced at the GTC keynote in March, will serve as the flagship for the vast majority of AI use cases in the future. It will be deployed in a standard liquid-cooled rack design.

“It enables AI service providers to dramatically increase their possibilities. It delivers $5 billion dollars of revenue for every $100 million invested in infrastructure, and 50 times revenue returns,” Narasimha said.

Advancements offered by Rubin CPX

Rubin CPX delivers up to 30 petaflops of compute with NVFP4 precision for the highest performance and accuracy. It features 128GB of cost-efficient GDDR7 memory to accelerate the most demanding context-based workloads. In addition, it delivers three times faster attention capabilities compared with Nvidia GB300 NVL72 systems — boosting an AI model’s ability to process longer context sequences without a drop in speed.

Rubin CPX is offered in multiple configurations, including the Vera Rubin NVL144 CPX, that can be combined with the Nvidia Quantum-X800 InfiniBand scale-out compute fabric or the Nvidia Spectrum-X Ethernet networking platform with Nvidia Spectrum-XGS Ethernet technology and Nvidia ConnectX-9 SuperNICs.

Vera Rubin NVL144 CPX enables companies to monetize at an unprecedented scale, with $5 billion in token revenue for every $100 million invested. It’s like Huang’s old saying, “The more you buy, the more you save.”
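The return implied by those figures is a single division, and it matches the 50 times figure Narasimha cites. The sketch below uses only the two dollar amounts quoted in this article and treats them as Nvidia's projections rather than measured results; `revenue_multiple` is an illustrative helper name, not anything Nvidia publishes.

```python
def revenue_multiple(token_revenue_usd, capex_usd):
    """Return projected token revenue per dollar of infrastructure spend."""
    return token_revenue_usd / capex_usd


# Figures as quoted in the article (Nvidia's claim, not an independent measurement).
multiple = revenue_multiple(5_000_000_000, 100_000_000)
print(f"{multiple:.0f}x revenue per dollar invested")
```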

Industry leaders look to Rubin CPX

AI innovators are exploring how Rubin CPX can accelerate their applications, ranging from large-scale software development to the analysis of dynamic visual content to better understand moving images.

Cursor, an AI-powered software company that offers an advanced code editor, sees the benefits of Rubin CPX to boost developer productivity with intelligent code generation and collaborative tools directly in the coding environment.

“With Nvidia Rubin CPX, Cursor will be able to deliver lightning-fast code generation and developer insights, transforming software creation,” said Michael Truell, CEO of Cursor, in a statement. “This will unlock new levels of productivity and empower users to ship ideas once out of reach.”

Runway, an American generative AI company, will use Nvidia technologies to enable creators to produce cinematic content and sophisticated visual effects with unmatched scale and efficiency.

“Video generation is rapidly advancing toward longer context and more flexible, agent-driven creative workflows,” said Cristóbal Valenzuela, CEO of Runway, in a statement. “We see Rubin CPX as a major leap in performance, supporting these demanding workloads to build more general, intelligent creative tools. This means creators — from independent artists to major studios — can gain unprecedented speed, realism and control in their work.”

Magic is an AI research and product company developing foundation models to power AI agents that can automate software engineering.

“With a 100-million-token context window, our models can see a codebase, years of interaction history, documentation and libraries in context without fine-tuning,” said Eric Steinberger, CEO of Magic, in a statement. “This enables users to coach the agent at test time through conversation and access to their environments, bringing us closer to autonomous agentic experiences. Using a GPU like Nvidia Rubin CPX greatly accelerates our compute workloads.”

Software support

Nvidia Rubin CPX will be supported by the complete Nvidia AI stack — from accelerated infrastructure to enterprise‐ready software. The Nvidia Dynamo platform efficiently scales AI inference, dramatically boosting throughput while cutting response times and model serving costs.

The processors will be able to run the latest in the Nvidia Nemotron family of multimodal models that provide state-of-the-art reasoning for enterprise-ready AI agents. For production-grade AI, Nemotron models can be delivered with Nvidia AI Enterprise, a software platform that includes Nvidia NIM microservices as well as AI frameworks, libraries and tools that enterprises can deploy on Nvidia-accelerated clouds, data centers and workstations.

Built on decades of innovation, the Rubin platform extends Nvidia’s developer ecosystem — with Nvidia CUDA‐X libraries, a community of over six million developers and nearly 6,000 CUDA applications.

Availability

Nvidia Rubin CPX is expected to be available at the end of 2026. Learn more by watching Nvidia exec Ian Buck’s keynote at AI Infra Summit on Sept. 9 at 10 a.m. Pacific time.

Shar Narasimha, director of product for data center at Nvidia, said in a press briefing that use cases are emerging that require massive context length of over a million tokens.

The vast majority of use cases have small to medium context lengths, generally under 256,000 tokens for use cases such as enterprise chatbot Q&A and paragraph summarization.

But there are other use cases with a context window of one million tokens.

“These can translate to over 100,000 lines of code, enabling AI agents to deliver and move beyond simple bug fixes in code and support advanced software applications and systems development,” Narasimha said. “100,000 tokens also represent over an hour of HD video. This enables contextually aware, temporally stable video generation, high-value AI use cases that require large context.”

To meet these demanding use cases of one million token context length and higher, a new type of context GPU is required, Narasimha said.

With traditional inference, a single GPU handles the entire AI query from a user. But with bigger workloads, the inference task splits into two different phases. One is context, or prefill, which covers understanding the user’s query and generating the first token. This phase is compute intensive.

The other is generation, or decode, which produces all the subsequent tokens. This phase is memory-bandwidth intensive. The two phases are best handled by separate GPUs.
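The two-phase split can be illustrated with a toy sketch. This is not Nvidia's Dynamo API or any real serving stack: the names `prefill`, `decode`, `serve_disaggregated`, `KVCache` and the worker labels are hypothetical, and the "model" is a stand-in that emits placeholder tokens. It shows only the handoff, in which a context worker reads the whole prompt once and returns the first token plus a cache, and a generation worker then produces the remaining tokens one at a time from that cache.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the attention cache built during prefill."""
    tokens: list = field(default_factory=list)


def prefill(prompt_tokens, worker="context-gpu"):
    """Context phase: process the full prompt once and emit the first token."""
    cache = KVCache(tokens=list(prompt_tokens))
    first_token = "out0"  # placeholder; a real model would compute this
    cache.tokens.append(first_token)
    return cache, first_token, worker


def decode(cache, max_new_tokens, worker="generation-gpu"):
    """Generation phase: emit the remaining tokens one step at a time.

    Each step appends to the cache. A real model would also read the whole
    cache at every step, which is why this phase is memory-bandwidth bound.
    """
    produced = []
    for step in range(1, max_new_tokens):
        token = f"out{step}"  # placeholder token
        cache.tokens.append(token)
        produced.append(token)
    return produced, worker


def serve_disaggregated(prompt_tokens, max_new_tokens):
    """Route one request through a context worker, then a generation worker."""
    cache, first, ctx_worker = prefill(prompt_tokens)
    rest, gen_worker = decode(cache, max_new_tokens)
    return {
        "tokens": [first] + rest,
        "prefill_on": ctx_worker,
        "decode_on": gen_worker,
        "cache_len": len(cache.tokens),
    }


result = serve_disaggregated(["fix", "this", "bug"], max_new_tokens=4)
print(result["prefill_on"], "->", result["decode_on"], result["tokens"])
```

In a real deployment the cache handoff between the two workers is the expensive part, which is one reason the interconnect matters as much as the chips themselves.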

“When we use a single GPU for this type of workload, you actually have to split the average of the configuration of the GPU so it can service both context and generation at the same time. As a result, we’re not taking the most advantage of a capable group,” Narasimha said.

“Now in disaggregated serving, we actually have two different GPUs used to serve a single prompt to improve AI factory performance per watt and ROI — a context GPU that is heavily compute optimized for the context phase and similarly, a generation GPU generation phase, with much more memory bandwidth and much more NVLink bandwidth. As a result, we’ve actually doubled the number of GPUs, but we’ve increased throughput by six times as a result.”
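The figures in that quote imply a per-GPU gain, which a few lines of arithmetic make explicit. The sketch assumes the "six times" and "doubled" figures are measured against the same single-GPU baseline, which the quote suggests but does not spell out; the function name is illustrative.

```python
def per_gpu_throughput_gain(throughput_multiplier, gpu_multiplier):
    """Return how much more throughput each GPU delivers versus the baseline."""
    return throughput_multiplier / gpu_multiplier


# 6x total throughput from 2x the GPUs, per the quote above.
gain = per_gpu_throughput_gain(6.0, 2.0)
print(f"{gain:.1f}x throughput per GPU")
```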

To do this, Nvidia figured it needed a new type of context-specific GPU. At the AI Infra Summit, Nvidia is announcing its latest GPU, the Nvidia Rubin CPX, a chip coming at the end of 2026. It’s a GPU for massive context length use cases.

“It’s purpose-built for long context performance with exceptional throughput and efficiency. It unlocks a new tier of premium use cases like intelligent coding systems and video generation,” Narasimha said. “Rubin CPX is designed for massive context length use cases. It will dramatically increase the productivity and performance of AI factories.”