Inference performance is critical, as it directly influences the economics of an AI factory, said Dave Salvatore, director of accelerated computing products at Nvidia, in a press briefing.
The higher the throughput of AI factory infrastructure, the more tokens it can produce quickly, increasing revenue, driving down total cost of ownership (TCO) and enhancing the system’s overall productivity. Salvatore said that for every $100 million invested in an AI factory, AI chips like Rubin CPX can generate $5 billion in revenue.
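As a rough illustration of that relationship, using entirely hypothetical throughput and pricing numbers rather than anything Nvidia has published, token revenue is simply tokens served multiplied by the price per token:

```python
# Hypothetical illustration of AI factory economics: token revenue scales with
# throughput, so better performance per dollar and per watt improves the return
# on the same capital investment. All numbers below are made up.

tokens_per_second = 1_000_000        # assumed aggregate factory throughput
price_per_million_tokens = 5.00      # assumed blended price in dollars
seconds_per_year = 365 * 24 * 3600

annual_tokens = tokens_per_second * seconds_per_year
annual_revenue = annual_tokens / 1_000_000 * price_per_million_tokens

print(f"Tokens served per year:  {annual_tokens:.3e}")
print(f"Token revenue per year: ${annual_revenue:,.0f}")
# Doubling tokens_per_second at the same cost roughly doubles revenue, which is
# why throughput gains feed directly into TCO and profitability.
```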
As large language models (LLMs) grow larger, they get smarter, with open models from leading developers now featuring hundreds of billions of parameters. At the same time, today’s leading models are also capable of reasoning, which means that they generate many intermediate reasoning tokens before delivering a final response to the user. The combination of these two trends—larger models that think using more tokens—drives the need for significantly higher compute performance.
Salvatore said deploying inference at scale comes with its own set of challenges. Higher performance per watt and per dollar ultimately translates into higher profit for organizations: performance drives revenue, which in turn drives profits, and both full-stack architecture and software drive performance, he said.
“Inference performance is going to drive what we think of as AI factory economics, delivering these services, delivering these applications in a cost-effective way,” Salvatore said.
Less than half a year since its debut at Nvidia GTC, the Nvidia GB300 NVL72 rack-scale
system — powered by the Nvidia Blackwell Ultra architecture — is setting records on the new reasoning inference benchmark in MLPerf Inference v5.1, delivering up to 1.4 times more DeepSeek-R1 inference throughput compared with Nvidia Blackwell-based GB200 NVL72 systems.
Blackwell Ultra builds on the success of the Blackwell architecture, featuring 1.5 times more NVFP4 AI compute and two times more attention-layer acceleration than Blackwell, as well as up to 288GB of HBM3e memory per GPU.
The Nvidia platform also set performance records on all new data center benchmarks added to the MLPerf Inference v5.1 suite — including DeepSeek-R1, Llama 3.1 405B Interactive, Llama 3.1 8B and Whisper — while continuing to hold per-GPU records on every MLPerf data center benchmark.
“When we think about performance historically, we generally think faster is better, and a lot of that is still true. But as you think about deploying inference at scale, you have to solve for multiple aspects,” Salvatore said. “Throughput, what we think of as traditional speed, absolutely is a factor. But so is responsiveness, that ability to start the answer flowing quite quickly, and of course, the quality of the answers matters tremendously. Also, avoidance of inappropriate answers, hallucinations and things like that. And then additional pieces that are very related are both energy efficiency and then somewhat related to that is cost for organizations looking to deploy AI inference.”
The result is that deploying AI inference cost-effectively requires balancing all of these factors at once.
“We’re thinking about all these different vectors as a way to really deliver both the best user experience for the most users, while allowing organizations to do that in a cost-effective way,” he said.
Stacking it all up
Full-stack co-design plays an important role in delivering these latest benchmark results.
Blackwell and Blackwell Ultra incorporate hardware acceleration for the NVFP4 data format — an Nvidia-designed 4-bit floating point format that provides better accuracy compared with other FP4 formats, as well as comparable accuracy to higher-precision formats.
Nvidia TensorRT Model Optimizer software quantized DeepSeek-R1, Llama 3.1 405B, Llama 2 70B and Llama 3.1 8B to NVFP4. In concert with the open-source Nvidia TensorRT-LLM library, this optimization enabled Blackwell and Blackwell Ultra to deliver higher performance while meeting strict accuracy requirements in submissions.
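As a minimal sketch of the block-scaling idea behind low-precision formats like NVFP4, using a simplified symmetric integer grid rather than the actual FP4 element encoding, scale-factor format or TensorRT Model Optimizer API, each small block of values shares one scale factor so that 4-bit elements can track a wide dynamic range:

```python
import numpy as np

# Simplified sketch of block-scaled 4-bit quantization, the general idea behind
# formats like NVFP4: each small block of values shares one scale factor, so
# 4-bit elements can cover a wide dynamic range with limited accuracy loss.
# This uses a symmetric integer grid as a stand-in; it is not the actual NVFP4
# element encoding, scale-factor format or TensorRT Model Optimizer workflow.

BLOCK = 16  # elements per scaling block (NVFP4 also scales small micro-blocks)

def quantize_blocks(x: np.ndarray):
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # map each block to [-7, 7]
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

weights = np.random.randn(4096).astype(np.float32)
q, scale = quantize_blocks(weights)
print("mean abs error:", np.abs(weights - dequantize_blocks(q, scale)).mean())
```

The hardware acceleration mentioned above means Blackwell and Blackwell Ultra tensor cores operate on block-scaled 4-bit data natively rather than dequantizing in software.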
“We’re delivering leadership performance on every one of these tests, which speaks not only to our performance but our versatility,” Salvatore said.
Disaggregated serving
Another important factor is disaggregated serving. Within inference, Salvatore said, there are two key operations: one is called context, and the other is generation. Context, also referred to as prefill, parses an incoming query from the end user, works out what is being asked and generates the first output token.
Generation is focused on generating all the subsequent output tokens, Salvatore said.
“And so what we did is a disaggregated serving where we basically set up dedicated GPUs — quite a few to handle context — and then others to handle generation,” he said.
The setup used about 56 GPUs with 130 terabytes per second of bandwidth. Dedicating GPUs to each of the two operations delivered strong performance gains compared with the previous Hopper generation, which handled context and generation on the same GPU; the result is five times greater performance with the new Blackwell Ultra generation. That additional performance from disaggregated serving translates into real ROI for organizations.
Large language model inference consists of two workloads with distinct execution characteristics: 1) context for processing user input to produce the first output token and 2) generation to produce all subsequent output tokens.
Disaggregated serving splits context and generation tasks so each part can be optimized independently for best overall throughput. This technique was key to record-setting performance on the Llama 3.1 405B Interactive benchmark, helping to deliver a nearly 50% increase in performance per GPU with GB200 NVL72 systems compared with each Blackwell GPU in an Nvidia DGX B200 server running the benchmark with traditional serving.
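Below is a toy sketch of the routing pattern described above, with hypothetical worker functions and queues standing in for separately sized GPU pools (not Nvidia's implementation):

```python
from dataclasses import dataclass, field
from queue import Queue

# Toy illustration of disaggregated serving: a "context" (prefill) pool parses
# the full prompt and emits the first output token, then hands the request off
# to a separate "generation" (decode) pool that produces the remaining tokens.
# Names, counts and the single-threaded driver are hypothetical stand-ins for
# independently sized GPU pools.

@dataclass
class Request:
    prompt: str
    tokens: list = field(default_factory=list)

context_queue: "Queue[Request]" = Queue()     # prefill: whole prompt processed at once
generation_queue: "Queue[Request]" = Queue()  # decode: subsequent tokens, step by step

def context_worker(req: Request) -> None:
    # Prefill the prompt and produce the first output token.
    req.tokens.append(f"<first token for: {req.prompt[:24]}>")
    generation_queue.put(req)  # hand off to the decode pool

def generation_worker(req: Request, max_new_tokens: int = 3) -> None:
    # Generate the remaining output tokens one step at a time.
    for i in range(max_new_tokens):
        req.tokens.append(f"<token {i + 2}>")

context_queue.put(Request(prompt="Explain disaggregated serving in one sentence."))
while not context_queue.empty():
    context_worker(context_queue.get())
while not generation_queue.empty():
    req = generation_queue.get()
    generation_worker(req)
    print(req.tokens)
```

Because the two phases stress the hardware differently, splitting them lets each pool be tuned and scaled for its own bottleneck, which is where the per-GPU gains cited above come from.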
Nvidia also made its first submissions this round using the Nvidia Dynamo inference
framework.
Nvidia partners, including cloud service providers and server makers, submitted strong results using the Nvidia Blackwell and Hopper platforms. These partners include Azure,
Broadcom, Cisco, Dell Technologies, Giga Computing, HPE, Lambda, Lenovo, Nebius, Oracle, Quanta Cloud Technology, Supermicro and the University of Florida.
The market-leading inference performance on the Nvidia AI platform is available from major cloud providers and server makers. This translates to lower TCO and enhanced return on investment for organizations deploying sophisticated AI applications.
“Our next architecture will be Rubin, which is coming in 2026, and we’re looking very much forward to bringing that to market and talking to you about it in the future,” Salvatore said.