Nvidia argued that broad availability of its systems and open source software built to accelerate AI, from the cloud and data center to desktops and edge devices, amplifies technological democratization.
At the Hot Chips event in Silicon Valley, the Santa Clara, California-based company said its Blackwell GPU architecture is purpose-built for AI. It packs fifth-generation Tensor Cores and a new numerical format, NVFP4 (4-bit floating point), to deliver massive compute performance with high accuracy.
This architecture also integrates Nvidia NVLink-72, a next-generation high-bandwidth interconnect, enabling ultra-fast GPU-to-GPU communication and scaling across multi-GPU configurations for demanding AI workloads. Blackwell GPUs also include second-generation Transformer Engines and NVLink Fusion.
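To make the format concrete: NVFP4 stores values as 4-bit floats (two exponent bits, one mantissa bit) scaled per small block. The sketch below illustrates the general idea of block-scaled E2M1 quantization in plain NumPy; the block size and scale handling here are simplified assumptions, not Blackwell's exact scheme.

```python
import numpy as np

# Magnitudes representable by an E2M1 (2 exponent bits, 1 mantissa bit) 4-bit float.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_like(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Toy block-scaled 4-bit quantization: one scale per `block` consecutive values.
    Illustrative only; real NVFP4 also constrains the format of the scales themselves."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]  # map block max to 6.0
    scale[scale == 0] = 1.0
    # Snap each scaled magnitude to the nearest representable E2M1 value.
    idx = np.abs(np.abs(x / scale)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(x) * E2M1_GRID[idx] * scale).ravel()  # dequantized values

vals = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(vals - quantize_fp4_like(vals)).max())
```

Block scaling is what lets a 4-bit grid with only eight magnitudes track tensors whose values span many orders of magnitude.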
Accelerating AI requires more than powerful hardware and open source AI models; it demands an optimized, rapidly evolving software stack to extract peak performance from today's demanding AI workloads. Hot Chips has run for decades in Silicon Valley, and it is often where companies come to win engineers over to their hardware ecosystems, which is why Nvidia has attended for many years.
Nvidia said it is democratizing access to cutting-edge AI capabilities by releasing open source tools, models, and datasets that let developers innovate at the system level. Its GitHub repositories host more than 1,000 open source tools, and its Hugging Face collections offer more than 450 models and 80 datasets.
This comprehensive approach to open source extends across the Nvidia software stack—from fundamental data processing tools to complete AI development and deployment frameworks. Nvidia said it publishes multiple open source CUDA-X libraries that accelerate entire ecosystems of interconnected tools, ensuring that developers can leverage the full potential of open source AI on hardware like Blackwell.
The open source AI tool development pipeline begins with data preparation and analytics. RAPIDS is an open source suite of GPU-accelerated Python libraries for the data preparation and ETL (extract, transform, load) pipelines that feed directly into model training. RAPIDS lets AI workloads run end-to-end on GPUs, eliminating costly CPU bottlenecks and enabling faster training and inference.
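As a flavor of what GPU-resident ETL looks like, here is a minimal cuDF sketch (the file name and column names are hypothetical); cuDF deliberately mirrors the pandas API, so the same pattern moves from CPU to GPU with little code change.

```python
import cudf  # RAPIDS GPU DataFrame library

# Extract: load raw events directly into GPU memory (hypothetical file and columns).
events = cudf.read_parquet("events.parquet")

# Transform: filter, derive a feature, and aggregate without leaving the GPU.
active = events[events["impressions"] > 0]
active = active.assign(click_rate=active["clicks"] / active["impressions"])
features = active.groupby("user_id").agg({"click_rate": "mean", "clicks": "sum"})

# Load: hand the result to a training framework as a GPU array (zero copy).
X = features.to_cupy()
```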
Once the data pipeline is accelerated, the next step is model training. The Nvidia NeMo framework provides end-to-end training for large language models (LLMs), multimodal models, and speech models. It scales pretraining and post-training workloads seamlessly from a single GPU to thousand-node clusters for Hugging Face/PyTorch and Megatron models.
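NeMo's own recipes are configuration-driven, but the scaling pattern underneath is the familiar PyTorch distributed one. A minimal DistributedDataParallel sketch (the model and data here are placeholders) shows how the same script runs unchanged on one GPU or on a cluster:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE; NCCL carries the GPU traffic.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    model = DDP(model, device_ids=[local_rank])  # gradients sync via NCCL all-reduce
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                          # placeholder training loop
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                          # NCCL all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Launch on one node:  torchrun --nproc_per_node=8 train.py
# The same script scales to multi-node clusters via torchrun's rendezvous flags.
```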
Nvidia PhysicsNeMo is a framework for physics-informed machine learning (Physics-ML) that enables researchers and engineers to integrate physical laws into neural networks, accelerating digital twin development and scientific simulations.
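The core idea of physics-informed ML, adding a differential-equation residual to the training loss, fits in a few lines of plain PyTorch. This toy example (not PhysicsNeMo's API) trains a network to satisfy u'(x) = cos(x) with u(0) = 0, so it learns u(x) = sin(x) with no labeled data:

```python
import torch

# Tiny fully connected network u_theta(x).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    x = torch.rand(256, 1, requires_grad=True)    # collocation points in [0, 1]
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    pde_residual = du - torch.cos(x)              # physics: enforce u'(x) = cos(x)
    boundary = net(torch.zeros(1, 1))             # boundary condition: u(0) = 0
    loss = pde_residual.square().mean() + boundary.square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With no labeled data, the network converges toward u(x) = sin(x).
```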
Nvidia BioNeMo brings generative AI to the life sciences, providing pretrained models as accelerated NIM microservices, as well as tools for protein structure prediction, molecular design, and drug discovery, empowering researchers to accelerate breakthroughs in biology and healthcare.
These frameworks leverage NCCL, an open source CUDA-X library for multi-GPU and multi-node collective communication. Nvidia NeMo, Nvidia PhysicsNeMo, and Nvidia BioNeMo extend PyTorch with advanced generative capabilities, enabling developers to build, customize, and deploy powerful generative AI applications beyond standard deep learning workflows.
The Nvidia AI software stack is already powering millions of developer workflows worldwide, from academic research labs to Fortune 500 companies, enabling teams to harness the full potential of cutting-edge GPUs like Blackwell. By combining breakthrough hardware innovations such as NVFP4 precision, second-generation Transformer Engines, and NVLink Fusion with an unmatched collection of open source frameworks, pretrained models, and optimized libraries, Nvidia ensures that AI innovation scales seamlessly from prototype to production.
New ways of scaling
The exponential growth in AI model complexity has driven parameter counts from millions to trillions, demanding unprecedented computational resources that only clusters of GPUs can accommodate.
The adoption of mixture-of-experts (MoE) architectures and AI reasoning with test-time scaling increases compute demands even more. To deploy inference efficiently, AI systems have evolved toward large-scale parallelization strategies, including tensor, pipeline, and expert parallelism, as sketched below. This is driving the need for larger domains of GPUs connected by a memory-semantic scale-up compute fabric that operate as a unified pool of compute and memory.
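Tensor parallelism, for instance, splits a single layer's weight matrix across GPUs so each device computes a slice of the output, with a collective gathering the results. A toy NumPy sketch, with devices simulated as array shards:

```python
import numpy as np

batch, d_in, d_out, n_dev = 8, 512, 1024, 4
x = np.random.randn(batch, d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

# Tensor parallelism: split the weight matrix column-wise, one shard per "device".
shards = np.split(W, n_dev, axis=1)        # each shard is (d_in, d_out / n_dev)
partials = [x @ w for w in shards]         # each device computes its output slice
y = np.concatenate(partials, axis=1)       # an all-gather recombines the slices

assert np.allclose(y, x @ W, atol=1e-3)    # matches the unsharded layer
```

The collective step is why scale-up fabric bandwidth matters: every layer ends with communication, so interconnect speed sits directly on the critical path.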
Nvidia first introduced NVLink in 2016 to overcome the limitations of PCIe in high-performance computing and AI workloads. It enabled faster GPU-to-GPU communication and created a unified memory space.
In 2018, Nvidia NVLink Switch technology achieved 300 GB/s of all-to-all bandwidth between every GPU in an 8-GPU topology, paving the way for scale-up compute fabrics in the multi-GPU era. The third-generation NVLink Switch introduced Nvidia Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology, which offloads reductions into the switch to improve effective bandwidth and cut collective-operation latency.
With the fifth-generation NVLink released in 2024, NVLink Switch enhancements support all-to-all communication among 72 GPUs at 1,800 GB/s each, for 130 TB/s of aggregate bandwidth, 800x more than the first generation.
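The aggregate figure follows directly from the per-GPU number, and the 800x claim lets us back out the first generation's bandwidth as a sanity check:

```python
per_gpu_gb_s = 1_800                     # fifth-generation NVLink, per GPU
gpus = 72
aggregate_tb_s = per_gpu_gb_s * gpus / 1_000
print(aggregate_tb_s)                    # 129.6 TB/s, rounded to 130 TB/s
print(aggregate_tb_s * 1_000 / 800)      # ~162 GB/s, in line with first-gen NVLink
```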
Though NVLink has been production-deployed at scale for nearly a decade, Nvidia continues to push its limits, committing to deliver the next three NVLink generations at an annual pace. This approach delivers continuous technological advancement that matches the exponential growth in AI model complexity and computational requirements.
NVLink performance relies on hardware and communication libraries, notably the Nvidia Collective Communications Library (NCCL). NCCL was developed as an open source library to accelerate communication between GPUs in single-node and multi-node topologies, achieving near-theoretical bandwidth for GPU-to-GPU communication. It supports scale-up and scale-out seamlessly, with automatic topology awareness and optimizations, and it is integrated into every major deep learning framework, benefiting from a decade of development and production deployment.
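Frameworks reach NCCL through thin wrappers. A minimal torch.distributed sketch using the NCCL backend shows the all-reduce collective that dominates data-parallel training; NCCL picks ring or tree algorithms automatically from the detected topology:

```python
import os

import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
dist.init_process_group(backend="nccl")   # NCCL performs the GPU-to-GPU transfers
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each GPU contributes its own tensor; after the collective, every rank
# holds the elementwise sum across all ranks.
t = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```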
Maximizing AI factory revenue
Nvidia's hardware and library experience with NVLink, combined with its large scale-up domain size, meets today's AI reasoning compute needs. The 72-GPU rack architecture plays a crucial role by enabling optimal inference performance across use cases.
When evaluating LLM inference performance, frontier Pareto curves show the balance between throughput per watt and latency.
The goal for AI factory productivity and revenue is to maximize the area under the curve.
Many variables affect the curve's dynamics, including raw compute, memory capacity and throughput, and scale-up technology that enables optimizations such as tensor, pipeline, and expert parallelism with high-speed communication.
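That framing is easy to make concrete: given measured (latency, throughput-per-watt) points for candidate parallelism configurations, the frontier is the subset no other point dominates. A small sketch with made-up numbers:

```python
# Hypothetical (latency_ms, tokens_per_sec_per_watt) for candidate configurations.
configs = {
    "TP=8, batch=4":  (25, 1.1),
    "TP=4, batch=16": (60, 2.6),
    "TP=2, batch=64": (140, 3.4),
    "TP=8, batch=64": (90, 2.2),  # dominated: TP=4,batch=16 is faster and more efficient
}

def pareto_frontier(points: dict) -> dict:
    """Keep configurations for which no other point is both faster and more efficient."""
    return {
        name: (lat, eff)
        for name, (lat, eff) in points.items()
        if not any(l < lat and e > eff for l, e in points.values())
    }

print(pareto_frontier(configs))
# {'TP=8, batch=4': (25, 1.1), 'TP=4, batch=16': (60, 2.6), 'TP=2, batch=64': (140, 3.4)}
```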
Nvidia introduced NVLink Fusion to give hyperscalers access to all of the production-proven NVLink scale-up technologies. It enables custom silicon (CPUs and XPUs) to integrate with Nvidia NVLink scale-up fabric technology and rack-scale architecture for semi-custom AI infrastructure deployment.
The NVLink scale-up fabric technology access includes the NVLink SERDES, NVLink chiplets, NVLink Switches, and all aspects of the rack-scale architecture. The high-density rack-scale architecture includes the NVLink spine, copper cable system, mechanical innovations, advanced power and liquid cooling technology, and an ecosystem with supply chain readiness.
NVLink Fusion offers versatile solutions for custom CPU, custom XPU, or combined custom CPU and XPU configurations. Availability as a modular Open Compute Project (OCP) MGX rack solution lets NVLink Fusion integrate with any NIC, DPU, or scale-out switch, giving customers the flexibility to build what they need.
For custom XPU configurations, the interface to NVLink uses Universal Chiplet Interconnect Express (UCIe) IP. Nvidia provides the bridge chiplet from UCIe to NVLink for the highest performance and ease of integration, while still giving adopters the same level of access to NVLink capabilities as Nvidia itself. Because UCIe is an open standard, using it for NVLink integration gives customers the flexibility to choose other options for their XPU integration needs across current or future platforms.
As artificial intelligence redefines the computing landscape, the network has become the critical backbone shaping the data center of the future. Large language model training performance is determined not only by compute resources but by the agility, capacity, and intelligence of the underlying network. The industry is witnessing an evolution from traditional, CPU-centric infrastructures toward tightly coupled, GPU-driven, network-defined AI factories.
Nvidia also announced Nvidia Spectrum-XGS Ethernet, a scale-across technology for combining distributed data centers into unified, giga-scale AI super-factories.
As AI demand surges, individual data centers are reaching the limits of power and capacity within a single facility. To expand, operators must scale beyond any one building, but off-the-shelf Ethernet networking infrastructure, with its high latency, jitter, and unpredictable performance, gets in the way.
Spectrum-XGS Ethernet is a breakthrough addition to the Nvidia Spectrum-X Ethernet platform that removes these boundaries by introducing scale-across infrastructure. It serves as a third pillar of AI computing beyond scale-up and scale-out, extending the extreme performance and scale of Spectrum-X Ethernet to interconnect multiple distributed data centers into massive AI super-factories capable of giga-scale intelligence.
“The AI industrial revolution is here and giant scale AI factories are the essential infrastructure,” said Jensen Huang, founder and CEO of Nvidia. “With Nvidia Spectrum-XGS Ethernet, we add scale-across to scale-up and scale-out capabilities to link data centers across cities, nations and continents into vast, giga-scale AI super-factories.”
Spectrum-XGS Ethernet is fully integrated into the Spectrum-X platform, featuring algorithms that dynamically adapt the network to the distance between data center facilities.
With advanced, auto-adjusted distance congestion control, precision latency management and end-to-end telemetry, Spectrum-XGS Ethernet nearly doubles the performance of the Nvidia Collective Communications Library (NCCL), accelerating multi-GPU and multi-node communication to deliver predictable performance across geographically distributed AI clusters. As a result, multiple data centers can operate as a single AI super-factory, fully optimized for long-distance connectivity.
Hyperscale pioneers embracing the new infrastructure include CoreWeave, which will be among the first to connect its data centers with Spectrum-XGS Ethernet.
“CoreWeave’s mission is to deliver the most powerful AI infrastructure to innovators everywhere,” said Peter Salanki, co-founder and CTO of CoreWeave. “With Nvidia Spectrum-XGS, we can connect our data centers into a single, unified supercomputer, giving our customers access to giga-scale AI that will accelerate breakthroughs across every industry.”
The Spectrum-X Ethernet networking platform provides 1.6x greater bandwidth density than off-the-shelf Ethernet for multi-tenant, hyperscale AI factories, including the world’s largest AI supercomputer. It comprises Nvidia Spectrum-X switches and Nvidia ConnectX-8 SuperNICs, delivering seamless scalability, ultralow latency and breakthrough performance for enterprises building the future of AI.
Today’s announcement follows a drumbeat of networking innovation announcements from Nvidia, including Nvidia Spectrum-X and Nvidia Quantum-X silicon photonics networking switches, which enable AI factories to connect millions of GPUs across sites while reducing energy consumption and operational costs.
Nvidia Spectrum-XGS Ethernet is available now as part of the Nvidia Spectrum-X Ethernet platform.