NVIDIA GPU H100 Architecture In-Depth Analysis

Published July 9, 2025

The NVIDIA H100 GPU is the ninth-generation data center-class GPU announced by NVIDIA at the GTC conference in March 2022. Based on the all-new Hopper architecture, it replaces the previous-generation Ampere architecture (A100). Named after computer science pioneer Grace Hopper, this architecture is designed to address computational bottlenecks in large-model training, real-time inference, and high-performance computing (HPC). The emergence of the Hopper architecture marks a significant evolution in GPU technology, shifting from general-purpose computing toward specialized acceleration.

Previously, the A100 (Ampere architecture, released in 2020) set the standard for AI and scientific computing with its 7nm process, third-generation Tensor Cores, and HBM2e memory. However, with the rise of large models with hundreds of billions of parameters (such as GPT-4), traditional architectures face challenges in computational density, interconnect bandwidth, and energy efficiency. The H100 achieves a generational performance leap through TSMC's 4N custom process, the integration of 80 billion transistors (a 48% increase over the A100), and six key technological breakthroughs.

Core Technological Innovations and Hardware Features

Architectural Design: The Leap from Ampere to Hopper

The core improvement of the Hopper architecture lies in the reconstruction of the parallel computing paradigm. The H100 is the first truly asynchronous GPU, expanding the A100's global-to-shared-memory asynchronous transfer capabilities to cover the entire address space and adding support for tensor memory access modes. This allows applications to build end-to-end asynchronous pipelines that fully overlap and hide data movement on and off the chip behind computation. While the A100's Ampere architecture focuses primarily on general-purpose computing, the H100, through Thread Block Clusters and Asynchronous Transaction Barriers, achieves cross-SM (streaming multiprocessor) cooperative scheduling and data sharing for the first time. This design enables multiple thread blocks to efficiently drive the Tensor Memory Accelerator (TMA) and Tensor Cores, significantly improving compute pipeline utilization.

Computing Units: Fourth-Generation Tensor Cores and the FP8 Revolution

The H100's fourth-generation Tensor Cores support FP8 precision, delivering sparse computing performance of up to 4,000 TFLOPS, a sixfold increase over the A100's FP16 performance. The introduction of FP8 not only reduces memory consumption (by roughly 70%) but also enables the Transformer Engine to dynamically switch between FP8 and FP16 precision, automatically handling weight scaling and precision compensation. Benchmark tests show that when training a 175-billion-parameter GPT-3 model, the H100 achieves a 9x speedup over the A100, with inference latency reduced to sub-second levels.

Interconnect and Memory: Breakthroughs with NVLink 4.0 and HBM3

The H100 features fourth-generation NVLink, delivering 900 GB/s of per-card interconnect bandwidth. Combined with NVSwitch technology, it can scale to a 256-card cluster, achieving an all-to-all bandwidth of 57.6 TB/s, seven times that of PCIe Gen 5. On the memory side, HBM3 boosts bandwidth to 3.35 TB/s (a 68% increase over the A100's HBM2e), and when paired with 50 MB of L2 cache, it can handle data throughput on the scale of the global internet.

Security and Energy Efficiency: Confidential Computing and Energy Optimization

The H100 introduces confidential computing capabilities for the first time, protecting sensitive data (such as medical genomics) through hardware isolation. It also supports second-generation Multi-Instance GPU (MIG), which divides a single card into seven independent instances, providing a secure multi-tenant environment for cloud services. Although the H100's typical power consumption reaches 700 W (a 75% increase over the A100), its FP16 performance per watt has improved to 2.83 TFLOPS/W, and the three-year total cost of ownership (TCO) has been reduced by 28%.

Figure 6 shows the full-scale GH100 GPU, which contains 144 streaming multiprocessors (SMs). The mass-produced H100 GPU is available in two versions based on form factor: the SXM5 version contains 132 SMs, while the PCIe version is reduced to 114 SMs. It is important to note that the core design goal of the H100 GPU is to accelerate artificial intelligence (AI), high-performance computing (HPC), and data analysis workloads in data center and edge computing scenarios; its hardware architecture is not optimized for traditional graphics rendering. In both the SXM5 and PCIe versions, only two texture processing clusters (TPCs) retain graphics capability, meaning they support the execution of vertex, geometry, and pixel shaders. In other words, while the H100 keeps a basic graphics pipeline, it is essentially a dedicated accelerator designed for compute-intensive workloads.


Building on the Streaming Multiprocessor (SM) architecture of the NVIDIA A100 Tensor Core GPU, the H100's SMs introduce FP8 precision to deliver a fourfold increase in per-SM floating-point throughput at the same clock frequency, while doubling the throughput of the FP32 and FP64 data types supported by the previous-generation Tensor Cores. The combination of the new Transformer Engine and the FP8 Tensor Cores in the Hopper architecture enables the H100 to achieve up to 9x faster training for large language models compared to the previous-generation A100, with inference speeds increased by up to 30x. The new DPX instructions in Hopper accelerate the Smith-Waterman algorithm, used in genomics and protein sequencing, by 7x. The fourth-generation Tensor Cores, the Tensor Memory Accelerator, and overall optimizations to the SM architecture enable the H100 to deliver up to a 3x performance leap in most high-performance computing (HPC) and AI scenarios.

Computational Performance (figure omitted)

SM Architecture (figure omitted)

H100 SXM Version vs. PCIe Version

| Specification | H100 SXM5 | H100 PCIe |
| --- | --- | --- |
| Number of SMs | 132 | 114 (14% fewer) |
| Number of Tensor Cores | 528 | 456 (14% fewer) |
| FP16 sparse performance | 1.98 PetaFLOPS | 1.51 PetaFLOPS |
| FP8 sparse performance | 3.96 PetaFLOPS | 3.03 PetaFLOPS |
| Memory bandwidth | 3 TB/s (HBM3) | 2 TB/s (HBM2e) |

Key Differences: The SXM5 offers an absolute performance advantage of approximately 30% through a higher number of SMs and HBM3 memory, making it particularly suitable for compute-intensive workloads.

Use Case Impact: The SXM5 offers lower latency in multi-GPU training (e.g., LLM), while the PCIe version is better suited for single-card inference or small-scale clusters.

Interconnect Technology and Scalability

1. NVLink Interconnect

(1) SXM5: Integrates 18 fourth-generation NVLink links, with a total bandwidth of 900 GB/s, supporting full interconnection of 8 cards (via NVSwitch).

(2) PCIe: Optional NVLink bridge (2-card interconnect) with a bandwidth of 600 GB/s, but it requires physical space and has limited scalability.

2. PCIe Gen5 Interface

(1) SXM5: Used exclusively for CPU communication, with a bandwidth of 64 GB/s (bidirectional).


(2) PCIe: Primary communication interface, bandwidth 128 GB/s (bidirectional), supports atomic operations to optimize CPU-GPU synchronization.

Thermal Management and Power Consumption

1. TDP Power Consumption

(1) SXM5: 700W; requires liquid cooling or custom air cooling (e.g., DGX systems).

(2) PCIe: 350W, compatible with standard server cooling solutions.


2. Energy Efficiency: SXM5 offers higher performance per watt under full load, while PCIe is better suited for energy-sensitive scenarios.

SM Architecture

Fourth-generation Tensor Cores

Designed as high-performance computing units specifically for matrix multiply-add (MMA) operations, Tensor Cores have become the core acceleration engines for AI training, inference, and scientific computing. Their unique architecture significantly surpasses the throughput and energy efficiency of traditional floating-point (FP), integer (INT), and fused multiply-add (FMA) units by parallelizing matrix operations. Since the introduction of Tensor Cores in the NVIDIA Tesla® V100 in 2017, each architectural iteration has delivered breakthrough upgrades. The fourth-generation Tensor Cores featured in the H100 deliver three major innovations:

  1. Doubled compute density: At the same clock frequency, the H100 delivers twice the dense matrix compute throughput per Streaming Multiprocessor (SM) compared to the A100, with even greater efficiency gains for sparse matrix operations. When combined with the H100’s higher GPU Boost frequency (approximately 30% higher than the A100), actual performance gains can reach 2.6x.

  2. Full data type support: Covering FP8, FP16, BF16, TF32, FP64, and INT8 precisions, it meets diverse needs ranging from large-model training to edge inference. Among these, FP8 and TF32 (TensorFloat-32) are specifically optimized for Transformer-class models, reducing VRAM usage by 70% and boosting computational throughput by 3x.

  3. Energy Efficiency Optimization: Through data compression and on-chip cache optimization, power consumption for operand transfer is reduced by 30%, which is critical for achieving energy efficiency balance in the H100 with a 700W TDP.
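
To make the MMA-centric design described above concrete, here is a minimal CUDA C++ sketch using the long-standing `nvcuda::wmma` API, which drives the Tensor Cores from a kernel with FP16 inputs and FP32 accumulation. It is a generic Tensor Core example rather than H100-specific FP8 code, and the kernel name and tile size are illustrative assumptions.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matrix multiply-accumulate D = A*B + 0
// on the Tensor Cores (FP16 inputs, FP32 accumulator).
__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                   // accumulator starts at zero
    wmma::load_matrix_sync(a_frag, A, 16);                 // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);    // the Tensor Core MMA itself
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (e.g., `wmma_16x16x16<<<1, 32>>>(dA, dB, dD)`), the entire 16x16x16 multiply-accumulate is issued to the Tensor Cores as one `mma_sync` operation rather than as hundreds of scalar FMA instructions.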

Hopper FP8 Data Format

The H100 GPU introduces new FP8 Tensor Cores, designed specifically to enhance AI training and inference performance. As shown in Figure 9, the FP8 Tensor Cores support the following features:

Mixed-precision computing: supports FP32 and FP16 accumulators, compatible with a variety of precision requirements.

Two new FP8 input formats:

  1. E4M3: Consists of 4 exponent bits, 3 mantissa bits, and 1 sign bit, suitable for computational scenarios requiring higher precision but with a narrower dynamic range.

  2. E5M2: Consists of 5 exponent bits, 2 mantissa bits, and 1 sign bit, offering a wider dynamic range but slightly lower precision.
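
To illustrate how the two layouts trade dynamic range for precision, the following sketch decodes a normal (non-subnormal, non-special) FP8 bit pattern directly from the definitions above. The exponent biases (7 for E4M3, 15 for E5M2) and the maximum magnitudes follow the published format descriptions; the helper function is invented for this example and handles only normal values.

```cpp
#include <cstdint>
#include <cstdio>
#include <cmath>

// Decode a *normal* FP8 value from its raw bits (subnormal/NaN handling omitted).
// E4M3: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
// E5M2: 1 sign, 5 exponent (bias 15), 2 mantissa bits.
double decode_fp8(uint8_t bits, int exp_bits, int man_bits, int bias) {
    int sign = bits >> 7;
    int exp  = (bits >> man_bits) & ((1 << exp_bits) - 1);
    int man  = bits & ((1 << man_bits) - 1);
    double value = std::ldexp(1.0 + man / double(1 << man_bits), exp - bias);
    return sign ? -value : value;
}

int main() {
    // E4M3: exponent 1111 with mantissa 111 is reserved for NaN, so the largest
    // normal value is S.1111.110 = 1.75 * 2^8 = 448.
    printf("E4M3 max ~ %g\n", decode_fp8(0x7E, 4, 3, 7));
    // E5M2: exponent 11111 is reserved for Inf/NaN, so the largest normal value
    // is S.11110.11 = 1.75 * 2^15 = 57344, a much wider range but coarser steps.
    printf("E5M2 max ~ %g\n", decode_fp8(0x7B, 5, 2, 15));
    return 0;
}
```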


Advantages of FP8:

  1. Double the storage efficiency: Compared to FP16 or BF16, FP8 halves data storage requirements, significantly reducing GPU memory usage.

  2. Double the throughput: On the same hardware scale, the theoretical computational throughput of FP8 is twice that of FP16 or BF16.

Transformer Engine Optimizations (see subsequent sections):
The H100 dynamically combines FP8 and FP16 precision through its innovative Transformer Engine:

  1. Memory optimization: Intelligently selects FP8 for storing intermediate results in matrix multiplication and attention mechanisms, reducing pressure on memory bandwidth.

  2. Performance improvements: Accelerates training and inference through mixed-precision computing, while ensuring accuracy in high-precision scenarios—such as large language models—via lossless format conversion techniques (e.g., scaling factor compensation).


H100 GPU Hierarchy and Asynchrony Improvements

In parallel programming, data locality and asynchronous execution are key to improving performance. Data locality requires moving program data as close as possible to the execution units (such as the GPU's streaming multiprocessors, SMs) to exploit low-latency, high-bandwidth local data access. For example, caching frequently used data in an SM's shared memory or registers reduces the overhead of global GPU memory access. Asynchronous execution, on the other hand, focuses on identifying independent tasks and overlapping their execution with memory transfers or other computation. For instance, while the GPU executes a kernel, the next batch of data can be preloaded via an asynchronous memory copy engine (e.g., on a separate CUDA stream), thereby hiding data transfer latency. The core objective is to keep all GPU compute units fully utilized and avoid performance losses caused by idle time.

The Hopper architecture introduces Thread Block Clusters at the programming level. This design enables, for the first time, collaboration among thread blocks across multiple SMs, extending the benefits of data locality beyond the scale of a single SM's thread block. Additionally, Hopper's Asynchronous Transaction Barrier and Tensor Memory Accelerator significantly reduce synchronization overhead and improve asynchronous execution efficiency. For example, they allow threads and accelerator units across a cluster to collaboratively handle data dependencies while keeping the compute pipeline running continuously.
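
Before looking at Hopper's device-side mechanisms, the basic overlap principle can be shown with the host-side tools mentioned above: asynchronous copies on CUDA streams. The sketch below double-buffers chunks across two streams so that transfers for one chunk overlap with computation on another; the kernel, buffer layout, and sizes are placeholder assumptions, and the host buffers are assumed to be pinned (allocated with cudaHostAlloc) so the copies are truly asynchronous.

```cpp
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {          // placeholder compute kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// host_in/host_out: pinned host buffers; dev_buf: two device buffers of `chunk` floats.
void pipeline(const float* host_in, float* host_out, float* dev_buf[2],
              int n_chunks, int chunk) {
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int c = 0; c < n_chunks; ++c) {
        int s = c % 2;                                  // double buffering across two streams
        // Copy chunk c in, run the kernel, copy the result out, all asynchronously;
        // while one stream computes, the other can already be transferring data.
        cudaMemcpyAsync(dev_buf[s], host_in + (size_t)c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(dev_buf[s], chunk);
        cudaMemcpyAsync(host_out + (size_t)c * chunk, dev_buf[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < 2; ++s) { cudaStreamSynchronize(stream[s]); cudaStreamDestroy(stream[s]); }
}
```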

Thread Block Cluster

Since the inception of the CUDA programming model, its core design has revolved around a hierarchical structure of grids and thread blocks. In the traditional (three-tier) model, a thread block contains multiple threads that execute concurrently on a single SM, exchanging data via fast barriers and shared memory. However, modern GPUs now feature well over 100 SMs (the full GH100 die has 144), and the complexity of computational tasks has surged, so relying solely on thread blocks as the unit of locality is no longer sufficient to fully unlock the hardware's performance potential. The Thread Block Cluster architecture introduced in the H100 extends the granularity of locality control in the programming model from thread blocks on a single SM to cooperative units spanning multiple SMs. This design adds a fourth level to the CUDA programming hierarchy (the four-tier model):

Thread → Thread Block → Thread Block Cluster → Grid

Key Features:

1. Cross-SM Cooperative Scheduling: A cluster consists of a set of thread blocks that are guaranteed to be concurrently scheduled across multiple SMs within the same Graphics Processing Cluster (GPC). A GPC is a physical unit at the hardware level, where the SMs are closely adjacent in chip layout, ensuring low-latency communication.

2. Hardware-Accelerated Barriers and Memory Collaboration: Clusters support hardware-accelerated barrier synchronization (reducing latency by 90% compared to traditional software barriers) and introduce new cross-SM memory collaboration capabilities. For example, through a dedicated SM-to-SM network, threads within a cluster can directly access shared memory on other SMs, enabling efficient data sharing.

3. Flexible Programming Interfaces: Developers can dynamically group thread blocks within a grid into Clusters via the CUDA Cooperative Groups API during kernel launch (as shown in Figure 14). This flexibility allows for customizing Cluster sizes to suit different algorithms (such as dynamic programming or multi-stage model inference), for example, by binding eight thread blocks into a single Cluster to handle data-dependent intensive tasks.
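
As a rough illustration of the launch-time grouping described in point 3, the sketch below configures a cluster dimension with `cudaLaunchKernelEx` and uses the cooperative groups cluster API inside the kernel (CUDA 12+, compute capability 9.0); the kernel body and the cluster shape of two blocks are placeholder assumptions.

```cpp
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cluster_kernel(float* data) {
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();   // this block's rank within its cluster
    (void)rank;
    // ... each block works on its own tile of `data` ...
    cluster.sync();                             // hardware-accelerated cluster-wide barrier
}

// Choose the cluster shape at launch time (it can also be fixed at compile time
// with the __cluster_dims__ kernel attribute instead).
void launch(float* d_data, dim3 grid, dim3 block) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = grid;
    cfg.blockDim = block;

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;                  // 2 thread blocks per cluster in x
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, cluster_kernel, d_data);
}
```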


| Comparison Dimension | Three-Tier Model | Four-Tier Model (Compute Capability 9.0+) |
| --- | --- | --- |
| Scheduling unit | Thread blocks are executed independently by a single SM and cannot collaborate across SMs. | A thread block cluster consists of multiple blocks; blocks within a cluster are guaranteed to be scheduled together across multiple SMs within the same GPC (GPU Processing Cluster). |
| Resource allocation scope | Each SM manages the resources of its resident thread blocks (registers, shared memory) independently. | Thread blocks within a cluster share GPC-level resources (such as L2 cache), but each block still has exclusive use of the registers and shared memory of its own SM. |
| Execution granularity and collaboration scope | Threads within a block collaborate via shared memory and `__syncthreads()`; collaboration is limited to a single SM. | Blocks spanning multiple SMs within a cluster can synchronize via the cluster group API (e.g., `cluster.sync()`), supporting cluster-level coordination such as sharing intermediate data. |
| Latency and communication efficiency | Cross-block communication must go through global memory, resulting in higher latency (limited by GPU memory bandwidth). | Blocks within a cluster can use distributed shared memory (DSMEM, introduced with the H100) to exchange data directly, without a round trip through global memory, reducing cross-SM communication latency. |
| Programming model | Only the grid and block dimensions need to be defined; there is no explicit cluster declaration. | The cluster dimension must be declared explicitly (e.g., `__cluster_dims__(X, Y, Z)`) or configured dynamically via `cudaLaunchKernelEx`. |
| Hardware resource utilization | The number of thread blocks per SM is limited by that SM's resources (e.g., shared memory, registers). | A cluster can distribute blocks across multiple SMs, improving overall resource utilization within the GPC through resource reuse and load balancing. |
| Applicable scenarios | Simple parallel tasks (such as independent matrix operations) that do not require inter-block collaboration. | Complex collaborative tasks (e.g., cross-block data exchange, large-scale graph traversal) and algorithms that require low-latency synchronization or sharing of intermediate results. |
| Performance optimization potential | Optimization relies on thread locality within a block and is limited by the parallel capacity of a single SM. | Throughput improves significantly through cluster-level data locality (e.g., L2 cache reuse) and parallel execution across SMs. |
| Hardware support | All CUDA-capable GPUs. | Only GPUs with Compute Capability 9.0 or higher (e.g., the Hopper-based H100). |

Performance Improvements and Use Cases

1. Large-scale parallel computing: In genome sequence alignment (Smith-Waterman algorithm), the cluster architecture reduces cross-SM data exchange latency from microseconds to nanoseconds, achieving a 7x speedup.

2. AI Model Training: The multi-head attention mechanism in Transformer-based models can reduce global memory accesses through intra-cluster thread collaboration, thereby improving compute pipeline utilization.

3. Real-time data processing: The cluster’s hardware barriers support efficient synchronization of streaming tasks, making it suitable for multimodal fusion computing in autonomous driving sensors.

The introduction of thread block clusters marks a paradigm shift in CUDA from "single-SM optimization" to "multi-SM collaboration." By exposing cross-SM locality control, the H100 enables developers to manage data flows and computational dependencies with greater precision, thereby maximizing hardware resource utilization in scenarios such as training models with hundreds of billions of parameters and ultra-large-scale scientific simulations. This architectural evolution lays the foundation for deeper GPU optimization in heterogeneous computing.

Distributed Shared Memory

The H100's Thread Block Cluster architecture enables efficient cross-SM data exchange through Distributed Shared Memory (DSMEM). All threads within a cluster can directly access the shared memory of other SMs through load, store, and atomic operations: DSMEM logically combines the shared memory virtual address spaces of the participating thread blocks into a unified distributed resource pool. This design removes the bottleneck of traditional cross-SM data exchange, which has to go through global memory, significantly improving collaboration efficiency between thread blocks.

Core Features of DSMEM 

1. Low-Latency Data Channel: Through the SM-to-SM Network, DSMEM access latency is reduced by 90% compared to global memory, and data exchange speeds are increased by approximately 7 times. For example, in dynamic programming algorithms, the transmission of intermediate results across SMs can be accomplished directly via DSMEM without the need for global memory as an intermediary.

2. Unified Address Space: DSMEM segments from all thread blocks within the cluster are mapped into a unified address space for each thread, allowing developers to access remote shared memory directly using generic pointers. Using CUDA's cooperative_groups API, developers can dynamically construct generic pointers to the shared memory of any thread block within the cluster.

3. Asynchronous Operations and Barrier Synchronization: DSMEM supports asynchronous copy operations based on shared memory barriers. For example, in streaming data processing, a thread can asynchronously load the next batch of data into DSMEM and synchronize task states via hardware-accelerated barriers, achieving full overlap between computation and data transfer. A minimal access sketch follows below.
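
Here is a minimal sketch of the DSMEM access pattern described above, using the cooperative groups cluster API (`map_shared_rank`) so one block can read another block's shared memory directly over the SM-to-SM network. The reduction performed, the cluster size of four blocks, and the kernel name are assumptions for illustration.

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block stages a partial result in its own shared memory; block 0 of the
// cluster then reads every block's buffer via distributed shared memory,
// without touching global memory. Launch with a grid whose x-dim is a multiple of 4.
__global__ void __cluster_dims__(4, 1, 1) dsmem_reduce(float* out) {
    __shared__ float partial;                    // this block's DSMEM segment
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        partial = blockIdx.x * 1.0f;             // placeholder partial result
    cluster.sync();                              // make every segment visible cluster-wide

    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned r = 0; r < cluster.num_blocks(); ++r) {
            // Generic pointer into the shared memory of the block with rank r.
            float* remote = cluster.map_shared_rank(&partial, r);
            sum += *remote;
        }
        out[blockIdx.x / 4] = sum;               // one result per cluster
    }
    cluster.sync();                              // keep all blocks resident until reads finish
}
```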


Asynchronous Execution

Architectural innovations in every generation of NVIDIA GPUs focus on performance leaps, programming flexibility, energy efficiency optimization, and improved hardware utilization, aiming to address increasingly complex computational demands. In recent years, one of the core directions of architectural evolution has been to maximize hardware resource utilization by enabling the overlapping execution of data transfer, computational tasks, and synchronization operations through enhanced asynchronous execution capabilities.

Innovations in Asynchronous Execution in the Hopper Architecture

Building on the achievements of previous generations, the Hopper architecture introduces several key features to further break through performance bottlenecks:

  1. Deep Overlapping of Computation and Memory Transfer: Through Asynchronous Transaction Barriers (ATBs) and Tensor Memory Accelerators (TMAs), the memory copy engine and compute units can operate in parallel. For example, when training Transformer models, the TMA can preload weights for the next layer while computing the current layer, eliminating “bubble” wait times found in traditional pipelines.

  2. Independent Task Decoupling and Parallelization: A new task-level dependency management mechanism enables the decoupling of tasks such as memory copying, kernel execution, and cross-GPU communication into independent operational units. In recommendation system inference, feature data loading, model computation, and result feedback can be executed completely asynchronously, reducing end-to-end latency by 40%.

  3. Minimized Synchronization Overhead: Hardware-level fine-grained synchronization primitives replace global barriers, synchronizing only the subsets of threads with data dependencies. In molecular dynamics simulations, local synchronization of inter-particle force calculations reduces synchronization overhead from 15% to 2%.

Real-World Performance Improvements

  1. Hardware Utilization: Hopper's asynchronous scheduler increases the active-time percentage of SM compute units from 85% on the A100 to 97%, approaching a "zero idle" state.

  2. Energy Efficiency: Under identical workloads, Hopper delivers 2.3 times the performance per watt of the Ampere architecture, which is critical for optimizing data center-level energy efficiency metrics such as PUE.

  3. Simplified Programming: Developers can use the CUDA Graph API to encapsulate asynchronous task chains as atomic operations (a minimal capture sketch follows below), eliminating the need to manually manage hundreds of CUDA streams and reducing code complexity by 70%.
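
As indicated in point 3, a CUDA Graph lets an asynchronous task chain be captured once and replayed as a unit. Below is a minimal stream-capture sketch with placeholder kernels and sizes; it illustrates the general CUDA Graph workflow rather than any Hopper-specific scheduling feature.

```cpp
#include <cuda_runtime.h>

__global__ void stage_a(float* x, int n) {   // placeholder first stage of the task chain
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void stage_b(float* x, int n) {   // placeholder second stage
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// `stream` must be a non-default stream created with cudaStreamCreate.
void build_and_run_graph(float* d_x, int n, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // Record the whole asynchronous task chain once...
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    stage_a<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    stage_b<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // ...then replay it as a single unit with minimal launch overhead.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```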

These improvements enable the Hopper architecture to unleash extreme computing power in scenarios such as AI training, scientific computing, and real-time inference, while also achieving "invisible" performance gains through intelligent task orchestration, laying the foundation for next-generation exascale computing.


Tensor Memory Accelerator (TMA)

To meet the high-throughput demands of the new H100 Tensor Cores, NVIDIA significantly improves data-fetch efficiency through the Tensor Memory Accelerator (TMA). This hardware unit supports the efficient transfer of large data blocks and multi-dimensional tensors between global memory and shared memory. TMA operations are initiated via a copy descriptor, which specifies the data-transfer logic in terms of tensor dimensions and block coordinates, replacing the traditional element-by-element addressing model (see Figure 18). Users can define data blocks up to the capacity of shared memory to load data from global memory into shared memory, or store data in the reverse direction.


The core advantage of TMA lies in significantly reducing addressing overhead and improving efficiency. Its technical features include:

  1. Support for 1D–5D tensor layouts;

  2. Support for multiple memory access modes;

  3. Built-in advanced features such as asynchronous reduction operations

This operation employs an asynchronous execution mechanism, relying on the shared-memory asynchronous barriers introduced with the A100 architecture for synchronization control. In the programming model, a single thread within a warp is selected to issue the asynchronous TMA operation (cuda::memcpy_async) that performs the tensor copy, while the remaining threads use cuda::barrier to wait for the data transfer to finish. The H100's SMs add new hardware specifically optimized to accelerate these asynchronous barrier wait operations, which are discussed in the next section.

Compared to previous-generation architectures, TMA delivers revolutionary improvements:

A100 approach: Relies on a special instruction (LoadGlobalStoreShared) to perform asynchronous memory copying; threads must generate all addresses and traverse the entire copy region (left side of Figure 19)

Hopper approach: After a single thread creates a copy descriptor, address generation and data movement are fully handled by the hardware (Figure 19, right). TMA significantly simplifies the programming model by automatically handling tasks such as stride calculation, offset positioning, and boundary checks during tensor segmented copying.
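
The programming pattern described above, a cooperative asynchronous copy whose completion is tracked by a shared memory barrier, can be sketched with the `cuda::memcpy_async` / `cuda::barrier` API mentioned earlier. Whether the copy is actually serviced by TMA-class bulk-copy hardware depends on the architecture and how the compiler lowers it; the tile size and kernel name are placeholder assumptions.

```cpp
#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A 256-thread block asynchronously stages a 256-element tile from global into
// shared memory; the barrier signals when the tile has arrived.
__global__ void async_tile_kernel(const float* __restrict__ in, float* __restrict__ out) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());             // one arrival expected per thread
    block.sync();

    // Cooperative asynchronous copy of the whole tile, completion tracked by `bar`.
    cuda::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256, bar);

    bar.arrive_and_wait();                    // block until the tile is in shared memory

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;   // placeholder compute
}
```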


The advent of TMA marks the evolution of the GPU memory subsystem from “software-scheduled” to “hardware-autonomous.” By offloading complex data movement tasks to dedicated acceleration units, the H100 allows developers to focus more on the algorithmic logic itself, paving the way for the computational demands of the era of trillion-parameter models.

Asynchronous Barriers and Asynchronous Transaction Barriers

Asynchronous barriers were first introduced in the Ampere GPU architecture; see the left side of Figure 20. Consider an example: a group of threads is producing data that will all be consumed after the barrier. Asynchronous barriers split the synchronization process into two steps. First, a thread signals "Arrive" upon completing the production of its portion of the shared data. This "Arrive" is non-blocking, so the thread is free to execute other independent work. Eventually, the thread requires the data produced by all other threads. At this point, it performs a "Wait" operation, which blocks until every thread has signaled "Arrive."

The advantage of asynchronous barriers is that they allow threads that arrive early to perform independent work while waiting. This overlap is a source of additional performance. If all threads have sufficient independent work, the barrier effectively becomes "free," because the Wait operation can complete immediately (since all threads have already arrived).
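
Here is a small sketch of the split arrive/wait pattern just described, using libcu++'s `cuda::barrier`. The produced data and the "independent work" are placeholders, and this shows the plain asynchronous barrier rather than Hopper's transaction-counting variant.

```cpp
#include <cuda/barrier>
#include <cuda/std/utility>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Split (asynchronous) barrier: each thread signals "Arrive" as soon as its share
// of the shared data is produced, does unrelated work, and only blocks at "Wait".
__global__ void producer_consumer(float* __restrict__ aux, float* __restrict__ out) {
    __shared__ float produced[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());
    block.sync();

    produced[threadIdx.x] = threadIdx.x * 0.5f;        // produce this thread's share
    auto token = bar.arrive();                         // non-blocking "I am done producing"

    aux[blockIdx.x * 256 + threadIdx.x] += 1.0f;       // independent work overlaps the wait

    bar.wait(cuda::std::move(token));                  // block until every thread has arrived

    // Now it is safe to consume data produced by other threads.
    out[blockIdx.x * 256 + threadIdx.x] = produced[255 - threadIdx.x];
}
```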

A new feature in Hopper is that "waiting" threads can enter a sleep state until all other threads have arrived. On previous chips, waiting threads would spin on the barrier object in shared memory.

Although asynchronous barriers remain part of the Hopper programming model, Hopper introduces a new form of barrier called the Asynchronous Transaction Barrier. The asynchronous transaction barrier is very similar to the asynchronous barrier (see the right-hand side of Figure 20). It is also a split barrier, but it counts not only the number of thread arrivals but also the number of transactions. Hopper includes a new instruction for writing to shared memory that passes both the data to be written and a transaction count; the transaction count is essentially a byte count. The asynchronous transaction barrier blocks threads at the Wait operation until all producer threads have executed Arrive and the sum of all transaction counts reaches the expected value.

The Asynchronous Transaction Barrier is a dual enhancement implemented by Hopper based on the asynchronous barrier:

1. Dual counting mechanism

(1) It counts both the number of thread arrivals

(2) It simultaneously counts transaction counts (Transaction Count), which is essentially a byte count

2. Hardware-level blocking control

(1) New instructions written to shared memory must include data and a transaction count
(2) The "Wait" instruction blocks the thread until the following conditions are met: ✓ All producer threads have completed "Arrive"; ✓ The total transaction count reaches the expected value


Transformer Engine

Transformer models are the backbone of widely used language models today (such as BERT and GPT-3). Although initially designed for natural language processing (NLP), their versatility has expanded into fields such as computer vision and drug discovery.

Key Challenges:

1. Exponential growth in scale: Model parameter counts have reached the trillions, and training times have extended to several months;

2. Explosive growth in computational demand: For example, training Megatron Turing NLG (MT-NLG) requires 2,048 NVIDIA A100 GPUs running for eight weeks;

3. Growth far outpaces other AI models: Over the past five years, the scale of Transformer models has increased 275-fold every two years.

The H100 GPU integrates a new Transformer Engine based on customized Hopper Tensor Core technology, significantly accelerating AI computations for Transformers.

Its core innovations include:

1. Intelligent mixed-precision management 

Objective: To improve performance by utilizing smaller, faster numerical formats (such as FP8) while maintaining accuracy;

Dynamic decision-making process:

Step 1: Analyze the statistical values of the Tensor Core output;

Step 2: Predict the precision type required by the next layer of the neural network;

Step 3: Dynamically convert the tensor to the target format (FP8 or FP16) before storing it in memory.

2. Dynamic Optimization of the FP8 Range 

Problem: The numerical representation range of FP8 is more limited than other formats (such as FP16);

Solution: Calculate scaling factors based on tensor statistics; dynamically scale tensor data to fit within the representable range of FP8;

Result: Each layer of the neural network operates within the required numerical range with precision, while accelerating computations in an optimal manner.
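
The dynamic-range step above reduces to an amax-based scaling factor: scale = fp8_max / amax, applied before casting to FP8 and inverted when reading back. The plain host-side sketch below illustrates only the arithmetic; the function name is invented and is not part of any NVIDIA API (the Transformer Engine performs this bookkeeping automatically per tensor, including amax histories).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Compute a per-tensor scaling factor so that the largest magnitude maps near
// the top of FP8/E4M3's representable range (maximum normal value 448).
float fp8_e4m3_scale(const float* tensor, std::size_t n) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        amax = std::max(amax, std::fabs(tensor[i]));
    const float fp8_max = 448.0f;
    return (amax > 0.0f) ? fp8_max / amax : 1.0f;  // scale applied before casting to FP8
}

// Quantize/dequantize round trip (conceptually):
//   x_fp8 ≈ round_to_fp8(x * scale),  x ≈ x_fp8 / scale
// so each layer's values occupy the usable FP8 range without overflowing.
```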

Technical Value

| Innovative Feature | Limitation of Traditional Approaches | Advantage of the Transformer Engine |
| --- | --- | --- |
| Precision management | Manual, static precision configuration | Dynamic mixed precision (adaptive switching between FP8 and FP16) |
| Use of numeric range | Fixed scaling leads to precision loss | Dynamic scaling factors matched to each layer's requirements |
| Performance | A100 training takes several weeks | H100 training speed increased by up to 9x |

This design enables the Hopper architecture to achieve a 50% reduction in VRAM usage and a 2x increase in throughput when training models with trillions of parameters, providing foundational computing power for large-scale AI training.


Transformer Engine (TE) is an acceleration library from NVIDIA (GitHub: NVIDIA/TransformerEngine) for accelerating Transformer models on NVIDIA GPUs, including the use of 8-bit floating-point (FP8) precision on Hopper, Ada, and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

It is specifically designed to run Transformer models (such as BERT, GPT, and T5) efficiently on GPUs based on the Hopper, Ada, and Blackwell architectures. At its core, it leverages 8-bit floating-point (FP8) precision to significantly boost performance and drastically reduce memory consumption during both training and inference. TE offers a suite of highly optimized Transformer layer building blocks (such as linear layers and LayerNorm) and an easy-to-use API similar to automatic mixed precision, allowing developers to integrate them seamlessly into existing code in frameworks such as PyTorch or JAX. It also includes a framework-agnostic C++ low-level library that serves as the foundation for FP8 support in other deep learning toolchains.

With the explosive growth in the number of parameters in Transformer models, their training and inference have become extremely memory- and compute-intensive. Although mainstream frameworks support FP16 mixed-precision training to accelerate processing and conserve memory, the newer FP8 precision, which offers superior performance without loss of accuracy on GPUs such as Hopper, is not natively supported by those frameworks. TE was created precisely to close this gap: it simplifies the construction of FP8 Transformer layers through a wrapped Python API. More importantly, its internal modules automatically manage key parameters, such as the dynamic scaling factors required for FP8 training, freeing users from the tedious management of low-precision details. This significantly lowers the barrier to entry for mixed-precision training, enabling developers to easily leverage the advantages of FP8.

Fourth-Generation NVLink and NVLink Network

Emerging exascale HPC and AI models with trillions of parameters (such as superhuman conversational AI) still require months of training even on supercomputers. To compress training cycles from months to days and meet commercial demands, high-speed, seamless communication must be achieved between every GPU in a server cluster. Traditional PCIe interfaces create bottlenecks due to limited bandwidth; building a powerful end-to-end computing platform requires faster, more scalable NVLink interconnect technology.

Fourth-Generation NVLink

Key Features: NVLink is NVIDIA's high-bandwidth, energy-efficient, low-latency, lossless GPU interconnect technology, featuring resilience mechanisms such as link-level error detection and packet retransmission to ensure reliable data transmission.

Bandwidth Leap: The fourth-generation NVLink on the H100 delivers 900 GB/s of total communication bandwidth, 1.5x that of the third-generation NVLink on the A100 and roughly 7x the bandwidth of PCIe Gen 5.

Physical Layer Optimization:

3rd-generation NVLink (A100): Uses four differential pairs per direction per link, delivering 25 GB/s per direction (50 GB/s bidirectional) per link;

4th-generation NVLink (H100): Requires only two high-speed differential pairs per direction, while maintaining the same 25 GB/s per direction (50 GB/s bidirectional) per link.

Scalability

The H100 integrates 18 NVLink links, providing a total bandwidth of 900 GB/s (compared to 12 links and 600 GB/s for the A100);

Supports cross-node interconnection of up to 256 GPUs, breaking through single-node limitations via NVLink Network technology.

NVLink Network

Architectural Innovations

1. Address Space Isolation:

Traditional NVLink: All GPUs share a physical address space, with requests routed directly;

NVLink Network: Introduces an independent network address space; isolates the address spaces of each GPU via H100’s built-in address translation hardware, enabling secure scaling.

2. Connection Mode

Similar to InfiniBand, users must explicitly establish connections between endpoints via software (not automatic global connections).

Cluster-Level Value 

Provides 57.6 TB/s of all-to-all bandwidth for NVLink-connected AI training clusters of up to 256 GPUs (based on the third-generation NVSwitch);

Combined with a 2:1 tapered fat-tree topology, All-Reduce operation throughput is 4.5 times higher than InfiniBand.

Technical Specification Comparison

| Feature | 4th-Generation NVLink (H100) | 3rd-Generation NVLink (A100) | Gain |
| --- | --- | --- | --- |
| Per-link bandwidth | 25 GB/s per direction | 25 GB/s per direction | Physical-layer efficiency optimization (half the differential pairs) |
| Total number of links | 18 | 12 | +50% |
| Aggregate bandwidth | 900 GB/s | 600 GB/s | 1.5x |
| Maximum scalability | 256 GPUs | 8 GPUs (per node) | 32x |
| Versus PCIe Gen 5 | 7x bandwidth advantage | 4x bandwidth advantage | Generational improvement |

Third-Generation NVSwitch

The third-generation NVSwitch includes switches deployed both inside and outside the node to connect multiple GPUs within servers, clusters, and data center environments. Each NVSwitch within a node provides 64 fourth-generation NVLink ports to accelerate multi-GPU interconnects. Total switch throughput has increased from 7.2 Tb/s in the previous generation to 13.6 Tb/s, representing an 89% increase in bandwidth. New hardware acceleration support for multicast and NVIDIA SHARP (In-Network Reductions) has been added, with the following specific optimizations:

1. Accelerated operation types

  • Write Broadcast / all_gather

  • Scatter Protocol (reduce_scatter)

  • Broadcast Atomics


2. Performance Gains

  • 2x increase in throughput for small-block collective operations;

  • Significantly lower latency compared to A100 using NCCL (NVIDIA Collective Communications Library).

3. Computational Resource Offloading

NVSwitch’s hardware acceleration of collective operations significantly reduces communication overhead for Streaming Multiprocessors (SMs), freeing up SM resources to focus on computational tasks.

| Metric | 3rd-Generation NVSwitch | Previous Generation | Improvement |
| --- | --- | --- | --- |
| Total throughput | 13.6 Tb/s | 7.2 Tb/s | +89% |
| Collective operation latency | Nanosecond-level hardware acceleration | Depends on the NCCL software layer | 40%–60% reduction |
| SM communication load | Hardware offloading | Requires SM coordination | 30%+ of compute resources freed |

New NVLink Switch System

By combining new NVLink Network technology with the third-generation NVSwitch, NVIDIA has built a large-scale, scale-up NVLink switching system network with unprecedented communication bandwidth.

1. Hierarchical Interconnect Design

Intra-node interconnect: Each GPU node exposes its NVLink bandwidth externally at a 2:1 taper;

Inter-node interconnect: Multiple nodes are connected via NVLink Switch modules (containing third-generation NVSwitch chips) deployed outside the compute nodes, forming a two-tier switching network.

2. Scale and Performance

    • Supports interconnection of up to 256 GPUs.

    • Provides 57.6 TB/s of all-to-all bandwidth.

    • Supports 1 exaFLOP of FP8 sparse AI computing performance

3. Cable and Interface Upgrades

    • Maximum cable length between switches: increased from 5 meters to 20 meters.

    • Dedicated cables: Supports NVIDIA’s proprietary OSFP (8-channel small form-factor pluggable) LinkX cables, featuring:

      • Four-port optical transceivers integrated into each OSFP module;

      • Supports 8-channel 100G PAM4 signal transmission.

    • Switch Density: A single 1RU NVLink switch can accommodate 32 slots, providing 128 NVLink ports with a data rate of 25 GB/s per port.


The DGX H100 SuperPOD can scale up to 256 GPUs, achieving full interconnectivity through a new NVLink switch based on third-generation NVSwitch technology. The NVLink Network interconnect, using a 2:1 tapered fat-tree topology, delivers the following breakthroughs:

  • A 9x increase in bisection bandwidth (e.g., in all-to-all data exchange scenarios);

  • 4.5x increase in All-Reduce throughput compared to previous-generation InfiniBand systems.


Analysis Engineer: Ye Weiyangxin Yiyuan, Ph.D. in Computer Science, University of Electronic Science and Technology of China
