How to Optimize NVIDIA CAGRA for GPU Building + CPU Querying with Cost-Efficiency in Mind

Published December 9, 2025


This is the fifth article in the Milvus Week series, which aims to compile the advanced technical practices and innovations accumulated by the Zilliz team over the past six months into a series of in-depth, practical articles.


Key takeaways from Day 5:


  • CAGRA is a graph indexing technology for billion-scale vector data, specifically designed for GPUs and launched by NVIDIA


  • GPU-based indexing combined with CPU-based retrieval is often more efficient and cost-effective in practical implementations


  • Milvus's `adapt_for_cpu` parameter is key to controlling the serialization and deserialization behavior of the CAGRA index.


When dealing with high-dimensional vector data at the billion-scale—or even the hundred-billion or trillion-scale—how should one choose an indexing method to balance both search accuracy and efficiency?


The answer is undoubtedly graph-based indexing.


Graph-based indexes, represented by NSW, HNSW, CAGRA, and Vamana, achieve a balance between accuracy and efficiency by mapping high-dimensional vectors into navigable graph structures and using path navigation on the graph to quickly locate nearest neighbors during the retrieval phase.


However, in practical implementation, we often find that while graph-based indexes are highly efficient during the retrieval phase, the graph construction phase requires a large number of computationally intensive operations, placing high demands on hardware resources—and traditional CPUs are not well-suited for handling such parallel computing tasks.


It is precisely for this reason that the CAGRA index, specifically designed for GPU parallel computing acceleration, has gradually garnered attention over the past two years.


In response to this industry need, Milvus version 2.6.1 introduces flexible deployment options for the GPU-based CAGRA index, implementing a "GPU-based construction + CPU-based querying" hybrid model. This approach leverages CAGRA's powerful GPU graph construction to ensure index quality while reusing HNSW's mature CPU-based querying to reduce deployment costs, combining the strengths of both.


This model is particularly suitable for scenarios with low data update frequencies, large-scale queries, and cost-sensitive requirements, offering a practical solution that balances performance and cost-effectiveness.


Below, we will provide a detailed breakdown of the construction principles and application details of CAGRA indexes in Milvus.


01 


Understanding CAGRA


Currently, mainstream graph indexing technologies are primarily divided into two categories: iterative graph construction technologies represented by CAGRA (already implemented in Milvus), and insert-based graph construction technologies represented by Vamana (under development). The scenarios they target and their technical approaches differ significantly, each suited to different data scales and business requirements.


Among these, CAGRA is a representative of iterative graph construction, with its core advantages lying in high accuracy and high performance.


Specifically, CAGRA is a GPU-optimized graph indexing technology proposed by NVIDIA. Its core feature is the use of the NN-Descent (Nearest Neighbor Descent) algorithm for iterative graph construction, followed by multi-round pruning optimization (2-hop detours) to progressively improve the quality of the graph structure, ultimately achieving high-precision search results.


Step 1: Graph Construction Using NN-Descent (Nearest Neighbor Descent)


The core of the NN-Descent (Nearest Neighbor Descent) algorithm is as follows: if node u is a nearest neighbor of node v, and node w is a nearest neighbor of node u, then there is a very high probability that w is also a nearest neighbor of v. This transitivity allows for the efficient discovery of nearest neighbor relationships between nodes.


The graph construction process is as follows:


  1. Random Initialization: Each node randomly selects several neighbors to form the initial graph structure.

  2. Neighbor Expansion: In each iteration, collect the current neighbors of each node and the neighbors of those neighbors to form a pool of candidate neighbors. Calculate the similarity between the candidate nodes and the target node. Different candidate pools can be distributed across different GPU cores for parallel batch processing, ultimately filtering out potential closer neighbors.

  3. Connection Update: If a better neighbor is found, the current, more distant connection is replaced, gradually optimizing the overall structure of the graph.

  4. Convergence Check: When the number of updated connections falls below a threshold, the iteration stops, and the graph structure stabilizes.


As can be seen, in the above process, neighbor expansion and similarity calculation for different nodes are completely independent. We can utilize the GPU’s Thread Block mechanism to allocate independent computational resources to each node, thereby achieving large-scale parallelism.
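The four steps above can be sketched in plain Python. This is a minimal CPU toy of the NN-Descent loop on 2-D points (stand-ins for high-dimensional vectors) with made-up sizes; the real CAGRA implementation runs each node's candidate pool on its own GPU thread block.

```python
import math
import random

random.seed(0)

# Toy stand-in: 2-D points instead of high-dimensional vectors.
N, K = 200, 8
points = [(random.random(), random.random()) for _ in range(N)]

def dist(i, j):
    return math.dist(points[i], points[j])

# Step 1 - random initialization: each node picks K random neighbors.
graph = {u: random.sample([v for v in range(N) if v != u], K) for u in range(N)}

def refine_once(graph):
    """Steps 2-3 - neighbor expansion and connection update for every node.
    Each node's pool is independent, which is what CAGRA parallelizes;
    here we simply loop sequentially."""
    changed = 0
    for u in range(N):
        pool = set(graph[u])
        for v in graph[u]:                  # add neighbors of neighbors
            pool.update(graph[v])
        pool.discard(u)
        best = sorted(pool, key=lambda v: dist(u, v))[:K]
        if set(best) != set(graph[u]):
            changed += 1
        graph[u] = best
    return changed

# Step 4 - convergence check: stop once few adjacency lists still change.
for _ in range(10):
    if refine_once(graph) < N // 100:
        break

# Sanity check: how many of node 0's true K nearest neighbors were found?
true_knn = set(sorted((v for v in range(N) if v != 0),
                      key=lambda v: dist(0, v))[:K])
recall = len(true_knn & set(graph[0])) / K
print(f"recall@{K} for node 0: {recall:.2f}")
```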


Step 2: 2-hop detours for graph pruning optimization


After constructing the intermediate graph with the NN-Descent algorithm described above, we typically find that node degrees are twice the final target degree or higher. This means the graph contains a large number of redundant edges, which calls for pruning optimization (2-hop detours).


CAGRA removes redundant edges via the 2-hop detours mechanism. The core idea is as follows: if node A can reach node B indirectly through another neighboring node C (i.e., a path A→C→B exists), and the direct distance from A to B differs only slightly from the indirect distance via A→C→B, then the direct connection between A and B is considered a redundant edge and can be removed.


The advantage of this pruning mechanism is that the redundancy test for each edge relies solely on distance calculations between its endpoints and their common neighbors. Since there are no cross-edge data dependencies, it can be executed in parallel via GPU batch processing. This reduces the graph's storage overhead by 40%–50% without sacrificing search accuracy, while simultaneously improving query and navigation speed.
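The detour test can be sketched as follows. The graph, the coordinates, and the `slack` tolerance are all illustrative choices, not values from CAGRA; the point is that each edge's test reads only local distances, so the edges can be checked independently (and hence batched on a GPU).

```python
import math

# Toy intermediate graph over 2-D points (stand-ins for vectors);
# nodes, edges, and the slack factor are illustrative only.
points = {"A": (0.0, 0.0), "B": (2.0, 0.1), "C": (1.0, 0.0), "D": (0.0, 2.0)}
edges = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}

def dist(u, v):
    return math.dist(points[u], points[v])

def prune_2hop(edges, slack=1.1):
    """Drop the direct edge u->w if some neighbor v offers a 2-hop detour
    u->v->w whose length is within `slack` of the direct distance.
    Each edge's test is independent of every other edge's."""
    pruned = {}
    for u, nbrs in edges.items():
        kept = []
        for w in nbrs:
            has_detour = any(
                v != w and w in edges.get(v, ()) and
                dist(u, v) + dist(v, w) <= slack * dist(u, w)
                for v in nbrs
            )
            if not has_detour:
                kept.append(w)
        pruned[u] = kept
    return pruned

result = prune_2hop(edges)
print(result)   # A->B is redundant: the detour A->C->B is nearly as short
```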

[Figure 1]

02 


How is CAGRA on Milvus different?


Although GPUs offer significant advantages during the graph indexing phase, in actual production environments, GPU resources are typically more expensive and scarce than CPUs. If both indexing and querying rely on GPUs, this leads to a series of issues:


  • Low resource utilization (query requests are sporadic, leaving GPUs idle for long periods)


  • High deployment costs (requiring a GPU for each query service, increasing unnecessary hardware costs)


  • Limited scalability (the number of GPUs limits the number of service instances)


  • Lack of flexibility (inability to switch between GPU and CPU on demand)


To address these pain points, the open-source vector database Milvus introduced a flexible deployment option for the GPU-built CAGRA index in version 2.6.1 via the `adapt_for_cpu` parameter. This enables a hybrid mode where high-quality graph indexes are built on GPUs and queries are executed on CPUs (typically via an HNSW-compatible format), significantly reducing deployment costs while preserving index quality. It is a highly practical solution for scenarios with low data update frequency (no need for frequent index rebuilding), large-scale querying (requiring many query service instances), and cost sensitivity (a desire to reduce GPU resource investment).


(1) Interpretation of `adapt_for_cpu`


Milvus uses the `adapt_for_cpu` parameter to control the serialization and deserialization behavior of CAGRA indexes, enabling flexible switching between build and query devices.


Different settings of this parameter during the building and loading phases correspond to four core execution paths, covering various business requirements:


[Figure 2]


It is important to note that this mechanism supports one-way conversion from the CAGRA format to the HNSW format (since the graph structure of CAGRA contains all the nearest-neighbor information required by HNSW), but the HNSW format cannot be converted back to the CAGRA format. Therefore, parameter settings during the building phase must be planned in conjunction with long-term business requirements.
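A hedged sketch of what the build- and query-side parameters for this mode might look like. Field names follow the Milvus GPU_CAGRA documentation as I understand it (`intermediate_graph_degree`, `graph_degree`, `adapt_for_cpu`, `ef`); exact names and accepted values should be verified against your Milvus version, and the commented pymilvus calls are illustrative rather than tested.

```python
# Build phase: GPU_CAGRA is built on the GPU; with adapt_for_cpu set,
# the index is serialized in a form that a CPU process can load and query.
build_index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "L2",
    "params": {
        "intermediate_graph_degree": 128,  # NN-Descent degree before pruning
        "graph_degree": 64,                # target degree after 2-hop pruning
        "adapt_for_cpu": "true",           # allow CPU-side (HNSW-style) querying
    },
}

# Query phase: when the index was adapted for CPU, search follows the
# HNSW convention and takes an `ef` candidate-list size.
cpu_search_params = {"params": {"ef": 64}}

# Against a live cluster, usage would look roughly like (pymilvus, untested):
#   collection.create_index("embedding", build_index_params)
#   collection.search(query_vectors, "embedding", cpu_search_params, limit=10)
print(build_index_params["params"]["adapt_for_cpu"])
```

Remember the one-way constraint above: an index serialized for CPU querying cannot be converted back to the GPU-only CAGRA format, so this flag should be chosen at build time with long-term requirements in mind.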


03


Experiments


To validate the effectiveness of the GPU-based construction + CPU-based querying hybrid model, the Milvus team conducted systematic experiments in a standard testing environment, comparing the approaches along three dimensions: index-construction performance, query performance, and recall.


Experimental Environment


The experiments utilized industry-standard hardware configurations to ensure the reliability of the results:


CPU: AMD EPYC 7R13 Processor (16 CPUs)


GPU: NVIDIA L4


Comparison 1: Index Construction Performance


The CAGRA graph was constructed on the GPU, while the HNSW graph was constructed on the CPU; the graph degree was 64


[Figure 3]


Conclusion:


  • GPU CAGRA’s index construction speed is 12–15 times faster than CPU HNSW, fully demonstrating the significant advantage of GPUs during the graph index construction phase.


  • As the number of iterations increases, the construction time grows linearly


Comparison 2: Query Performance


The CAGRA graph is built on the GPU, with queries performed on both the CPU and GPU; CPU queries require deserialization to HNSW format first


[Figure 4]


Conclusion


  • GPU search achieves approximately 5–6 times the QPS of CPU search.


  • As the number of iterations increases, recall gradually improves and stabilizes; beyond a certain threshold, further increasing the number of iterations yields little additional benefit.


Comparison 3: Recall Comparison Between CAGRA and HNSW


CAGRA and HNSW queries are performed on the CPU to compare recall rates.


[Figure 5]


Conclusion: CAGRA’s recall outperforms HNSW on both datasets, indicating that although CAGRA was built on a GPU, deserializing it to a CPU still ensures the quality of the graph.


04 


One more thing


Milvus’s hybrid model—GPU graph construction combined with CPU querying—innovatively balances the technical advantages of GPUs with CPU cost control, providing an optimal solution for business scenarios characterized by low data update frequency, large query volumes, and cost sensitivity.


However, for ultra-large datasets, the industry typically employs Vamana-style insert-based graph construction. This approach enables efficient graph index construction even when GPU memory cannot hold the entire working set at once. The core idea is to "divide and conquer": all nodes are divided into several batches, each occupying only a limited amount of working memory, while the constructed graph structure remains high quality.


The construction process consists of three steps:

  1. Geometric growth batch partitioning: build the skeleton with small batches in the early stage, enhance parallelism with medium-sized batches in the middle stage, and fill in details with large batches in the late stage.

  2. Greedy search for node insertion: navigate from a central point to filter out nearest neighbors and expand the scope.

  3. Reverse edge updates: ensure graph symmetry and navigability.

Pruning is integrated into the construction process through real-time filtering based on the α-RNG criterion: if a candidate neighbor v is covered by an already-selected neighbor p' (d(p', v) < α × d(p, v)), it is pruned; the value of α controls the graph's sparsity and accuracy. GPU acceleration is achieved through intra-batch parallelism (parallel search and pruning of nodes within the same batch) and the geometrically growing batches (balancing quality and parallelism).
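The α-RNG selection step described above can be sketched for a single node. This is a toy version with hypothetical 2-D coordinates and parameter values; note that published Vamana/DiskANN implementations vary in which side of the inequality α multiplies, so the criterion here simply follows the form stated in this article.

```python
import math

def alpha_rng_prune(p, candidates, coords, alpha=1.2, max_degree=4):
    """Toy alpha-RNG neighbor selection for node p: scan candidates from
    nearest to farthest and drop any v 'covered' by an already-selected
    neighbor q, i.e. d(q, v) < alpha * d(p, v)."""
    d = lambda a, b: math.dist(coords[a], coords[b])
    selected = []
    for v in sorted(candidates, key=lambda v: d(p, v)):
        if len(selected) >= max_degree:
            break
        if not any(d(q, v) < alpha * d(p, v) for q in selected):
            selected.append(v)
    return selected

# Hypothetical layout: "b" sits right behind "a", and "d" behind "a" as well,
# so both get pruned; "c" points in a different direction and survives.
coords = {"p": (0, 0), "a": (1, 0), "b": (1.2, 0.1),
          "c": (-2, 0), "d": (3, 0.2)}
neighbors = alpha_rng_prune("p", ["a", "b", "c", "d"], coords)
print(neighbors)
```

Raising α keeps more long edges (denser, more navigable graph); lowering it toward 1 prunes more aggressively (sparser graph), which is the sparsity/accuracy trade-off mentioned above.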


[Figure 6]


With this mechanism, we can effectively address the challenges of massive data updates and queries faced by teams dealing with rapidly growing business data. Currently, the Milvus team is working full steam ahead on developing this index, which is expected to be released in the first half of 2026. We welcome your comments and suggestions regarding this feature in the comments section.



This article is from "Zilliz," authored by "Chen Jianlin."

