Describe the bug
In a separate test of CAGRA index building with IVF-PQ for KNN graph construction, I found that a certain batch size on the GIST dataset gives much faster index build latency. This is mainly due to compute_similarity_kernel, which is more than 3x faster at that batch size than at other sizes. The issue is reproducible with a standalone IVF-PQ search example on a randomly generated dataset. The dataset is also used as the query set to mimic the behavior of CAGRA index build. The parameters used in the code below are chosen to match CAGRA index building with IVF-PQ KNN graph construction and the dimension/size of GIST.
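For context, the CAGRA build path where this was originally observed looks roughly like the sketch below. It is only meant to show where the batched IVF-PQ searches come from; it assumes the cuvs CAGRA C++ API with its default (IVF-PQ based) graph construction, and intermediate_graph_degree = 128 is an assumption that matches the topk = 129 used in the repro.
// Sketch only (not the actual repro): a CAGRA build that internally constructs the
// KNN graph with batched IVF-PQ searches over the dataset itself.
#include <cuvs/neighbors/cagra.hpp>
void cagra_build_sketch(raft::device_resources const& dev_resources,
                        raft::device_matrix_view<const float, int64_t> dataset)
{
  cuvs::neighbors::cagra::index_params params;
  params.intermediate_graph_degree = 128;  // the internal IVF-PQ search then uses topk = 128 + 1
  params.graph_degree              = 64;
  // With default settings cuvs typically selects IVF-PQ for graph construction on a
  // dataset of this size (the heuristic may differ across versions), deriving
  // n_lists, pq_dim, and n_probes as in the repro below.
  auto index = cuvs::neighbors::cagra::build(dev_resources, params, dataset);
}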
Steps/Code to reproduce bug
In the examples directory, add the following code as a new example, and add the corresponding CMake commands to build it.
#include "common.cuh"
#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <cuvs/neighbors/ivf_pq.hpp>
#include <cuvs/neighbors/refine.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <cstdint>
void ivf_pq_build_search(raft::device_resources const& dev_resources,
                         raft::device_matrix_view<const float, int64_t> dataset,
                         const int64_t batch_size)
{
  using namespace cuvs::neighbors;  // NOLINT
  std::cout << "================================" << std::endl;
  std::cout << "Searching with batch size " << batch_size << std::endl;

  ivf_pq::index_params index_params;
  index_params.metric = cuvs::distance::DistanceType::L2Expanded;
  // Same setup as CAGRA index building using IVF-PQ for the KNN graph:
  // https://github.com/rapidsai/cuvs/blob/7b879116684501f36ca5a19a74c01fcecb52e962/cpp/src/neighbors/ivf_pq_index.cpp#L20
  index_params.n_lists =
    dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0)));
  index_params.n_lists = std::min<uint32_t>(index_params.n_lists, dataset.extent(0));
  index_params.pq_dim =
    raft::round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8));
  if (index_params.pq_dim == 0) index_params.pq_dim = 8;
  index_params.pq_bits                  = 8;
  index_params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1;

  std::cout << "Building IVF-PQ index" << std::endl;
  auto index = ivf_pq::build(dev_resources, index_params, dataset);
  std::cout << "Number of clusters " << index.n_lists() << ", number of vectors added to index "
            << index.size() << std::endl;

  // Set search parameters.
  // Same setup as CAGRA index building using IVF-PQ for the KNN graph:
  // https://github.com/rapidsai/cuvs/blob/branch-24.12/cpp/src/neighbors/detail/cagra/cagra_build.cpp
  ivf_pq::search_params search_params;
  search_params.n_probes                = std::max<uint32_t>(10, index_params.n_lists * 0.01);
  search_params.internal_distance_dtype = CUDA_R_16F;
  search_params.lut_dtype               = CUDA_R_16F;

  // Create output arrays.
  // Assuming intermediate_degree = 128; in the CAGRA build implementation, topk = intermediate_degree + 1.
  int64_t topk      = 129;
  int64_t n_queries = dataset.extent(0);
  auto neighbors    = raft::make_device_matrix<int64_t>(dev_resources, n_queries, topk);
  auto distances    = raft::make_device_matrix<float>(dev_resources, n_queries, topk);

  int64_t num_batch = static_cast<int64_t>(std::ceil(1.0 * n_queries / batch_size));

  // Search the K nearest neighbors for each batch of queries.
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 0; i < num_batch; i++) {
    int64_t curr_batch_size = std::min(batch_size, n_queries - i * batch_size);
    auto curr_queries       = raft::make_device_matrix_view<const float, int64_t>(
      dataset.data_handle() + i * batch_size * dataset.extent(1), curr_batch_size, dataset.extent(1));
    auto curr_neighbors = raft::make_device_matrix_view<int64_t, int64_t>(
      neighbors.data_handle() + i * batch_size * neighbors.extent(1), curr_batch_size, neighbors.extent(1));
    auto curr_distances = raft::make_device_matrix_view<float, int64_t>(
      distances.data_handle() + i * batch_size * distances.extent(1), curr_batch_size, distances.extent(1));
    ivf_pq::search(
      dev_resources, search_params, index, curr_queries, curr_neighbors, curr_distances);
  }
  // Wait for the asynchronous searches to finish before stopping the timer.
  dev_resources.sync_stream();
  auto end     = std::chrono::high_resolution_clock::now();
  auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "Search elapsed time: " << elapsed.count() << " milliseconds\n";

  // No refinement here, since this example only investigates search performance.
}
int main()
{
  raft::device_resources dev_resources;

  // Set a pool memory resource with 1 GiB initial pool size. All allocations use the same pool.
  rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource> pool_mr(
    rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull);
  rmm::mr::set_current_device_resource(&pool_mr);

  // Create input arrays. Same dimensionality as the GIST 1M dataset.
  int64_t n_samples = 1000000;
  int64_t n_dim     = 960;
  auto dataset = raft::make_device_matrix<float, int64_t>(dev_resources, n_samples, n_dim);
  // queries is not used; the dataset itself serves as the query set.
  auto queries = raft::make_device_matrix<float, int64_t>(dev_resources, 100, n_dim);
  generate_dataset(dev_resources, dataset.view(), queries.view());

  // Batch sizes corresponding to roughly 8 MB, 32 MB, and 128 MB query chunks.
  ivf_pq_build_search(dev_resources, raft::make_const_mdspan(dataset.view()), 2084);
  ivf_pq_build_search(dev_resources, raft::make_const_mdspan(dataset.view()), 8334);
  ivf_pq_build_search(dev_resources, raft::make_const_mdspan(dataset.view()), 33334);
}
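The three batch sizes are chosen so that each query chunk is roughly 8 MB, 32 MB, and 128 MB of float32 data at dimension 960. A purely illustrative sketch of that arithmetic (this helper is hypothetical and not part of the example above):
// Illustration only: how the batch sizes map to query-chunk sizes.
// chunk_bytes = batch_size * dim * sizeof(float), i.e. batch_size * 960 * 4 B:
//   2084 -> ~8 MB, 8334 -> ~32 MB, 33334 -> ~128 MB
inline int64_t batch_size_for_chunk_mb(int64_t chunk_mb, int64_t dim)
{
  int64_t row_bytes = dim * static_cast<int64_t>(sizeof(float));  // 3840 B for dim = 960
  return (chunk_mb * 1000000 + row_bytes - 1) / row_bytes;        // round up
}
// e.g. batch_size_for_chunk_mb(32, 960) == 8334, matching the fast case below.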
Compile the example and run it. A sample output (tested on an L40S) looks like the following. Note that batch size 8334 is more than 3x faster than the other two batch sizes. Since the dataset dimension is 960 and the data type is 4 bytes, a batch size of 8334 corresponds to roughly a 32 MB query chunk; similarly, 2084 corresponds to 8 MB and 33334 to 128 MB.
================================
Searching with batch size 2084
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 17263 milliseconds
================================
Searching with batch size 8334
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 4890 milliseconds
================================
Searching with batch size 33334
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 15240 milliseconds
Expected behavior
Similar search latency would be expected across different batch sizes.
Environment details (please complete the following information):