[BUG] much faster IVF-PQ search kernels for a certain search batch size #475

Open
abc99lr opened this issue Nov 18, 2024 · 0 comments
Labels: bug (Something isn't working)

abc99lr (Contributor) commented Nov 18, 2024

Describe the bug
In a separate test of CAGRA index building that uses IVF-PQ for KNN graph construction, I found that one particular search batch size gives much faster index build latency on the GIST dataset. This is mainly due to compute_similarity_kernel, which is more than 3x faster at that batch size than at other sizes. The issue is reproducible with a standalone IVF-PQ search example on a randomly generated dataset; the dataset is also used as the query set to mimic the behavior of CAGRA index build. The parameters in the code below are chosen to match CAGRA index building with IVF-PQ KNN graph construction and the dimension/size of GIST.

Steps/Code to reproduce bug
In the examples, add the following code as a new example, and add the corresponding CMake commands to build it.

#include "common.cuh"

#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <cuvs/neighbors/ivf_pq.hpp>
#include <cuvs/neighbors/refine.hpp>

#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <cstdint>

void ivf_pq_build_search(raft::device_resources const& dev_resources,
                         raft::device_matrix_view<const float, int64_t> dataset,
                         const int64_t batch_size)
{
  using namespace cuvs::neighbors;  // NOLINT
  std::cout << "================================" << std::endl;
  std::cout << "Searching with batch size " << batch_size << std::endl;

  ivf_pq::index_params index_params;
  index_params.metric                   = cuvs::distance::DistanceType::L2Expanded;
  // Same setup as CAGRA index building using IVF-PQ for KNN graph 
  // https://github.com/rapidsai/cuvs/blob/7b879116684501f36ca5a19a74c01fcecb52e962/cpp/src/neighbors/ivf_pq_index.cpp#L20
  index_params.n_lists =
    dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0)));
  index_params.n_lists = std::min<uint32_t>(index_params.n_lists, dataset.extent(0));
  index_params.pq_dim =
    raft::round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8));
  if (index_params.pq_dim == 0) index_params.pq_dim = 8;
  index_params.pq_bits                  = 8;
  index_params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1;

  std::cout << "Building IVF-PQ index" << std::endl;
  auto index = ivf_pq::build(dev_resources, index_params, dataset);

  std::cout << "Number of clusters " << index.n_lists() << ", number of vectors added to index "
            << index.size() << std::endl;

  // Set search parameters.
  // Same setup as CAGRA index building using IVF-PQ for KNN graph 
  // https://github.com/rapidsai/cuvs/blob/branch-24.12/cpp/src/neighbors/detail/cagra/cagra_build.cpp
  ivf_pq::search_params search_params;
  search_params.n_probes = std::max<uint32_t>(10, index_params.n_lists * 0.01);
  search_params.internal_distance_dtype = CUDA_R_16F;
  search_params.lut_dtype               = CUDA_R_16F;

  // Create output arrays.
  // Assuming intermediate_degree = 128. In CAGRA build implementation, topk = intermediate_degree + 1. 
  int64_t topk      = 129; 
  int64_t n_queries = dataset.extent(0);
  auto neighbors    = raft::make_device_matrix<int64_t>(dev_resources, n_queries, topk);
  auto distances    = raft::make_device_matrix<float>(dev_resources, n_queries, topk);

  int64_t num_batch = (n_queries + batch_size - 1) / batch_size;  // ceil(n_queries / batch_size)
  // Search K nearest neighbors for each of the queries.
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 0; i < num_batch; i++) {
    int64_t curr_batch_size = std::min(batch_size, n_queries - i * batch_size);
    auto curr_queries       = raft::make_device_matrix_view<const float, int64_t>(
      dataset.data_handle() + i * batch_size * dataset.extent(1), curr_batch_size, dataset.extent(1));
    auto curr_neighbors = raft::make_device_matrix_view<int64_t, int64_t>(
      neighbors.data_handle() + i * batch_size * neighbors.extent(1), curr_batch_size, neighbors.extent(1));
    auto curr_distances = raft::make_device_matrix_view<float, int64_t>(
      distances.data_handle() + i * batch_size * distances.extent(1), curr_batch_size, distances.extent(1));
    ivf_pq::search(
      dev_resources, search_params, index, curr_queries, curr_neighbors, curr_distances);
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "Search elapsed time: " << elapsed.count() << " milliseconds\n";

  // Not doing refinement as trying to investigate search performance in this example. 
}

int main()
{
  raft::device_resources dev_resources;

  // Set pool memory resource with 1 GiB initial pool size. All allocations use the same pool.
  rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource> pool_mr(
    rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull);
  rmm::mr::set_current_device_resource(&pool_mr);

  // Create input arrays. Same size and dimensionality as the GIST-1M dataset.
  int64_t n_samples = 1000000;
  int64_t n_dim     = 960;
  auto dataset      = raft::make_device_matrix<float, int64_t>(dev_resources, n_samples, n_dim);
  // queries is not used. 
  auto queries      = raft::make_device_matrix<float, int64_t>(dev_resources, 100, n_dim);
  generate_dataset(dev_resources, dataset.view(), queries.view());

  ivf_pq_build_search(dev_resources,
                      raft::make_const_mdspan(dataset.view()),
                      2084);
  ivf_pq_build_search(dev_resources,
                      raft::make_const_mdspan(dataset.view()),
                      8334);
  ivf_pq_build_search(dev_resources,
                      raft::make_const_mdspan(dataset.view()),
                      33334);
}
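
One note on the timing loop: ivf_pq::search is asynchronous with respect to the host, so to be sure the host-side timer covers all queued GPU work, a stream synchronization can be added before taking the end timestamp. A minimal sketch, assuming the raft::resource::sync_stream helper from <raft/core/resource/cuda_stream.hpp> (the numbers reported below were measured without it):

  // ... right after the batched search loop in ivf_pq_build_search():
  // wait for all search work queued on the stream owned by dev_resources to finish
  raft::resource::sync_stream(dev_resources);
  auto end = std::chrono::high_resolution_clock::now();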

Compile the example and run it. A sample output (tested on an L40S) is shown below. Note that batch size 8334 is more than 3x faster than the other two batch sizes. Since the dataset dimension is 960 and the data type is 4 bytes per element, a batch of 8334 queries corresponds to a data chunk of roughly 32 MB; similarly, 2084 corresponds to about 8 MB and 33334 to about 128 MB.

================================
Searching with batch size 2084
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 17263 milliseconds
================================
Searching with batch size 8334
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 4890 milliseconds
================================
Searching with batch size 33334
Building IVF-PQ index
using ivf_pq::index_params nrows 1000000, dim 960, n_lits 1000, pq_dim 240
Number of clusters 1000, number of vectors added to index 1000000
Search elapsed time: 15240 milliseconds
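
For reference, the batch-size-to-chunk-size arithmetic quoted above can be checked with a small standalone snippet (not part of the repro; it just reproduces the ~8/32/128 MB figures, using MB = 10^6 bytes):

#include <cstdint>
#include <iostream>

int main()
{
  // Per-batch query chunk size = batch_size * dim * sizeof(float), with dim = 960 as for GIST.
  const int64_t dim           = 960;
  const int64_t batch_sizes[] = {2084, 8334, 33334};
  for (int64_t batch_size : batch_sizes) {
    double mb = static_cast<double>(batch_size) * dim * sizeof(float) / 1e6;
    std::cout << "batch size " << batch_size << " -> ~" << mb << " MB of query data" << std::endl;
  }
}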

Expected behavior
Search latency should be similar across different batch sizes.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of RAFT install: from source, based on commit 7b879116684501f36ca5a19a74c01fcecb52e962 (upstream/branch-24.12), dated Fri Nov 15 16:12:42 2024 -0600

