Simple baselines for "Learned Indexes" to complement the blog post.
Run `make` to build, then run:

gen_lognormal

This generates 190M unique 32-bit integers of the form `int(1e+9 * x)`, where `x` is a sample from a (0, 2) log-normal distribution. The integers are written to the binary file `lognormal.sorted.190M`.
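For reference, the generation step amounts to something like the following sketch. This is a hypothetical reimplementation, not the actual `gen_lognormal` source: the RNG seed, the rejection of out-of-range samples, and the in-memory de-duplication are our assumptions (and the de-duplication set needs several GB of RAM).

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_set>
#include <vector>

int main() {
    const size_t kTarget = 190000000;  // 190M unique keys
    std::mt19937_64 rng(42);           // arbitrary fixed seed (assumption)
    std::lognormal_distribution<double> dist(0.0, 2.0);

    std::unordered_set<uint32_t> seen;
    seen.reserve(kTarget);
    while (seen.size() < kTarget) {
        uint64_t v = static_cast<uint64_t>(1e9 * dist(rng));
        if (v <= UINT32_MAX)  // keep only samples that fit in 32 bits
            seen.insert(static_cast<uint32_t>(v));
    }

    // Sort the unique keys and dump them as raw 32-bit integers.
    std::vector<uint32_t> keys(seen.begin(), seen.end());
    std::sort(keys.begin(), keys.end());

    FILE* f = std::fopen("lognormal.sorted.190M", "wb");
    if (!f) return 1;
    std::fwrite(keys.data(), sizeof(uint32_t), keys.size(), f);
    std::fclose(f);
    return 0;
}
```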
To run our SIMD-enabled bucketized cuckoo hashing baseline:
sh run_hashing.sh lognormal.sorted.190M
Our implementation uses 32-bit integer keys and values. On an Intel Xeon E5-2690v4 CPU (2.6GHz), our table took 36ns per access while wasting 0.015GB of space (1% of slots).
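To illustrate the idea, here is a minimal sketch of a SIMD probe in a bucketized cuckoo hash table. This is not our actual implementation: the 4-slot buckets, the multiplicative hash functions `h1`/`h2`, and the sentinel convention for empty slots are all assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

struct Bucket {
    uint32_t keys[4];    // assume empty slots hold a sentinel that never
    uint32_t values[4];  // matches a real key
};

// Two simple multiplicative hash functions (hypothetical choices).
inline size_t h1(uint32_t k) { return (k * 0x9E3779B97F4A7C15ULL) >> 32; }
inline size_t h2(uint32_t k) { return ((k ^ 0x5BD1E995U) * 0xC2B2AE3D27D4EB4FULL) >> 32; }

// Compare all 4 slots of a bucket against `key` at once.
inline bool probe(const Bucket& b, uint32_t key, uint32_t* value) {
    __m128i target = _mm_set1_epi32(static_cast<int>(key));
    __m128i slots  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.keys));
    // One mask bit per 32-bit lane; bit i is set iff slot i matches.
    int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(target, slots)));
    if (mask == 0) return false;
    *value = b.values[__builtin_ctz(mask)];  // GCC/Clang builtin
    return true;
}

// Cuckoo lookup: every key lives in one of exactly two candidate buckets,
// so a lookup probes at most two buckets.
bool lookup(const Bucket* table, size_t num_buckets,
            uint32_t key, uint32_t* value) {
    return probe(table[h1(key) % num_buckets], key, value) ||
           probe(table[h2(key) % num_buckets], key, value);
}
```

Because cuckoo hashing bounds every lookup to two bucket probes, the per-access cost stays predictable, and the vectorized compare resolves each probe in a handful of instructions.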
As with the hashing baseline, our range search baselines use 32-bit integer keys and values. This task makes the additional assumption that the data is sorted by key.
To run our range search baselines:
sh run_range_search.sh lognormal.sorted.190M
This script runs several baselines:
- binary search
- stx::btree, an open-source B-Tree implementation
- a simple read-only B-Tree with a two-level index
- a simple read-only B-Tree with a three-level index
- FAST, a SIMD-enabled implementation of the Fast Architecture Sensitive Tree
Our simple read-only B-Trees perform binary search on the topmost level, followed by AVX2 linear search on the subsequent levels until the position of the queried key is found. We found that pages of size 48 work well on our hardware setup; the page size can be tuned for your own hardware by passing different arguments to the `range_search` binary.
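For illustration, here is a simplified sketch of the two-level lookup. It is not the `range_search` source: it assumes a single index level over the sorted data, a page size of 48, and keys within the signed 32-bit range (unsigned keys would need the usual sign-bit flip before comparison). Compile with AVX2 enabled (e.g. `-mavx2`).

```cpp
#include <cstdint>
#include <immintrin.h>  // AVX2 intrinsics

constexpr int kPageSize = 48;  // keys per page; tunable per hardware

// Index of the first element in page[0..kPageSize) that is >= query,
// scanning 8 keys per iteration.
inline int lower_bound_avx2(const int32_t* page, int32_t query) {
    __m256i q = _mm256_set1_epi32(query);
    for (int i = 0; i < kPageSize; i += 8) {
        __m256i keys = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(page + i));
        // Mask bit j is set iff page[i + j] < query.
        int lt = _mm256_movemask_ps(
            _mm256_castsi256_ps(_mm256_cmpgt_epi32(q, keys)));
        if (lt != 0xFF)                     // some key >= query in this block
            return i + __builtin_ctz(~lt);  // first such key (GCC/Clang builtin)
    }
    return kPageSize;  // every key in the page is < query
}

// top[p] holds the first key of page p; binary search finds the last page
// whose first key is <= query, then the vectorized scan finds the position
// within that page.
int search(const int32_t* top, int num_pages,
           const int32_t* data, int32_t query) {
    int lo = 0, hi = num_pages;
    while (lo + 1 < hi) {
        int mid = lo + (hi - lo) / 2;
        if (top[mid] <= query) lo = mid; else hi = mid;
    }
    return lo * kPageSize + lower_bound_avx2(data + lo * kPageSize, query);
}
```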
On an Intel Xeon E5-2690v4 CPU (2.6GHz), we observe the following average query times on the log-normal data:
| method        | query time (ns) |
| ------------- | --------------- |
| binary search | 785.7           |
| stx::btree    | 534.1           |
| 2-level index | 201.1           |
| 3-level index | 177.3           |
| FAST          | 125.7           |
We note that direct comparisons between these numbers and those reported in the paper should be taken with a grain of salt, since the testing environments differ in unspecified ways; in particular, CPU cache sizes have a significant impact on performance for this workload. A second caveat is that the methods evaluated in the paper use binary search on the base data, whereas our simple B-Trees and the FAST baseline use vectorized linear search. While the learned index approach could plausibly be accelerated with SIMD-based linear search on the lowest level as well, this potential optimization was not evaluated in the paper.