nits
one spelling fix, use single-backticks instead of double-, one copy tweak
ryan-williams committed Oct 15, 2024
1 parent 7669e1f commit a4128de
Showing 1 changed file with 6 additions and 7 deletions.
docs/performance_tuning.md
The "best" configuration for any given model will depend on a number of variables.

General guidelines, approximately ordered by magnitude of impact:

* Read-ahead, i.e., `use_eager_fetch=True`, will improve throughput in any given epoch, as the loader will attempt to always have an I/O buffer available. This requires additional memory (in the worst case, it will approximately double memory use). Read-ahead is enabled by default.
* When using the PyTorch DataLoader, there is normally benefit from multiple workers per GPU (controlled with the DataLoader `num_workers` parameter). In most environments, 1-2 workers per GPU is a good starting point, and exceeding 2 may decrease performance due to over-subscription of CPU and RAM on the host.
* If using shuffling (enabled by default): increasing the `shuffle_chunk_size` (default value 64) will increase read performance and decrease randomness.
* The `tiledbsoma` I/O buffer size, set by the `SOMATileDBContext` parameter `soma.init_buffer_bytes`, will decrease performance if it becomes too small. The default is 1 GiB and should be sufficient for most workloads. Experimenting with larger values (e.g., 2 GiB) may show modest improvements in throughput (see the sketch after this list for how these settings fit together).
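
Because these settings span three layers (the `SOMATileDBContext`, the dataset, and the PyTorch `DataLoader`), a wiring sketch may help. Only the parameter names called out above come from this document; the `tiledbsoma_ml.ExperimentDataset` / `experiment_dataloader` names, the experiment URI, and the measurement/layer names are assumptions for illustration.

```python
import tiledbsoma
import tiledbsoma_ml  # assumed loader package; substitute the loader you actually use

# Enlarge the SOMA I/O buffer (default is 1 GiB), per the guideline above.
context = tiledbsoma.SOMATileDBContext(
    tiledb_config={"soma.init_buffer_bytes": 2 * 1024**3},
)

# The URI, measurement name, and layer name below are placeholders.
with tiledbsoma.Experiment.open("s3://bucket/path/to/experiment", context=context) as exp:
    with exp.axis_query(measurement_name="RNA") as query:
        ds = tiledbsoma_ml.ExperimentDataset(  # assumed class name
            query,
            layer_name="raw",
            batch_size=128,
            shuffle=True,
            shuffle_chunk_size=64,  # larger => faster reads, less randomness
            use_eager_fetch=True,   # read-ahead (the default); ~2x worst-case memory
        )
        # 1-2 DataLoader workers per GPU is a good starting point.
        loader = tiledbsoma_ml.experiment_dataloader(ds, num_workers=2)
        for X_batch, obs_batch in loader:
            ...  # training step goes here
```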

TileDB and TileDB-SOMA provide control over the physical data layout, and specifying a layout tuned for your query pattern can improve performance for that specific workload. The default SOMA data layout is intended to achieve reasonable performance for a wide variety of access patterns, and is not optimal for use with the data loader.

In most SOMA Experiments, the primary performance bottleneck is reading from the `X` matrix. The following guidelines are designed to reduce unutilized I/O:

* On the `obs` (first) dimension, specify space tile extents matching your expected shuffle chunk size.
* On the `var` (second) dimension, specify space tile extents matching your expected axis slice.
* Specify a tile capacity which is sufficiently large to achieve good compression in the filter pipeline, but not so large as to increase unnecessary I/O. A good starting point is 64K-128K (the default is currently 100,000).

For example:

* If you typically use a shuffle chunk of 32, set the first dimension space tile to 16 or 32.
* If you typically read most `var` values, set the second dimension space tile to a large value (e.g., 4096 or 8192). If you always read _all_ `var` columns, you could even set the tile size to the entire dimension extent.

In addition, you may find that increasing the compression level in the filter pipeline provides a modest improvement in read throughput (at the cost of write throughput when creating the array).

```
tiledbsoma.SparseNDArray.create(
    # ... (URI, type/shape, and the contents of the platform_config dict are not shown in this diff)
    },
    context=a_soma_tiledb_context
)
```
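
The diff shows only the tail of that `create()` call, so here is a minimal, self-contained sketch of how the tiling and compression guidance above can be expressed via `platform_config`. The URI, shape, tile extents, capacity, and Zstd level are illustrative assumptions, not values taken from this document.

```python
import pyarrow as pa
import tiledbsoma

a_soma_tiledb_context = tiledbsoma.SOMATileDBContext()

tiledbsoma.SparseNDArray.create(
    "file:///tmp/example_X",      # illustrative URI
    type=pa.float32(),
    shape=(100_000, 20_000),      # (n_obs, n_var) -- illustrative
    platform_config={
        "tiledb": {
            "create": {
                "capacity": 131072,  # ~128K cells per tile, per the guidance above
                "dims": {
                    "soma_dim_0": {"tile": 32},    # match the expected shuffle chunk size
                    "soma_dim_1": {"tile": 4096},  # match the expected var-axis slice
                },
                "attrs": {
                    "soma_data": {
                        # A higher compression level trades write speed for read throughput.
                        "filters": [{"_type": "ZstdFilter", "level": 9}],
                    }
                },
            }
        }
    },
    context=a_soma_tiledb_context,
)
```

With these extents, a shuffle chunk of 32 `obs` rows and a wide `var` slice map onto a small number of tiles, which is the "reduce unutilized I/O" goal described above.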

For general background on performance tuning, refer to [TileDB performance tuning](https://docs.tiledb.com/main/how-to/performance/performance-tips).
TileDB arrays are commonly comprised of multiple array fragments, representing separate writes to the array.

General guidelines:

* If data is resident on network storage (e.g., AWS EBS or S3), ensure ample network bandwidth. For example, `*n` instance types (e.g., `g4dn`) will typically provide significantly more network throughput than otherwise-similar instance types (see the sketch after this list).
* Data loading requires sufficient CPU & RAM for the TileDB engine. While requirements vary by dataset, we suggest at least 8 CPUs and 16-24 GiB of RAM per host GPU.
* On AWS, benchmarking has shown that loading of S3-resident data is approximately as performant as EBS-resident data. Using a faster storage system (e.g., local NVMe ephemeral storage) may increase throughput somewhat, at the cost of additional data logistics.
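
As a concrete illustration of pointing the loader at S3-resident data, the snippet below opens an experiment directly from S3. The bucket URI and region are placeholders, and `vfs.s3.region` is a standard TileDB config key rather than something specified in this document.

```python
import tiledbsoma

# Placeholder URI and region; other TileDB VFS settings can be passed the same way.
context = tiledbsoma.SOMATileDBContext(
    tiledb_config={
        "vfs.s3.region": "us-west-2",           # region of the (hypothetical) bucket
        "soma.init_buffer_bytes": 1 * 1024**3,  # keep the I/O buffer at its 1 GiB default
    }
)

with tiledbsoma.Experiment.open("s3://my-bucket/path/to/experiment", context=context) as exp:
    print(exp.obs.count)  # quick sanity check that the S3-resident data is reachable
```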
