
Perlmutter scheduling #775

Merged

Conversation

@burlen burlen commented Aug 28, 2023

Profiling work to determine the best way to use a single Perlmutter GPU node given the interplay between MPI, threads, CUDA, and NetCDF/HDF5.

resolves #772
resolves #769
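
As a point of reference for the scheduling question being profiled, here is a minimal, hypothetical sketch of a node-local rank-to-GPU mapping. It is not TECA code: mpi4py, `assign_device`, and both constants are illustrative assumptions.

```python
# Hypothetical sketch (not TECA code): round-robin assignment of node-local
# MPI ranks to the GPUs on a Perlmutter GPU node. Ranks beyond the
# oversubscription cap fall back to the CPU.
from mpi4py import MPI  # assumed to be available in the environment

GPUS_PER_NODE = 4        # Perlmutter GPU nodes have 4 A100s
MAX_RANKS_PER_GPU = 2    # illustrative cap on GPU oversubscription

def assign_device(node_local_rank):
    """Return a CUDA device id for this rank, or -1 to use the CPU."""
    if node_local_rank < GPUS_PER_NODE * MAX_RANKS_PER_GPU:
        return node_local_rank % GPUS_PER_NODE
    return -1

comm = MPI.COMM_WORLD
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)  # ranks sharing this node
device = assign_device(node_comm.rank)
print(f"world rank {comm.rank}: node-local rank {node_comm.rank} -> device {device}")
```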

@burlen burlen changed the base branch from develop to temporal_reduction_multiple_steps_per_request August 28, 2023 18:48

burlen commented Aug 29, 2023

this figure was made before the streaming bug was fixed!

[figure: perlmutter_cpu_threading]

burlen commented Aug 29, 2023

this figure was made before the streaming bug was fixed!

[figure: perlmutter_1_node_cpu_mpi]

add an algorithm property that allows threads in the thread pool to
inherit their device assignment from the downstream algorithm. This
reduces inter-device data movement when chaining thread pools.
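
To illustrate the idea, a conceptual sketch follows; the class and property names are hypothetical, not TECA's API.

```python
# Conceptual sketch only -- names are illustrative, not the TECA API.
# With the property enabled, a stage inherits the device of its downstream
# consumer instead of choosing its own, so intermediate data stays on one
# device when thread pools are chained.
class stage:
    def __init__(self, name, device=-1, inherit_downstream_device=False):
        self.name = name
        self.device = device                  # -1 means "not yet assigned"
        self.inherit_downstream_device = inherit_downstream_device

    def connect(self, downstream):
        # copy the consumer's device assignment up the chain when requested
        if self.inherit_downstream_device and downstream.device >= 0:
            self.device = downstream.device

reduce_stage = stage("temporal_reduction", device=0)
read_stage = stage("cf_reader", inherit_downstream_device=True)
read_stage.connect(reduce_stage)
print(read_stage.device)   # 0 -- both stages run on the same GPU, no copy between them
```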
@burlen burlen changed the title WIP -- Perlmutter scheduling Perlmutter scheduling Aug 30, 2023

burlen commented Aug 30, 2023

[figure: perlmutter_cpu_threading]
1 node, 1 MPI rank, varying the number of threads. Takeaway: 2 writer threads and 4 reduce threads was the best configuration.

burlen commented Aug 30, 2023

[figure: perlmutter_cpu_rstream]
For 1 rank, with the best threading configuration, vary the reduce stream size from 2 to N. Takeaway: slightly better with a stream size of 8. Similar tests on the writer showed the stream size didn't make any difference.

burlen commented Aug 30, 2023

[figure: perlmutter_1_node_cpu_mpi]
1 node. Using the best threading and stream size from above, vary the number of MPI ranks. Takeaway: 16 MPI ranks per node was the best. This is a GPU partition node, with a single CPU socket and 4 NICs; the CPU partition nodes have 2 CPU sockets and 1 NIC, so results may differ there.

was forwarding to teca_algorithm, which resulted in none of the
threading-related properties being picked up from the command line.
This fixes a bug introduced in 7120ecb, where the early termination
criterion was dropped from the loop that scans for completed work. Early
termination is the basis for streaming; without it we were waiting
for all work to complete before returning, effectively disabling
streaming.
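
For reference, a minimal sketch of the early-termination pattern described above, using Python's concurrent.futures purely for illustration; the function and parameter names are assumptions, not the actual TECA loop.

```python
# Illustrative sketch: the scan over in-flight work returns as soon as a
# batch of results is ready to stream, instead of waiting for everything.
import concurrent.futures as cf

def collect_for_streaming(futures, stream_size):
    """Return up to stream_size completed results as soon as they are ready."""
    results = []
    for fut in cf.as_completed(futures):
        results.append(fut.result())
        if len(results) >= stream_size:   # early termination enables streaming
            break
    return results
```
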
when the requested number of threads is less than -1, use at most
that many threads (the magnitude of the request). Fewer may be used
if there are insufficient cores on the node.
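
A sketch of the stated rule follows; the helper is an assumption, not the TECA implementation, and it reads a request of n < -1 as "use at most |n| threads".

```python
# Assumed helper for illustration: clamp a thread request of n < -1 to at
# most |n| threads, and never more than the cores available on the node.
import os

def clamp_thread_request(n_requested, cores_available=None):
    cores = cores_available if cores_available is not None else os.cpu_count()
    if n_requested < -1:
        return min(-n_requested, cores)
    return n_requested  # -1 (automatic) and explicit positive counts pass through

print(clamp_thread_request(-8, cores_available=4))   # -> 4
print(clamp_thread_request(-8, cores_available=16))  # -> 8
```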

burlen commented Aug 31, 2023

[figure: perlmutter_cpu_gpu_threading]
Right: 1 node, 1 GPU, varying the threads. Left: CPU only. Takeaway: 2 writer threads and 4 reduce threads are best, the same as the CPU-only case.

burlen commented Aug 31, 2023

[figure: perlmutter_1_node_gpu_mpi_nomps]
Comparing runs with and without NVIDIA MPS:
https://docs.nersc.gov/systems/perlmutter/running-jobs/#oversubscribing-gpus-with-cuda-multi-process-service
Takeaway: MPS only helps above 2 ranks per device; at or below that it didn't help.

burlen commented Aug 31, 2023

[figure: perlmutter_1_node_gpu_cpu_mpi]
Comparing CPU to GPU on a single node. In the blue, all ranks used a GPU. In the cyan, usage was limited to 2 ranks per GPU, and above 8 ranks CPUs were also used. In the red, CPU only. 2 writer threads, 4 reduce threads, stream size 8.

Above 16 ranks, the number of threads is reduced automatically to avoid oversubscription: at 32 ranks, 2 writer threads and 2 reduce threads; at 64 ranks, 1 writer thread and 1 reduce thread.
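
A hypothetical sketch of that automatic scaling: give each rank an even share of the node's cores so threads are never oversubscribed. The cores_per_node=64 value is an assumption used only for illustration.

```python
# Assumed helper, not the TECA implementation: divide the node's cores
# evenly among the ranks placed there and cap the requested thread count.
def threads_per_rank(ranks_per_node, requested=4, cores_per_node=64):
    fair_share = max(1, cores_per_node // ranks_per_node)
    return min(requested, fair_share)

for ranks in (16, 32, 64):
    print(ranks, threads_per_rank(ranks))   # 16 -> 4, 32 -> 2, 64 -> 1
```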

@burlen burlen merged commit 90500dc into temporal_reduction_multiple_steps_per_request Aug 31, 2023
1 check passed
@burlen burlen deleted the perlmutter_scheduling branch August 31, 2023 22:48

burlen commented Aug 31, 2023

@amandasd merged to your branch; there are some critical fixes here.
