Timing Tables

Ufuk Turuncoglu edited this page Jun 7, 2019 · 4 revisions

Sample Timings

The timing tables below show processor layouts and sample timings from runs of the coupled FV3GFS-MOM6-CICE5 system. The system is still largely unoptimized: it is not load balanced and threading has not been enabled. These numbers are therefore only initial baselines, which are expected to improve during the optimization phase.

In the tables below, Layout is the domain decomposition of each tile of the FV3 cubed-sphere grid. For example, a layout of 8,12 decomposes each of the 6 tiles into 8x12 subdomains, so the total number of FV3 forecast tasks is 8x12x6=576. FV3 also needs additional tasks for the asynchronous write component, and these must be included in the total PEs assigned to the atmosphere (576+48=624 for 48 write tasks, as in run A100).
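The task arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name `atm_total_tasks` and the constant `NUM_TILES` are hypothetical and are not part of FV3 or CIME.

```python
NUM_TILES = 6  # the FV3 cubed-sphere grid has six tiles


def atm_total_tasks(layout_x, layout_y, write_tasks):
    """Return (forecast_tasks, total_atm_tasks) for a per-tile layout.

    Hypothetical helper: forecast tasks are layout_x * layout_y per tile,
    and the write tasks are added on top to get the ATM task count.
    """
    forecast_tasks = layout_x * layout_y * NUM_TILES
    return forecast_tasks, forecast_tasks + write_tasks


forecast, total = atm_total_tasks(8, 12, write_tasks=48)
print(forecast, total)  # 576 624
```

The same calculation gives the ATM task counts listed for the other runs in the tables, for example 6,8 with 48 write tasks yields 336 and 8,12 with 72 write tasks yields 648.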

Cheyenne

Used software environment: intel/19.0.2, mpt/2.19, netcdf-mpi/4.6.1, pnetcdf/1.11.0, optimized version of ESMF 8.0.0 Beta Snapshot 32 (compiled with intel/19.0.2 and mpt/2.19).

| Run ID | Unit / note | REF | A50 | A100 | A100_O50 | A100_WR50 |
|---|---|---|---|---|---|---|
| Description | | baseline | 50% increase in ATM PET | 100% increase in ATM PET | 100% increase in ATM PET and 50% increase in OCN PET | 50% increase in write tasks |
| PE Layout | NTASKS in ATM/OCN/ICE order | 336 360 360 | 480 360 360 | 624 360 360 | 624 540 360 | 648 360 360 |
| ROOT PE | ROOTPE in ATM/OCN/ICE order | 0 336 696 | 0 480 840 | 0 624 984 | 0 624 1164 | 0 648 1008 |
| # Nodes | count | 30 | 34 | 38 | 43 | 38 |
| Layout | | 6,8 | 6,12 | 8,12 | 8,12 | 8,12 |
| Write Tasks | | 48 | 48 | 48 | 48 | 72 |
| Cost | pe-hours/simulated_years | 134882.51 | 125445.73 | 126210.27 | 139095.31 | 125343.2 |
| Throughput | simulated_years/day | 0.19 | 0.23 | 0.26 | 0.27 | 0.26 |
| Init | s | 112.092 | 105.705 | 113.107 | 115.48 | 117.376 |
| Run | s | 1231.804 | 1010.844 | 909.952 | 886.24 | 903.7 |
| Finalize | s | 85.181 | 87.868 | 87.472 | 87.76 | 48.389 |
| CPL | s | 55.804 | 60.557 | 57.507 | 60.761 | 61.33 |
| ATM | s | 802.943 | 565.16 | 466.63 | 468.548 | 463.049 |
| ICE | s | 92.018 | 103.581 | 106.74 | 118.715 | 105.496 |
| OCN | s | 251.659 | 253.444 | 252.385 | 216.019 | 253.800 |
| Relative Speed-up against REF (only run phase) | | - | 0.18 | 0.26 | 0.28 | 0.27 |
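The derived rows of the table (relative speed-up, throughput and cost) appear to follow from the run-phase time of the 1-day simulation. The sketch below shows one way to reproduce them. The formulas are inferred from the tabulated numbers rather than taken from the benchmark scripts, and the 365-day year and the 36 cores per Cheyenne node are assumptions; the function name `derived_metrics` is hypothetical.

```python
SECONDS_PER_DAY = 86400.0
DAYS_PER_YEAR = 365.0  # assumption: 365-day year


def derived_metrics(run_s, ref_run_s, nodes, cores_per_node):
    """Derive speed-up, throughput and cost from the run time of a 1-day simulation.

    run_s and ref_run_s are run-phase wall-clock times in seconds.
    Assumed (inferred) definitions:
      speed-up   = fractional reduction of run time relative to REF
      throughput = simulated years per wall-clock day
      cost       = PE-hours per simulated year, charging whole nodes
    """
    speedup = 1.0 - run_s / ref_run_s
    throughput = SECONDS_PER_DAY / (run_s * DAYS_PER_YEAR)
    cost = nodes * cores_per_node * (run_s / 3600.0) * DAYS_PER_YEAR
    return speedup, throughput, cost


# Cheyenne A50 run against REF; 36 cores per node is an assumption.
speedup, throughput, cost = derived_metrics(
    run_s=1010.844, ref_run_s=1231.804, nodes=34, cores_per_node=36
)
print(f"speed-up={speedup:.2f} throughput={throughput:.2f} cost={cost:.0f}")
```

With these assumptions the A50 values agree with the table to within rounding, which suggests the cost is charged per allocated node rather than per MPI task.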

Detailed information about NCAR's Cheyenne supercomputing system can be found here.

Stampede2

Used software environment: intel/18.0.2, impi/18.0.2, netcdf/4.6.2, optimized version of ESMF 8.0.0 Beta Snapshot 38 (compiled with intel/18.0.2 and impi/18.0.2).

| Run ID | Unit / note | REF | A50 | A100 | A100_O50 | A100_WR50 |
|---|---|---|---|---|---|---|
| Description | | baseline | 50% increase in ATM PET | 100% increase in ATM PET | 100% increase in ATM PET and 50% increase in OCN PET | 50% increase in write tasks |
| PE Layout | NTASKS in ATM/OCN/ICE order | 336 360 360 | 480 360 360 | 624 360 360 | 624 540 360 | 648 360 360 |
| ROOT PE | ROOTPE in ATM/OCN/ICE order | 0 336 696 | 0 480 840 | 0 624 984 | 0 624 1164 | 0 648 1008 |
| # Nodes | count | 22 | 25 | 28 | 32 | 29 |
| Layout | | 6,8 | 6,12 | 8,12 | 8,12 | 8,12 |
| Write Tasks | | 48 | 48 | 48 | 48 | 72 |
| Cost | pe-hours/simulated_years | 123128.03 | 112318.32 | 112561.58 | 116570.2 | 112596.74 |
| Throughput | simulated_years/day | 0.21 | 0.26 | 0.29 | 0.32 | 0.30 |
| Init | s | 102.997 | 102.909 | 99.95 | 104.418 | 100.769 |
| Run | s | 1150.013 | 923.164 | 826.039 | 748.524 | 797.804 |
| Finalize | s | 54.516 | 54.366 | 53.989 | 55.755 | 54.221 |
| CPL | s | 104.495 | 20.503 | 64.245 | 54.457 | 102.791 |
| ATM | s | 747.12 | 521.465 | 402.536 | 403.285 | 408.365 |
| ICE | s | 97.894 | 100.396 | 99.376 | 83.09 | 85.31 |
| OCN | s | 182.298 | 181.046 | 182.321 | 140.239 | 182.432 |
| Relative Speed-up against REF (only run phase) | | - | 0.20 | 0.28 | 0.35 | 0.31 |

Detailed information about TACC's Stampede2 supercomputing system can be found here.

Configuration

The test case used for the benchmark simulations is a 1-day run with January 2012 initial conditions. The reference simulation can be created using the following commands:

./create_newcase --compset UFS_S2S --res C384_t025 --case ufs.s2s.c384_t025.jan --driver nuopc --run-unsupported
cd ufs.s2s.c384_t025.jan/
./case.setup
./xmlchange DOUT_S=FALSE
./xmlchange STOP_N=1
./xmlchange RUN_REFDATE=2012-01-01
./xmlchange RUN_STARTDATE=2012-01-01
./xmlchange JOB_WALLCLOCK_TIME=00:30:00
qcmd -- ./case.build

# Edit user_nl_cice and add the following line to use the correct ice initial condition.
ice_ic = "$ENV{UGCSINPUTPATH}/cice5_model.res_2012010100.nc"

./case.submit

For example, to create a test with a different PE layout configuration (the A100_WR50 run in the benchmark tables):

# The following commands double the number of PETs used by the ATM component

./xmlchange NTASKS_CPL=648
./xmlchange NTASKS_ATM=648
./xmlchange NTASKS_OCN=360
./xmlchange NTASKS_ICE=360
./xmlchange ROOTPE_CPL=0
./xmlchange ROOTPE_ATM=0
./xmlchange ROOTPE_OCN=648
./xmlchange ROOTPE_ICE=1008

# Add the following line to user_nl_fv3gfs (in the case directory) to increase the number of I/O (write) tasks by 50%
write_tasks_per_group = 72

CIME (Common Infrastructure for Modeling the Earth) automatically calculates and sets the FV3GFS layout namelist parameter in input.nml.
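In the NTASKS and ROOTPE settings above, and in the benchmark tables, the ATM, OCN and ICE components occupy consecutive, non-overlapping PE ranges, so each root PE equals the sum of the task counts of the components placed before it. The following Python sketch derives the ROOTPE values from the task counts and prints the matching xmlchange commands. The helper name `concurrent_rootpes` is hypothetical, and the back-to-back placement is an assumption read off the tables, not a CIME requirement.

```python
def concurrent_rootpes(ntasks):
    """Place components back to back and return the root PE of each one.

    ntasks maps a component name to its task count; insertion order
    (ATM, OCN, ICE here) determines the placement order.
    """
    rootpe = {}
    offset = 0
    for comp, n in ntasks.items():
        rootpe[comp] = offset
        offset += n
    return rootpe


# Task counts of the A100_WR50 run
ntasks = {"ATM": 648, "OCN": 360, "ICE": 360}
rootpe = concurrent_rootpes(ntasks)
for comp in ntasks:
    print(f"./xmlchange NTASKS_{comp}={ntasks[comp]}")
    print(f"./xmlchange ROOTPE_{comp}={rootpe[comp]}")
```

The coupler (CPL) is left out of the sketch because in the commands above it shares the PEs of the ATM component (NTASKS_CPL=648, ROOTPE_CPL=0).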