Version 0.7 July 16, 2020
Points of contact: David Kanter, Steve Farrell
All rules are taken from the MLPerf Training Rules except for those that are overridden here.
The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.
The benchmark suite consists of the benchmarks shown in the following table.
Problem | Dataset | Quality Target |
Climate segmentation | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | IOU 0.82 |
Cosmological parameter prediction | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | Mean absolute error 0.124 |
Modeling catalysts | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | Forces mean absolute error 0.036 |
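As an illustration of how the quality targets in the table might be checked, the following sketch computes a class-averaged IoU and a mean absolute error in plain Python and compares a value against its target. The function names, the per-class averaging of IoU, and the comparison directions (IoU at or above the target, errors at or below the target) are assumptions made for illustration; the reference implementations define the authoritative metrics.

```python
# Quality targets copied from the table; the "min"/"max" direction is an assumption.
TARGETS = {
    "climate_iou": ("min", 0.82),
    "cosmo_mae": ("max", 0.124),
    "catalyst_forces_mae": ("max", 0.036),
}

def mean_iou(pred, label, num_classes=3):
    """Intersection-over-union averaged over the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, l in zip(pred, label) if p == c and l == c)
        union = sum(1 for p, l in zip(pred, label) if p == c or l == c)
        if union:  # skip classes absent from both prediction and label
            ious.append(inter / union)
    return sum(ious) / len(ious)

def mean_absolute_error(pred, target):
    """Average of |prediction - target| over all elements."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def meets_target(metric, value):
    """True if the value reaches the quality target for the given metric."""
    direction, target = TARGETS[metric]
    return value >= target if direction == "min" else value <= target
```

For example, `meets_target("cosmo_mae", 0.12)` is true under these assumptions, while `meets_target("climate_iou", 0.80)` is not.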
There are two divisions of the HPC benchmark suite, the Closed division and the Open division.
The Closed division requires using the same preprocessing, model, and training method as the reference implementation.
The closed division models are:
Problem | Model |
Climate segmentation | |
Cosmological parameter prediction | |
Modeling catalysts | https://github.com/sparticlesteve/ocp/tree/mlperf-hpc-reference |
Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.
The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.
You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.
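The rules do not prescribe how on-node caches are to be reset. One possible approach on Linux, shown here only as a sketch, is to flush dirty pages and then write to the kernel's `drop_caches` control file. This assumes a Linux node and root privileges; the function name is hypothetical, and other on-node caches (for example file system client caches) may need their own mechanism.

```python
import os

def drop_page_cache(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the Linux kernel to drop clean caches.

    Assumption: runs as root on a Linux node, once per node, before each
    benchmark instance starts.
    """
    os.sync()  # write dirty pages to storage so they become droppable
    with open(control_file, "w") as f:
        f.write("3\n")  # 3 = page cache plus dentries and inodes
```

A job launcher could call this on every node before starting the timed region of each run.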
We otherwise follow the training rule on consistency with the reference implementation preprocessing and allowance for reformatting.
CLOSED:
Allowed hyperparameter and optimizer settings are specified here. For anything not explicitly mentioned here, submissions must match the behavior and settings of the reference implementations.
Model | Name | Constraint | Definition | Reference Code |
CosmoFlow | global_batch_size | unconstrained | the global batch size for training | local |
CosmoFlow | opt_name | "sgd" | the optimizer name | |
CosmoFlow | sgd_opt_momentum | 0.9 | SGD momentum | |
CosmoFlow | opt_base_learning_rate | unconstrained | The base learning rate | |
CosmoFlow | opt_learning_rate_warmup_epochs | unconstrained | the number of epochs for learning rate to warm up to base value | |
CosmoFlow | opt_learning_rate_warmup_factor | unconstrained | the constant factor applied at learning rate warm up | scaled learning rate / |
CosmoFlow | opt_learning_rate_decay_boundary_epochs | list of positive integers | Epochs at which learning rate decays | |
CosmoFlow | opt_learning_rate_decay_factor | | the learning rate decay factor(s) at the decay boundary epochs | |
CosmoFlow | dropout | | Dropout regularization probability for the dense layers | |
CosmoFlow | opt_weight_decay | | L2 regularization parameter for the dense layers | |
DeepCAM | global_batch_size | unconstrained | the global batch size for training | |
DeepCAM | batchnorm_group_size | | Determines how many ranks participate in the batchnorm | |
DeepCAM | opt_name | Adam, AdamW, or LAMB | the optimizer name | |
DeepCAM | opt_eps | 1e-6 | epsilon for Adam | |
DeepCAM | opt_betas | unconstrained | Momentum terms for Adam-type optimizers | |
DeepCAM | opt_weight_decay | | L2 weight regularization | |
DeepCAM | opt_lr | unconstrained | the base learning rate | |
DeepCAM | scheduler_lr_warmup_steps | | the number of epochs for learning rate to warm up to base value | |
DeepCAM | scheduler_lr_warmup_factor | | When warmup is used, the target learning_rate will be lr_warmup_factor * start_lr | |
DeepCAM | scheduler_type | multistep or cosine_annealing | Specifies the learning rate schedule | |
DeepCAM | scheduler_milestones | unconstrained | If multistep, the steps at which learning rate is decayed | milestones in |
DeepCAM | scheduler_decay_rate | unconstrained | If multistep, the learning rate decay factor | decay_rate in |
DeepCAM | scheduler_t_max | | For cosine_annealing, period length in steps | |
DeepCAM | scheduler_eta_min | | For cosine_annealing, sets the minimal LR | |
DeepCAM | gradient_accumulation_frequency | | Specifies the number of gradient accumulation steps before a weight update is performed | |
OpenCatalyst | global_batch_size | | the global batch size | |
OpenCatalyst | opt_name | AdamW | the optimizer name | config setting |
OpenCatalyst | opt_base_learning_rate | | the base learning rate | config setting |
OpenCatalyst | opt_learning_rate_warmup_steps | | the number of steps for learning rate to warm up to base value | |
OpenCatalyst | opt_learning_rate_warmup_factor | | the factor applied to the learning rate at the start of warmup | |
OpenCatalyst | opt_learning_rate_decay_boundary_steps | list of positive integers | |
OpenCatalyst |
OPEN: Hyperparameters and optimizer may be freely changed.
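For the Closed division, a submitter might check a configuration against the fixed-value constraints listed in the hyperparameter table before a run. The sketch below covers only those rows whose constraint names explicit values; rows marked unconstrained, rows describing a list of positive integers, and rows whose constraint cell is blank in this document are left out. The dictionary layout, the function name, and the exact-match comparison (including letter case) are assumptions for illustration.

```python
# Explicit allowed values taken from the hyperparameter table (Closed division).
CLOSED_CONSTRAINTS = {
    "CosmoFlow": {"opt_name": {"sgd"}, "sgd_opt_momentum": {0.9}},
    "DeepCAM": {
        "opt_name": {"Adam", "AdamW", "LAMB"},
        "opt_eps": {1e-6},
        "scheduler_type": {"multistep", "cosine_annealing"},
    },
    "OpenCatalyst": {"opt_name": {"AdamW"}},
}

def closed_violations(model, hparams):
    """Return (name, value) pairs that fall outside a listed constraint.

    Hyperparameters without an entry in CLOSED_CONSTRAINTS are not checked.
    """
    allowed = CLOSED_CONSTRAINTS[model]
    return [(name, value) for name, value in hparams.items()
            if name in allowed and value not in allowed[name]]
```

An empty result does not by itself establish compliance, since anything not explicitly mentioned must still match the behavior and settings of the reference implementation.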
MLPerf HPC submissions consist of the following two metrics. Metric 1 is mandatory for a complete submission, whereas metric 2 is optional.
Metric 1, Strong Scaling (Time to Convergence), is mandatory. See MLPerf Training Rule 11 for reference; the same rules apply here.
Metric 2 is optional. It was designed to test the training capacity of a system.
Measurement: we first define three parameters.
- number of models M: the number of model instances trained in this benchmark.
- instance scale S: the scale at which each individual model instance is trained.
- total utilized scale T: the total scale used for running this benchmark. For example, if all M models are trained concurrently, then T=M*S. More generally, S <= T <= M*S if (some of) the models are trained sequentially.
Notes:
- All three numbers M, S, T are chosen by the submitter. This allows the submitter to adapt the submission to the available machine resources, i.e. compute capacity and compute time.
- S and T should be in units of compute resources, e.g. nodes, GPUs or other accelerators. This choice should be aligned with the HPC system description. For example, if the system description table lists the number of GPUs to define the scale of the system, then S should be specified in numbers of GPUs.
- S and T can be chosen independently of the submission for metric 1 (strong scaling). We encourage submitters to choose T as large as possible, ideally full system scale, but this is not required.
The submitter then trains M models on the resource partitioning (S,T) as defined above to convergence.
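The relationship between M, S and T can be sketched as a small consistency check: T must be at least one instance wide and at most wide enough to run all M instances at once. The function name is hypothetical, and the returned concurrency assumes that every instance occupies exactly S units and that leftover units stay idle.

```python
def check_partition(m, s, t):
    """Validate S <= T <= M*S and return how many instances fit concurrently.

    m: number of model instances, s: units per instance, t: total units used.
    """
    if m < 1 or s < 1:
        raise ValueError("M and S must be positive")
    if not (s <= t <= m * s):
        raise ValueError("expected S <= T <= M*S")
    return t // s  # whole instances that can train at the same time
```

For example, with M=10 instances of S=64 GPUs each on T=320 GPUs, five instances train concurrently and the rest follow sequentially.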
We define a Time-To-Train-all (TTTa) number as the difference between the end time of the instance that takes the longest to converge and the start time of the instance that starts first. Mathematically this can be expressed as
TTTa = max(run_stop) - min(run_start), where the max/min are taken over all M instances.
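A minimal sketch of this formula, assuming each instance is represented by a hypothetical `(run_start, run_stop)` pair expressed in one common time unit:

```python
def time_to_train_all(instances):
    """TTTa = latest run_stop minus earliest run_start over all instances."""
    instances = list(instances)
    if not instances:
        raise ValueError("at least one training instance is required")
    latest_stop = max(stop for _, stop in instances)
    earliest_start = min(start for start, _ in instances)
    return latest_stop - earliest_start
```

For three instances with intervals (0, 100), (5, 130) and (10, 120), the result is 130 - 0 = 130 in the chosen time unit.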
Note: the submitter is allowed to prune this number by removing results from individual training instances. As long as the minimum number of models rule is satisfied (see section Benchmark Results below), the submission is valid. The submitter then uses a modified number of models M' <= M and computes TTTa over the reduced set. This allows the submitter to remove occasional outliers or stragglers which would otherwise reduce the score disproportionately.
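Pruning can be sketched as follows. The rules leave the choice of which instances to remove to the submitter; this example assumes one possible strategy, keeping the M' instances that stop earliest. The function name and the `(run_start, run_stop)` representation are hypothetical.

```python
def pruned_ttta(instances, m_prime, min_models):
    """TTTa over the m_prime instances that stop earliest.

    instances: list of (run_start, run_stop) pairs in one common time unit.
    min_models: minimum number of instances required for the benchmark.
    """
    if not (min_models <= m_prime <= len(instances)):
        raise ValueError("expected min_models <= M' <= M")
    # Assumed strategy: drop the instances with the latest run_stop.
    kept = sorted(instances, key=lambda pair: pair[1])[:m_prime]
    return max(stop for _, stop in kept) - min(start for start, _ in kept)
```

With intervals (0, 100), (5, 130), (10, 120) and a straggler (2, 400), keeping M'=3 gives a TTTa of 130, whereas keeping all four gives 400.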
Reporting: the submitter reports the tuple (T, S, M', TTTa). A separate MLLOG file must be submitted for each training instance so that reviewers can verify the quoted numbers. Merging the logging files of individual instances is not allowed.
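A reviewer could recompute the reported tuple from the per-instance logs along the following lines. This sketch assumes that each log line of interest contains the marker `:::MLLOG` followed by a JSON object with `key` and `time_ms` fields, and that `run_start` and `run_stop` each appear in every instance's log; these format details and the function names are assumptions and should be checked against the logging library actually used.

```python
import json

MLLOG_PREFIX = ":::MLLOG"  # assumed marker preceding each JSON record

def run_interval(lines):
    """Return (run_start, run_stop) in milliseconds for one instance's log."""
    times = {}
    for line in lines:
        pos = line.find(MLLOG_PREFIX)
        if pos < 0:
            continue  # not a benchmark log record
        record = json.loads(line[pos + len(MLLOG_PREFIX):])
        if record.get("key") in ("run_start", "run_stop"):
            times[record["key"]] = record["time_ms"]
    return times["run_start"], times["run_stop"]

def report_tuple(t, s, logs):
    """Build (T, S, M', TTTa) from one list of log lines per instance."""
    intervals = [run_interval(lines) for lines in logs]
    ttta_ms = max(stop for _, stop in intervals) - min(start for start, _ in intervals)
    return (t, s, len(intervals), ttta_ms)
```

Because every instance is read from its own log, M' is simply the number of logs supplied, which matches the requirement that the files are not merged.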
Restrictions:
- The submitter must not report this score on its own. It has to be reported in conjunction with at least one score from Strong Scaling (Time to Convergence) from the same benchmark.
- This score does not allow for extrapolation. All reported M' training instances must have converged, and it is not allowed to extrapolate results in S or T.
We follow MLPerf Training Rule 11 along with the following required number of runs per benchmark. Note that since run-to-run variability is already captured by spatial multiplexing in the case of metric 3, we use the adjusted requirement that the number of trained instances must be at least equal to the number of runs for metrics 1 and 2.
Benchmark | Number of Runs (Metric 1, 2) | M' (Metric 3) |
DeepCAM | 5 | >=5 |
CosmoFlow | 10 | >=10 |
OpenCatalyst | 5 | >=5 |
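The counts in the table above can be encoded as a simple lookup, for instance in a submission checker. The dictionary and function names are hypothetical; the same minimum applies to the number of runs and to M', as the table shows.

```python
# Minimum number of runs, and minimum M', per benchmark (from the table).
REQUIRED_COUNT = {"DeepCAM": 5, "CosmoFlow": 10, "OpenCatalyst": 5}

def enough_results(benchmark, count):
    """True if the number of runs (or of converged instances M') suffices."""
    return count >= REQUIRED_COUNT[benchmark]
```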