v0.100
============================== Release Notes: v0.100 ==============================
Support for new network structures:
- 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
- 3D CosmoFlow Model
- DenseNet
- ATOM LSTM model
- RAS state classifier
- node2vec
- Transformer and other attention-based models
- ExaGAN (formerly CosmoGAN)
- MaCC ICF surrogate model
Applications:
- Created a directory of example applications, deprecating the "model zoo" directory
Support for new layers:
- Embedding layer
- Distributed embedding layer
- Channel-wise scale/bias layer
- Entry-wise scale/bias layer
- Gated Recurrent Units (GRU)
- Entry-wise batchnorm
- Argmax, Argmin, and one-hot layers
- Layer norm
- Deconvolution layer (transposed convolution)
- Layers for channel-wise operations (channel-wise fully-connected, channel-wise softmax, channel-wise scale/bias, instance norm)
- Matrix multiply layer (see the sketch after this list)
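
  A minimal sketch of composing a few of the new layers with the Python
  front-end is shown below. The class names (lbann.Embedding, lbann.LayerNorm,
  lbann.MatMul, lbann.ChannelwiseSoftmax) and their keyword arguments are
  assumptions based on the generated layer bindings; check the layer
  documentation for the exact spellings.

      # Sketch: embedding -> layer norm -> matrix multiply -> channel-wise softmax
      import lbann

      tokens = lbann.Input()                        # integer token IDs from the data reader
      embed = lbann.Embedding(tokens,
                              num_embeddings=1000,  # example vocabulary size
                              embedding_dim=64)
      normed = lbann.LayerNorm(embed)
      # Matrix multiply layer; transposing inputs is supported (see Internal features)
      scores = lbann.MatMul(normed, normed, transpose_b=True)
      attn = lbann.ChannelwiseSoftmax(scores)
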
Python front-end:
- Can now configure contrib launcher with environment variables
- Added NERSC compute center
- Per-layer specification of compute device (CPU or GPU); see the sketch after this list
- Option to write custom batch scripts with Python front-end
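
  A minimal sketch of the front-end features above: per-layer device placement
  and passing environment variables through the contrib launcher. The keyword
  names device= and environment=, the empty reader message, and the exact
  Trainer/Model arguments are assumptions; consult the Python front-end
  documentation for the precise interface.

      import lbann
      import lbann.contrib.launcher

      # Per-layer compute device
      images = lbann.Input()
      fc = lbann.FullyConnected(images, num_neurons=512, has_bias=True, device='GPU')
      act = lbann.Relu(fc, device='CPU')   # pin this layer to the CPU

      trainer = lbann.Trainer(mini_batch_size=64)
      model = lbann.Model(epochs=5,
                          layers=lbann.traverse_layer_graph(images),
                          callbacks=[lbann.CallbackPrint(), lbann.CallbackTimer()])

      # Placeholder reader message; a real run would populate it or load a prototext
      data_reader = lbann.reader_pb2.DataReader()

      # The contrib launcher is assumed to forward "environment" into the batch script
      lbann.contrib.launcher.run(trainer, model, data_reader,
                                 lbann.SGD(learn_rate=0.01),
                                 job_name='lbann_example',
                                 environment={'OMP_NUM_THREADS': 2})
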
Performance optimizations:
- Parallelized the Python data reader with the "multiprocessing" module (see the reader sketch after this list)
- Fused batchnorm statistics allreduces in forward and backward prop
- Tuned concatenate and slice layers
- Dynamically allocate and free memory for layer error signals (halves LBANN's memory footprint)
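
  The parallelized Python data reader consumes a user-provided module; a
  minimal sketch of such a module is below. The function names get_sample,
  num_samples, and sample_dims are assumptions here (they are configured by
  name in the data reader prototext), and LBANN calls them from worker
  processes spawned with "multiprocessing", so the module must be importable
  on its own.

      import numpy as np

      _num_samples = 1024
      _sample_size = 16

      def get_sample(index):
          # Return one flattened sample as 32-bit floats
          rng = np.random.RandomState(index)
          return rng.rand(_sample_size).astype(np.float32)

      def num_samples():
          return _num_samples

      def sample_dims():
          return (_sample_size,)
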
Model portability & usability:
- Bamboo tests for individual layers
Internal features:
- Added support for DistConv features (distributed, generalized, parallel convolution)
- Added support for NVSHMEM 1.0 API (used in the distributed embedding layer and the DistConv halo exchange)
- Support for multiple data types per model (per-layer)
- Support for per-layer mixed-precision weight training and inference, including mixed-precision per-weight objects and objective functions
- Improved how and when the RNGs are initialized
- Callback to dump images to TensorBoard
- Callback to save model weights (useful to export to PyTorch); see the callback sketch after this list
- Callback to save top K models (LTFB)
- Improved run-to-run reproducibility by initializing weights in alphabetical order
- Moved models from model_zoo directory to applications directory
- Cleanup and refactoring of callbacks and layer instantiation
- Grouped batchnorm statistics
- Callback to print model description
- Refactored trainer and training-state out of the model class
- Support for transposing data in matrix multiply layers
- Added DiHydrogen tensor and DistConv library
- Added parallel strategy to layer class to support DistConv
- LBANN inference mode supports loading models from multiple directories
- Cleanup of checkpoint and restart logic
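
  A minimal sketch of attaching some of the new callbacks in the Python
  front-end. The callback class names and fields (CallbackPrintModelDescription,
  CallbackSaveModel with a dir field, CallbackDumpWeights) are assumptions
  based on the generated callback bindings; verify them against callbacks.proto.

      import lbann

      callbacks = [
          lbann.CallbackPrint(),                       # per-epoch summary
          lbann.CallbackPrintModelDescription(),       # print the model description at setup
          lbann.CallbackSaveModel(dir='saved_model'),  # save weights, e.g. for export to PyTorch
          lbann.CallbackDumpWeights(),                 # dump raw weight values
      ]

      model = lbann.Model(epochs=10,
                          layers=lbann.traverse_layer_graph(lbann.Input()),
                          callbacks=callbacks)
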
I/O & data readers:
- Added in-memory data store that caches samples in CPU memory; it can be loaded during the first epoch or preloaded
- Added new "transform" data preprocessing ingestion pipeline
- Added sample list format for specifying data sets
- Introduced data coordinator that manages data readers and extracts them from the input layers
- Data store is able to checkpoint / spill its contents to local disk
- Data reader for SMILES strings
Build system:
- Hydrogen 1.3.4
- Aluminum 0.3.3
- Improved documentation on Read the Docs (RTD)
- Robust support for using Spack as a build system around CMake
- Identified compute centers for specifying build and run dependencies
- Added Catch2-based tests
Bug fixes:
- Fixed path resolution for dump weights, save model, and checkpoint callbacks
- Added mutexes for preloading the data store
- Fixed the LTFB exchange to include all ADAM optimizer state
- Fixed the mapping of I/O RNGs to I/O processing threads to ensure
consistent and correct multi-threaded performance
Retired features:
- Moving MNIST data reader is replaced by the Python data reader
- ASCII data reader is deprecated