Releases: Lightning-AI/pytorch-lightning
Bug fixing
DDP and Checkpoint bug fixes
Overview
As we continue to strengthen the codebase with more tests, we’re finally getting rid of annoying bugs that have been around for a bit now. Mostly around the inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan )
Noteworthy changes:
- Fixed TPU flag parsing
- fixed average_precision metric
- all the checkpoint issues should be gone now (including backward support for old checkpoints)
- DDP + loggers should be fixed
Detail changes
Added
- Added TorchText support for moving data to GPU (#2379)
Changed
- Changed epoch indexing from 0 instead of 1 (#2289)
- Refactor Model
backward
(#2276) - Refactored
training_batch
+ tests to verify correctness (#2327, #2328) - Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
Removed
- Moved
TrainsLogger
to Bolts (#2384)
Fixed
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number batches in case of multiple dataloaders and
limit_{*}_batches
(#1920, #2226) - Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fix for
load_from_checkpoint()
not working with absolute path on Windows (#2294) - Fixed an issue how _has_len handles
NotImplementedError
e.g. raised bytorchtext.data.Iterator
(#2293), (#2307) - Fixed
average_precision
metric (#2319) - Fixed ROC metric for CUDA tensors (#2304)
- Fixed
average_precision
metric (#2319) - Fixed lost compatibility with custom datatypes implementing
.to
(#2335) - Fixed loading model with kwargs (#2387)
- Fixed sum(0) for
trainer.num_val_batches
(#2268) - Fixed checking if the parameters are a
DictConfig
Object (#2216) - Fixed SLURM weights saving (#2341)
- Fixed swaps LR scheduler order (#2356)
- Fixed adding tensorboard
hparams
logging test (#2342) - Fixed use model ref for tear down (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
Contributors
@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing hooks & hparams
Overview
Fixing critical bugs in newly added hooks and hparams
assignment.
The recommended data following:
- use
prepare_data
to download and process the dataset. - use
setup
to do splits, and build your model internals
Detail changes
Metrics, speed improvements, new hooks and flags
Overview
Highlights of this release are adding Metric package and new hooks and flags to customize your workflow.
Major features:
- brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
hparams
can now be anything! (callself.save_hyperparameters()
to register anything in the_init_
- many speed improvements (how we move data, adjusted some flags & PL now adds 300ms overhead per epoch only!)
- much faster
ddp
implementation. Old one was renamedddp_spawn
- better support for Hydra
- added the overfit_batches flag and corrected some bugs with the
limit_[train,val,test]_batches
flag - added conda support
- tons of bug fixes 😉
Detail changes
Added
- Added
overfit_batches
,limit_{val|test}_batches
flags (overfit now uses training set for all three) (#2213) - Added metrics
- Added type hints in
Trainer.fit()
andTrainer.test()
to reflect that also a list of dataloaders can be passed in (#1723) - Allow dataloaders without sampler field present (#1907)
- Added option
save_last
to save the model at the end of every epoch inModelCheckpoint
(#1908) - Early stopping checks
on_validation_end
(#1458) - Attribute
best_model_path
toModelCheckpoint
for storing and later retrieving the path to the best saved model file (#1799) - Speed up single-core TPU training by loading data using
ParallelLoader
(#2033) - Added a model hook
transfer_batch_to_device
that enables moving custom data structures to the target device (#1756) - Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as
ddp_spawn
(#2115) - Added loading checkpoints from URLs (#1667)
- Added a callback method
on_keyboard_interrupt
for handling KeyboardInterrupt events during training (#2134) - Added a decorator
auto_move_data
that moves data to the correct device when using the LightningModule for inference (#1905) - Added
ckpt_path
option toLightningModule.test(...)
to load particular checkpoint (#2190) - Added
setup
andteardown
hooks for model (#2229)
Changed
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in
LRFinder
(#1862) - Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed
ModelCheckpoint
's attributesbest
tobest_model_score
andkth_best_model
tokth_best_model_path
(#1799) - Re-Enable Logger's
ImportError
s (#1938) - Changed the default value of the Trainer argument
weights_summary
fromfull
totop
(#2029) - Raise an error when lightning replaces an existing sampler (#2020)
- Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from tensorboard logger (#2126)
- Changed epoch indexing from 1 instead of 0 (#2206)
Deprecated
- Deprecated flags: (#2213)
overfit_pct
in favour ofoverfit_batches
val_percent_check
in favour oflimit_val_batches
test_percent_check
in favour oflimit_test_batches
- Deprecated
ModelCheckpoint
's attributesbest
andkth_best_model
(#1799) - Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Removed
- Removed unintended Trainer argument
progress_bar_callback
, the callback should be passed in byTrainer(callbacks=[...])
instead (#1855) - Removed obsolete
self._device
in Trainer (#1849) - Removed deprecated API (#2073)
- Packages:
pytorch_lightning.pt_overrides
,pytorch_lightning.root_module
- Modules:
pytorch_lightning.logging.comet_logger
,pytorch_lightning.logging.mlflow_logger
,pytorch_lightning.logging.test_tube_logger
,pytorch_lightning.overrides.override_data_parallel
,pytorch_lightning.core.model_saving
,pytorch_lightning.core.root_module
- Trainer arguments:
add_row_log_interval
,default_save_path
,gradient_clip
,nb_gpu_nodes
,max_nb_epochs
,min_nb_epochs
,nb_sanity_val_steps
- Trainer attributes:
nb_gpu_nodes
,num_gpu_nodes
,gradient_clip
,max_nb_epochs
,min_nb_epochs
,nb_sanity_val_steps
,default_save_path
,tng_tqdm_dic
- Packages:
Fixed
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of
EarlyStopping
callback (#1863) - Fixed an issue with
Trainer.from_argparse_args
when passing in unknown Trainer args (#1932) - Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
- Fixed
LearningRateLogger
in multi-scheduler setting (#1944) - Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
- Fixed
save_weights_only
in ModelCheckpoint (#1780) - Allow use of same
WandbLogger
instance for multiple training loops (#2055) - Fixed an issue with
_auto_collect_arguments
collecting local variables that are not constructor arguments and not working for signatures that have the instance not namedself
(#2048) - Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and
example_input_array
depending on a specific ordering of the submodules in a LightningModule (#1773) - Fixed Tpu logging (#2230)
- Fixed Pid port + duplicate
rank_zero
logging (#2140, #2231)
Contributors
@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Transfer learning, tuning batch size, torchelastic support
Overview
Highlights of this release are adding support for TorchElastic enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; new transfer learning example; an option to provide seed to random generators to ensure reproducibility.
Detail changes
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in
Trainer.fit()
andTrainer.test()
to reflect that also a list of dataloaders can be passed in (#1723). - Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in
training_epoch_end
(#1724) - Enable
NeptuneLogger
to work withdistributed_backend=ddp
(#1753) - Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in
load_from_ckpt
(#1797) - Added support multi-node distributed execution under
torchelastic
(#1811, #1818) - Added using
store_true
for bool args (#1822, #1842) - Added dummy logger for internally disabling logging for some features (#1836)
Changed
- Enable
non-blocking
for device transfers to GPU (#1843) - Replace mata_tags.csv with hparams.yaml (#1271)
- Reduction when
batch_size < num_gpus
(#1609) - Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert
namedtuple
totuple
when transferring the batch to target device (#1589) - Allow passing
hparams
as a keyword argument to LightningModule when loading from checkpoint (#1639) - Args should come after the last positional argument (#1807)
- Made DDP the default if no backend specified with multiple GPUs (#1789)
Deprecated
- Deprecated
tags_csv
in favor ofhparams_file
(#1271)
Fixed
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not None checking file path (#1654)
- Trainer now calls
on_load_checkpoint()
when resuming from a checkpoint (#1666) - Fixed sampler logic for DDP with the iterable dataset (#1734)
- Fixed
_reset_eval_dataloader()
for IterableDataset (#1560) - Fixed Horovod distributed backend to set the
root_gpu
property (#1669) - Fixed wandb logger
global_step
affects other loggers (#1492) - Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevent LP finder to be used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with
version_
when it shouldn't (#1748) - Fixed LR key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes wasn't being set properly and auto sampler was DDP failing (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native AMP + DDP (#1788)
- Fixed
hparam
logging with metrics (#1647)
Contributors
@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Critical DDP bug fixes
We made a few changes to Callbacks to test ops on detached GPU tensors to avoid CPU transfer. However, it made callbacks unpicklable which will crash DDP.
This release fixes that core issue
Changed
- Allow logging of metrics together with hparams (#1630)
- Allow metrics logged together with hparams (#1630)
Removed
- Removed Warning from trainer loop (#1634)
Fixed
- Fixed ModelCheckpoint not being fixable (#1632)
- Fixed CPU DDP breaking change and DDP change (#1635)
- Tested pickling (#1636)
Contributors
PyTorch 1.5 support, native PyTorch AMP, speed/memory optimizations and many bug fixes
Key updates
- PyTorch 1.5 support
- Added Horovod distributed_backend option
- Enable forward compatibility with the native AMP (PyTorch 1.6).
- Support 8-core TPU on Kaggle
- Added ability to customize progress_bar via Callbacks
- Speed/memory optimizations.
- Improved Argparse usability with Trainer
- Docs improvements
- Tons of bug fixes
Detail changes
Added
- Added flag
replace_sampler_ddp
to manually disaple sampler replacement in ddp (#1513) - Added speed parity tests (max 1 sec difference per epoch)(#1482)
- Added
auto_select_gpus
flag to trainer that enables automatic selection of available GPUs on exclusive mode systems. - Added learining rate finder (#1347)
- Added support for ddp mode in clusters without SLURM (#1387)
- Added
test_dataloaders
parameter toTrainer.test()
(#1434) - Added
terminate_on_nan
flag to trainer that performs a NaN check with each training iteration when set toTrue
(#1475) - Added speed parity tests (max 1 sec difference per epoch)(#1482)
- Added
terminate_on_nan
flag to trainer that performs a NaN check with each training iteration when set toTrue
. (#1475) - Added
ddp_cpu
backend for testing ddp without GPUs (#1158) - Added Horovod support as a distributed backend
Trainer(distributed_backend='horovod')
(#1529) - Added support for 8 core distributed training on Kaggle TPU's (#1568)
- Added support for native AMP (#1561, [#1580)
Changed
- Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
- Decoupled the progress bar from trainer. It is a callback now and can be customized or even be replaced entirely (#1450).
- Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
- Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
- Updated semantic segmentation example with custom u-net and logging (#1371)
- Disabled val and test shuffling (#1600)
Deprecated
- Deprecated
training_tqdm_dict
in favor ofprogress_bar_dict
(#1450).
Removed
- Removed
test_dataloaders
parameter fromTrainer.fit()
(#1434)
Fixed
- Added the possibility to pass nested metrics dictionaries to loggers (#1582)
- Fixed memory leak from opt return (#1528)
- Fixed saving checkpoint before deleting old ones (#1453)
- Fixed loggers - flushing last logged metrics even before continue, e.g.
trainer.test()
results (#1459) - Fixed optimizer configuration when
configure_optimizers
returns dict withoutlr_scheduler
(#1443) - Fixed
LightningModule
- mixing hparams and arguments inLightningModule.__init__()
crashes load_from_checkpoint() (#1505) - Added a missing call to the
on_before_zero_grad
model hook (#1493). - Allow use of sweeps with WandbLogger (#1512)
- Fixed a bug that caused the
callbacks
Trainer argument to reference a global variable (#1534). - Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
- Fixed do not copy the batch when training on a single GPU (#1576, [#1579)
- Fixed soft checkpoint removing on DDP (#1408)
- Fixed automatic parser bug (#1585)
- Fixed bool conversion from string (#1606)
Contributors
@alexeykarnachev, @areshytko, @awaelchli, @Borda, @borisdayma, @ethanwharris, @fschlatt, @HenryJia, @Ir1d, @justusschock, @karlinjf, @lezwon, @neggert, @rmrao, @rohitgr7, @SkafteNicki, @tgaddair, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
DDP bug fixes
We had a few (subtle) bugs that affected DDP and a few key things in 0.7.2 so we released 0.7.3 to fix them because they are critical for DDP. sorry about that! still, no API changes, but please do skip straight to 0.7.3 upgrade for those fixes
Detail changes
Added
- Added
rank_zero_warn
for warning only in rank 0 (#1428)
Fixed
- Fixed default
DistributedSampler
for DDP training (#1425) - Fixed workers warning not on windows (#1430)
- Fixed returning tuple from
run_training_batch
(#1431) - Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)
Contributors
Many bug fixes, added flexibility, parity tests with pytorch and more
Overview
This release aims at fixing particular issues and improving the user development experience via extending docs, adding typing and supporting python 3.8. In particular, some of the release highlights are:
- Added benchmark for comparing lightning with vanilla implementations
- Extended optimizer support with particular frequency
- Several improvements for loggers such as represent no-primitive types, supporting hierarchical dictionaries for hyper param searchers
- Added model configuration checking before it runs
- Simplify the PL examples structure (shallower and more readable)
- Improved Trainer CLI arguments handling (generalization)
- Two Trainer argument become deprecated:
print_nan_grads
andshow_progress_bar
Detail changes
Added
- Added same step loggers' metrics aggregation (#1278)
- Added parity test between a vanilla MNIST model and lightning model (#1284)
- Added parity test between a vanilla RNN model and lightning model (#1351)
- Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
- Added support for hierarchical
dict
(#1152) - Added
TrainsLogger
class (#1122) - Added type hints to
pytorch_lightning.core
(#946) - Added support for
IterableDataset
in validation and testing (#1104) - Added support for non-primitive types in
hparams
forTensorboardLogger
(#1130) - Added a check that stops the training when loss or weights contain
NaN
orinf
values. (#1097) - Added support for
IterableDataset
whenval_check_interval=1.0
(default), this will trigger validation at the end of each epoch. (#1283) - Added
summary
method to Profilers. (#1259) - Added informative errors if user defined dataloader has zero length (#1280)
- Added testing for python 3.8 (#915)
- Added a
training_epoch_end
method which is the mirror ofvalidation_epoch_end
. (#1357) - Added model configuration checking (#1199)
- Added support for optimizer frequencies through
LightningModule.configure_optimizers()
(#1269) - Added option to run without an optimizer by returning
None
fromconfigure_optimizers
. (#1279) - Added a warning when the number of data loader workers is small. (#1378)
Changed
- Changed (renamed and refactored)
TensorRunningMean
->TensorRunningAccum
: running accumulations were generalized. (#1278) - Changed
progress_bar_refresh_rate
trainer flag to disable progress bar when setting to 0. (#1108) - Enhanced
load_from_checkpoint
to also forward params to the model (#1307) - Updated references to self.forward() to instead use the
__call__
interface. (#1211) - Changed default behaviour of
configure_optimizers
to use no optimizer rather than Adam. (#1279) - Allow uploading models on W&B (#1339)
- On DP and DDP2 unsqueeze is automated now (#1319)
- Did not always create a DataLoader during reinstantiation, but the same type as before (if a subclass of DataLoader) (#1346)
- Did not interfere with a default sampler (#1318)
- Removed default Adam optimizer (#1317)
- Gave warnings for unimplemented required lightning methods (#1317)
- Made
evaluate
method private >>Trainer._evaluate(...)
. (#1260) - Simplify the PL examples structure (shallower and more readable) (#1247)
- Changed min-max GPU memory to be on their own plots (#1358)
- Remove
.item
which causes sync issues (#1254) - Changed smoothing in TQDM to decrease variability of time remaining between training/eval (#1194)
- Change default logger to a dedicated one (#1064)
Deprecated
- Deprecated Trainer argument
print_nan_grads
(#1097) - Deprecated Trainer argument
show_progress_bar
(#1108)
Removed
- Removed duplicated module
pytorch_lightning.utilities.arg_parse
for loading CLI arguments (#1167) - Removed wandb logger's
finalize
method (#1193) - Dropped
torchvision
dependency in tests and added own MNIST dataset class instead (#986)
Fixed
- Fixed
model_checkpoint
when saving all models (#1359) Trainer.add_argparse_args
classmethod fixed. Now it adds a type for the arguments (#1147)- Fixed bug related to type cheking of
ReduceLROnPlateau
lr schedulers(#1114) - Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
- Fixed a bug that created an extra dataloader with active
reload_dataloaders_every_epoch
(#1181) - Fixed all warnings and errors in the docs build process (#1191)
- Fixed an issue where
val_percent_check=0
would not disable validation (#1251) - Fixed average of incomplete
TensorRunningMean
(#1309) - Fixed
WandbLogger.watch
withwandb.init()
(#1311) - Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235)
- Fixed a bug that would cause
trainer.test()
to run on the validation set when overloadingvalidation_epoch_end
andtest_end
(#1353) - Fixed
WandbLogger.watch
- use of the watch method without importingwandb
(#1311) - Fixed
WandbLogger
to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360) - Made
training_epoch_end
behave likevalidation_epoch_end
(#1357) - Fixed
fast_dev_run
running validation twice (#1365) - Fixed pickle error from quick patch
__code__
(#1352) - Fixed memory leak on GPU0 (#1094, #1349)
- Fixed checkpointing interval (#1272)
- Fixed validation and training loops run the partial dataset (#1192)
- Fixed running
on_validation_end
only on main process in DDP (#1125) - Fixed
load_spawn_weights
only in proc rank 0 (#1385) - Fixes
use_amp
issue (#1145) - Fixes using deprecated
use_amp
attribute (#1145) - Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1375)
- Fixed
Unimplemented backend XLA
error on TPU (#1387)
Contributors
@alexeykarnachev, @amoudgl, @areshytko, @asafmanor, @awaelchli, @bkkaggle, @bmartinn, @Borda, @borisdayma, @cmpute, @djbyrne, @ethanwharris, @gerardrbentley, @jbschiratti, @jeremyjordan, @justusschock, @monney, @mpariente, @pertschuk, @rmrao, @S-aiueo32, @shubhamagarwal92, @SkafteNicki, @sneiman, @tullie, @vanpelt, @williamFalcon, @xingzhaolee
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor deprecation fix
Monir bug fix with print
issues and data_loader
(#1080)