DLR-RM · araffin · Nov 8, 2024
diff --git a/docs/conda_env.yml b/docs/conda_env.yml
@@ -14,6 +14,6 @@ dependencies:
     - pandas
     - numpy>=1.20,<2.0
     - matplotlib
-    - sphinx>=5,<8
+    - sphinx>=5,<9
     - sphinx_rtd_theme>=1.3.0
     - sphinx_copybutton
diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst
@@ -59,6 +59,7 @@ Bug Fixes:
 `SBX`_ (SB3 + Jax)
 ^^^^^^^^^^^^^^^^^^
 - Added CNN support for DQN
+- Fixed a bug in SAC and related algorithms: the log of the entropy coefficient is now optimized, consistent with SB3
 
 Deprecations:
 ^^^^^^^^^^^^^
@@ -80,6 +81,7 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Updated PPO doc to recommend using CPU with ``MlpPolicy``
 - Clarified documentation about planned features and citing software
+- Added a note that SAC optimizes the log of the entropy coefficient
 
 Release 2.3.2 (2024-04-27)
 --------------------------

diff --git a/docs/modules/dqn.rst b/docs/modules/dqn.rst
@@ -25,6 +25,7 @@ Notes
 
 - Original paper: https://arxiv.org/abs/1312.5602
 - Further reference: https://www.nature.com/articles/nature14236
+- Tutorial "From Tabular Q-Learning to DQN": https://github.com/araffin/rlss23-dqn-tutorial
 
 .. note::
     This implementation provides only vanilla Deep Q-Learning and has no extensions such as Double-DQN, Dueling-DQN and Prioritized Experience Replay.

diff --git a/docs/modules/sac.rst b/docs/modules/sac.rst
@@ -35,6 +35,9 @@ Notes
     which is the equivalent to the inverse of reward scale in the original SAC paper.
     The main reason is that it avoids having too high errors when updating the Q functions.
 
+.. note::
+    When automatically adjusting the temperature (alpha/entropy coefficient), we optimize the logarithm of the entropy coefficient instead of the entropy coefficient itself. This is consistent with the original implementation and has proven to be more stable
+    (see issues `GH#36 <https://github.com/DLR-RM/stable-baselines3/issues/36>`_, `#55 <https://github.com/araffin/sbx/issues/55>`_ and others).
 
 .. note::
 

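Review note on the SAC hunk above: the sketch below illustrates why optimizing the logarithm keeps the entropy coefficient positive. It is a minimal pure-Python sketch, not the library's code. The function name `update_log_ent_coef` is hypothetical, the loss form `-log_ent_coef * mean(log_prob + target_entropy)` is an assumption, and plain gradient descent stands in for whatever optimizer the real implementation uses.

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr=3e-4):
    """One plain gradient-descent step on the log of the entropy coefficient.

    Assumed loss (hypothetical, for illustration only):
        L = -log_ent_coef * mean(log_prob + target_entropy)
    so dL/d(log_ent_coef) = -mean(log_prob + target_entropy).
    """
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_ent_coef - lr * grad


# Start from ent_coef = 1.0, i.e. log_ent_coef = 0.0.
log_ent_coef = math.log(1.0)

# Low log-probabilities mean the policy entropy is above the target,
# so every step pushes the coefficient down.
for _ in range(100):
    log_ent_coef = update_log_ent_coef(
        log_ent_coef, log_probs=[-5.0, -4.0], target_entropy=-1.0, lr=0.1
    )

# Exponentiating recovers the coefficient, which is positive by construction
# even though log_ent_coef itself has become strongly negative.
ent_coef = math.exp(log_ent_coef)
```

Taking the same steps directly on `ent_coef` would cross zero after a few updates, which is one way to read the stability remark in the note.
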
diff --git a/setup.py b/setup.py
@@ -101,7 +101,7 @@
             "black>=24.2.0,<25",
         ],
         "docs": [
-            "sphinx>=5,<8",
+            "sphinx>=5,<9",
             "sphinx-autobuild",
             "sphinx-rtd-theme>=1.3.0",
             # For spelling