About value_norm in ppo #172

t13m · 2022-01-02T09:04:41Z

If value_norm is set to True during PPO training, the value is multiplied by its std, as in the following code.

DI-engine/ding/policy/ppo.py

Line 176 in 0b71fc4

value *= self._running_mean_std.std

Can anyone elaborate on the reason behind this? If I understand correctly, the value should be divided by std rather than multiplied. Am I missing something?

Another question: is there any example project implementing A3C with DI-engine?

Thank you

puyuan1996 · 2022-01-03T16:31:57Z

puyuan1996
Jan 3, 2022
Maintainer

First of all, thank you very much for your question.

The key insight of value normalization is that neural networks fit normalized data more easily. For the principle and experimental results of value normalization, you can refer to the following papers.

In the single-agent scenario: What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study, Section 3.3 (Normalization and clipping) and Appendix B.9.
In the multi-agent scenario: The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games, Section 3.1 (Value Normalization).

We use the normalized value in the critic loss calculation and the original unnormalized value in the advantage calculation, because a separate key, adv_norm, decides whether the advantages are additionally normalized.

The following mainly explains our implementation.
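As a rough illustration of what an adv_norm-style switch typically does, here is a minimal sketch that standardizes a batch of advantages. The helper name normalize_advantages, the eps value, and the use of the population standard deviation are assumptions made for illustration, not DI-engine's actual code.

```python
import statistics


def normalize_advantages(adv, eps=1e-8):
    """Standardize a batch of advantages to (approximately) zero mean, unit std.

    Hypothetical helper: uses the population std; eps guards against a zero std.
    """
    mean = statistics.fmean(adv)
    std = statistics.pstdev(adv)
    return [(a - mean) / (std + eps) for a in adv]


adv = [1.0, 2.0, 3.0, 4.0]          # advantages in the reward's own scale
norm_adv = normalize_advantages(adv)
print(norm_adv)
```

Because this step rescales the advantages on its own, the advantages can be computed from unnormalized values first and normalized afterwards, independently of value_norm.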

Here in this line:

DI-engine/ding/policy/ppo.py

Line 173 in 0b71fc4

value = self._learn_model.forward(data['obs'], mode='compute_critic')['value']

the value is the raw output of the value network, which is expected to stay in the normalized space for the reason given above. Note that in our implementation we only normalize the value to unit variance, not zero mean, because our experiments show that subtracting the mean has no particularly obvious benefit on most tasks. The value network regresses to the normalized values during value learning, so in this line:

DI-engine/ding/policy/ppo.py

Line 176 in 0b71fc4

value *= self._running_mean_std.std

we multiply by std to denormalize the output of the value network, because in this line:

DI-engine/ding/policy/ppo.py

Line 180 in 0b71fc4

data['adv'] = gae(compute_adv_data, self._gamma, self._gae_lambda)

it is necessary to compute the advantage using the unnormalized values. Why do we need unnormalized values? Because the reward is usually unnormalized, and the values must be on the same scale as the reward for the original, well-defined advantage to be computed correctly.

Also note that when we compute the ppo_loss in this line:
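To make the scale argument concrete, here is a minimal sketch of GAE on plain Python lists, fed with critic outputs that are first multiplied by a running std. The gae function below is a simplified stand-in written for this example (it is not the DI-engine gae), and the std, rewards, and critic outputs are made-up numbers.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory (illustrative version)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD error: reward and values must share the same scale here.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


std = 100.0                              # hypothetical running std of returns
norm_values = [0.5, 0.6, 0.7]            # raw critic outputs (normalized space)
norm_next_values = [0.6, 0.7, 0.0]
rewards = [10.0, 10.0, 10.0]             # rewards in the environment's scale
dones = [False, False, True]

# Denormalize before GAE so that values and rewards are comparable.
values = [v * std for v in norm_values]
next_values = [v * std for v in norm_next_values]
adv = gae(rewards, values, next_values, dones)
print(adv)
```

If the normalized outputs (around 0.5) were passed in directly, the TD error would be dominated by the raw rewards (around 10), and the result would no longer be the advantage of the original problem.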

DI-engine/ding/policy/ppo.py

Line 213 in 0b71fc4

ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio)

the PPO value_loss is used to backpropagate the gradients that update the value network, so in the calculation of value_loss:

DI-engine/ding/rl_utils/ppo.py

Line 203 in 0b71fc4

value_clip = value_old + (value_new - value_old).clamp(-clip_ratio, clip_ratio)

value_old, i.e. batch['value'], should be the raw output of the value network (i.e. the normalized value), so in this line:

DI-engine/ding/policy/ppo.py

Line 185 in 0b71fc4

data['value'] = value / self._running_mean_std.std

we must divide the unnormalized value by std and save it to data['value'].

The same applies to the normalization and denormalization of return and next_value.

I hope this answers some of your questions.

Thanks a lot.
Best wishes.
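The whole round trip can be sketched in plain Python as follows. RunningStd and clipped_value_loss are hypothetical helpers written for this example (the second one only mirrors the shape of the value_clip line quoted above), and all numbers are made up; treat it as an illustration of the data flow under these assumptions, not as the library code.

```python
import math


class RunningStd:
    """Hypothetical running std tracker (Welford update, population variance)."""

    def __init__(self, eps=1e-4):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, xs):
        for x in xs:
            self.count += 1
            d = x - self.mean
            self.mean += d / self.count
            self.m2 += d * (x - self.mean)

    @property
    def std(self):
        if self.count == 0:
            return 1.0  # no data yet: leave values unscaled
        return math.sqrt(self.m2 / self.count) + self.eps


def clipped_value_loss(value_new, value_old, return_, clip_ratio=0.2):
    """Clipped value loss; every argument lives in the normalized space."""
    total = 0.0
    for vn, vo, r in zip(value_new, value_old, return_):
        v_clip = vo + max(-clip_ratio, min(clip_ratio, vn - vo))
        total += 0.5 * max((r - vn) ** 2, (r - v_clip) ** 2)
    return total / len(value_new)


rms = RunningStd()
rms.update([100.0, 300.0, 500.0])       # pretend: returns seen so far
std = rms.std

critic_out = [0.5, 1.0, 1.5]            # raw critic output (normalized)
value = [v * std for v in critic_out]   # 1) denormalize for advantage estimation
adv = [20.0, -10.0, 5.0]                # 2) pretend GAE output, reward scale
unnormalized_return = [v + a for v, a in zip(value, adv)]

stored_value = [v / std for v in value]                 # 3) back to normalized
stored_return = [r / std for r in unnormalized_return]  #    space for the critic
rms.update(unnormalized_return)                         # 4) refresh statistics

loss = clipped_value_loss(critic_out, stored_value, stored_return)
print(std, loss)
```

The multiplication and the division cancel for the stored value, so the critic loss compares like with like, while the advantage is still computed in the reward's scale.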

1 reply

t13m Jan 4, 2022
Author

Thanks very much for the detailed answer! It helps a lot.

I was trying to use PPO in one of my custom environments (a custom wrapped gobigger environment, actually). It didn't work out because the value exploded. The value starts at about 1e3, increases rapidly after the denormalization, and explodes to inf within several training iterations. I guess the problem probably lies in the design of the reward; I will dig in later. Forgive me for the stupid questions, and thank you!

PaParaZz1 · 2022-01-04T05:03:54Z

PaParaZz1
Jan 4, 2022
Maintainer

As for A3C, we chose to implement reduce-gradient methods, so we abandoned some sync gradient methods such as A3C. Why do you want to use A3C? For faster training speed? We think other alternatives can be better. Can you offer more details about your needs?

3 replies

t13m Jan 4, 2022
Author

You mean '...abandon some async gradients methods...'?

Yes, I thought an async actor-critic method was worth a try for faster training speed and for better utilization of computation resources. I'm new to the RL field and currently trying to put my effort into the gobigger-challenge (with limited resources). What other approaches would you suggest as alternatives?

PaParaZz1 Jan 4, 2022
Maintainer

If you want to speed up your training program or improve the utilization of computation resources, you first need to profile the program and monitor resource usage to find the key bottleneck or overhead in the whole training pipeline (env simulation, data processing, model inference, or any other component), then select a proper scheme to solve it (such as replacing a for-loop with a parallel computation API when data processing is inefficient).

As for distributed RL algorithms, if you find that your data collection time and model training time are close, you can try Ape-X or IMPALA to execute them asynchronously.
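A minimal way to start on the profiling step is to time each pipeline stage with the standard library. In the sketch below, the stage names and the collect, process, and train functions are placeholders (they only sleep), so the numbers are meaningless; the point is the pattern of accumulating wall-clock time per stage and then looking for the largest share.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


# Placeholder stages: replace with the real collector, data pipeline and learner.
def collect():
    time.sleep(0.02)


def process():
    time.sleep(0.005)


def train():
    time.sleep(0.01)


for _ in range(3):
    with timed("env_simulation"):
        collect()
    with timed("data_processing"):
        process()
    with timed("model_training"):
        train()

total = sum(timings.values())
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:16s} {seconds:.3f}s ({100 * seconds / total:.0f}%)")
bottleneck = max(timings, key=timings.get)
```

Once a stage clearly dominates, a finer-grained tool such as cProfile can be pointed at that stage alone.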

t13m Jan 4, 2022
Author

Thanks for your response, it's very helpful. Will dig in to see what to do next.
