This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

gradient error in Joint Optimization #25

Open
hongsiyu opened this issue Sep 26, 2022 · 6 comments

Comments

@hongsiyu

Shape pre-training completed successfully, but training fails in joint optimization with the following error:

2022-09-27 02:30:25.358618: E tensorflow/core/kernels/check_numerics_op.cc:289] abnormal_detected_host @0x7f43f6808a00 = {1, 0} Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo'
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
[[Identity_6/_372]]
(1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_45946]
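
For context, the message above comes from a numerics check on the gradient of the tensor labeled 'Albedo'. The sketch below is a plain-Python analogue of that kind of check, not TensorFlow's implementation; the function name `check_finite` and the sample values are made up for illustration.

```python
import math

def check_finite(values, message):
    """Raise if any value is NaN or +/-Inf; otherwise return the values unchanged.

    Hypothetical helper: it only mimics the idea of a numerics check.
    """
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        raise ValueError(f"{message}: non-finite values at indices {bad}")
    return values

# Illustrative gradients only; these are not values from the failing run.
good_grads = check_finite([0.1, -0.2, 0.05], "Albedo")

try:
    check_finite([0.1, float("nan"), float("inf")], "Albedo")
    caught = None
except ValueError as err:
    caught = str(err)
```

A check like this only reports where the non-finite values first show up; it does not say which earlier operation produced them.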

@hongsiyu
Author

I am using my own data, whose cameras were computed with COLMAP.

@Jiangyu1181

Turn down the learning rate. Like you, I trained on my own data (created with Blender). With the default learning rate (5e-3) I got the same error as you; after lowering the learning rate to 5e-4, everything was OK.
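
As a toy illustration of why a lower learning rate can make NaNs disappear, here is a minimal sketch of gradient descent on f(x) = x^4. It is unrelated to the NeRFactor model itself; the function name, starting point, and the learning rate 0.5 are assumptions chosen only to make the effect visible.

```python
import math

def first_nonfinite_step(lr, x0=3.0, n_steps=50):
    """Run gradient descent on f(x) = x**4 and return the first step at which
    x becomes NaN or Inf, or None if it stays finite for n_steps."""
    x = x0
    for step in range(1, n_steps + 1):
        grad = 4.0 * x * x * x   # derivative of x**4
        x = x - lr * grad        # plain gradient descent update
        if not math.isfinite(x):
            return step
    return None

diverged_at = first_nonfinite_step(lr=0.5)    # step too large: iterates blow up
stable = first_nonfinite_step(lr=0.005)       # smaller step: iterates stay finite
print(diverged_at, stable)
```

With the large step the iterates overshoot, grow without bound, and overflow to Inf, after which Inf minus Inf yields NaN. With the small step they shrink toward zero.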

@hongsiyu
Author

I have tried setting lr to 5e-4 and to 5e-5, and I still get the same error.

@Jiangyu1181

Did you override lr via config_override for Joint Optimization (both training and validation)? For example: --config_override="lr=$lr".
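
To make the question concrete, here is a minimal sketch of how a `key=value` override could take precedence over a value stored in an .ini file. This is not the repository's actual parsing code: the `[DEFAULT]` section layout, the comma-separated override format, and the function `effective_config` are all assumptions for illustration.

```python
import configparser

# Hypothetical config contents; the real .ini files may be laid out differently.
INI_TEXT = """
[DEFAULT]
lr = 5e-3
"""

def effective_config(ini_text, override=""):
    """Parse ini_text, then apply 'key=value' pairs from a comma-separated
    override string on top of the DEFAULT section (assumed format)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for pair in filter(None, (p.strip() for p in override.split(","))):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        parser["DEFAULT"][key.strip()] = value.strip()
    return parser

lr_from_file = effective_config(INI_TEXT).getfloat("DEFAULT", "lr")
lr_overridden = effective_config(INI_TEXT, "lr=5e-4").getfloat("DEFAULT", "lr")
print(lr_from_file, lr_overridden)
```

Whichever mechanism is used, printing the effective lr at the start of the run is a cheap way to confirm that the value being trained with is the one intended.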

@hongsiyu
Author

Yep, I changed lr directly in the config file shape_mvs.ini.

@hongsiyu
Author

I also changed lr in nerfactor_mvs.ini.


2 participants