Issues with Resets and Memory Leak in Default Training #8
I don't expect the instant resets. Just to clarify: are you using the exact code currently on master with no modifications, and the default command with the default overrides I provided in the README?
As for the GPU memory leak, I've observed that Isaac Gym leaks memory due to contacts. It can happen much later in training (say, 20k iterations), but 200 is far too early. I have trained on a very similar, if not identical, system setup before, so I don't believe it's a systems issue.
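If it helps to pin down when the growth starts, below is a minimal sketch of per-iteration CUDA memory logging. This is not code from the repo; the function name and the idea of calling it once per training iteration are assumptions, and Isaac Gym allocates some buffers outside of PyTorch, so these counters may understate real usage.

```python
import torch

def log_cuda_memory(iteration: int, device: int = 0) -> None:
    """Print PyTorch's CUDA memory counters for one training iteration.

    Note: Isaac Gym allocates some simulation buffers outside of PyTorch,
    so cross-check these numbers against `nvidia-smi`.
    """
    allocated_mib = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved_mib = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"iter {iteration}: allocated={allocated_mib:.1f} MiB, "
          f"reserved={reserved_mib:.1f} MiB")
```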
Thank you for the fast reply! Yes, I am using the current code on the main branch with no modifications, with your tossing.pkl dataset, and the default command that you showed. I am going to try it on a different PC with a fresh Ubuntu 22.04 install and let you know if it works properly.
Update: I tried running the training on a new PC and the results are the same. Without changing the code, GPU memory starts to increase and the training crashes a little after 500 iterations. Training seems to go well before that, so it must be the contacts.
Ah! When using cup-in-the-wild trajectories, you should add some z height to the trajectory. For instance, adding 'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]' to the command should randomly add 40 cm to 50 cm to the trajectory's z coordinate, which is more realistic for this particular task. This also keeps the robot from crawling on the ground the entire time, which should prevent the contact buffers from taking up so much memory.
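Roughly speaking, the override shifts the sampled end-effector trajectory upward, as in the sketch below. This is illustrative only, not code from the repo; the function name, the (T, 3) tensor shape, and the assumption that a single random offset is applied per sampled trajectory are all hypothetical.

```python
import torch

def add_random_height(ee_traj: torch.Tensor, height_range=(0.4, 0.5)) -> torch.Tensor:
    """Shift the z coordinate of a sampled end-effector trajectory upward.

    ee_traj: (T, 3) tensor of EE positions in meters. One random offset drawn
    from `height_range` is added to the z column of the whole trajectory.
    """
    low, high = height_range
    offset = torch.empty(1, device=ee_traj.device).uniform_(low, high)
    ee_traj = ee_traj.clone()
    ee_traj[:, 2] += offset  # raise the entire trajectory by 0.4-0.5 m
    return ee_traj
```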
Thanks a lot, it works 🌹
Hello,
I'm testing the default training configuration "combo_go2ARX5_pickle_reaching_extreme" and ran into some issues that I could use help with.
Expected Training Outcome: Without modifying the code, should the robot be able to follow the tossing end-effector (EE) trajectory? For me, at around the 200-iteration mark, the robots start resetting instantly, seemingly due to a termination criterion. This behavior continues through the end of the full 20,000 iterations, so the results are not good.
GPU Memory Leak: Also starting at around the 200-iteration mark, GPU memory usage steadily increases over several hundred iterations until training crashes. I made a modification in env.py to address this:
Original:
self.obs_history = torch.cat((self.obs_history[:, 1:, :], obs.unsqueeze(1)), dim=1)
Modified:
This change seems to prevent the memory increase, but training results remain the same. Do you have any insights on this issue?
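For illustration, one in-place variant of the `torch.cat` line above could look like the sketch below, assuming `obs_history` has shape `(num_envs, history_len, obs_dim)`. This is a guess at the kind of change described and not necessarily the exact modification that was applied.

```python
import torch

def update_obs_history(obs_history: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    """Sliding-window update that avoids allocating a new history tensor each step.

    obs_history: (num_envs, history_len, obs_dim); obs: (num_envs, obs_dim).
    """
    with torch.no_grad():
        obs_history[:, :-1, :] = obs_history[:, 1:, :].clone()  # drop the oldest frame
        obs_history[:, -1, :] = obs                             # write the newest frame
    return obs_history
```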
System Specs:
Thank you very much for your help!