
Issues with Resets and Memory Leak in Default Training #8

Open
isaac-racine opened this issue Nov 8, 2024 · 6 comments

@isaac-racine

Hello,

I'm testing the default training configuration "combo_go2ARX5_pickle_reaching_extreme" and ran into some issues that I could use help with.

Expected Training Outcome: Without modifying the code, should the robot be able to follow the tossing end-effector (EE) trajectory? For me, at around the 200-iteration mark, the robots start resetting instantly, seemingly due to a termination criterion. This behavior continues until the end of the full 20,000 iterations, so the results are not good.

GPU Memory Leak: Also starting at around the 200-iteration mark, GPU memory usage steadily increases over several hundred iterations until the training crashes. I made a modification in env.py to address this:

Original:

    self.obs_history = torch.cat((self.obs_history[:, 1:, :], obs.unsqueeze(1)), dim=1)

Modified:

    self.obs_history[:, :-1, :] = self.obs_history[:, 1:, :]
    self.obs_history[:, -1, :] = obs
This change seems to prevent the memory increase, but training results remain the same. Do you have any insights on this issue?
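
For reference, here is a standalone version of the same in-place update; the shapes, names, and device below are illustrative, not the repo's actual config:

    import torch

    # Illustrative buffer of per-env observation histories:
    # (num_envs, history_len, obs_dim)
    num_envs, history_len, obs_dim = 8, 10, 64
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    obs_history = torch.zeros(num_envs, history_len, obs_dim, device=device)

    def push_obs(obs_history: torch.Tensor, obs: torch.Tensor) -> None:
        # Shift the history window left by one step and overwrite the last
        # slot, reusing the existing buffer instead of building a new tensor
        # with torch.cat every step.
        obs_history[:, :-1, :] = obs_history[:, 1:, :]
        obs_history[:, -1, :] = obs

    for step in range(5):
        push_obs(obs_history, torch.randn(num_envs, obs_dim, device=device))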

System Specs:

OS: Ubuntu 22.04
GPU: NVIDIA GeForce RTX 4090 (24 GB memory)
Environment: Miniconda3 with IsaacGym_Preview_4

Thank you very much for your help!

@huy-ha (Member) commented Nov 8, 2024

I don't expect the instant resets. Just to clarify: you are using the exact code version on master right now with no modifications, the configuration in combo_go2ARX5_pickle_reaching_extreme without overriding any hyperparameters other than the ones included in the default command, and our task trajectory dataset?

For clarity, this is the default command with the default overrides I provided in the README:

python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 env.tasks.reaching.sequence_sampler.file_path=data/tossing.pkl

For the GPU memory leak, I've observed that Isaac Gym leaks memory due to contacts. It can happen much later in training (say, 20k iterations), but 200 is way too early.
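
One way to narrow it down is to check whether the growth shows up in PyTorch's allocator or only in nvidia-smi (which would point at Isaac Gym's own buffers). A rough logging sketch, not part of the repo:

    import torch

    def log_torch_gpu_memory(iteration: int, device: str = "cuda:0") -> None:
        # Memory managed by PyTorch's caching allocator. Buffers Isaac Gym
        # allocates outside PyTorch will not show up here; compare against
        # nvidia-smi to see the full picture.
        allocated_gb = torch.cuda.memory_allocated(device) / 1e9
        reserved_gb = torch.cuda.memory_reserved(device) / 1e9
        print(f"iter {iteration}: allocated {allocated_gb:.2f} GB, reserved {reserved_gb:.2f} GB")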

I have trained on a very similar if not identical system setup before, so I don't believe it's a systems issue.

@isaac-racine (Author)

Thank you for the fast reply!

Yes, I am using the current code on the main branch with no modifications, with your tossing.pkl dataset and the default command you showed. I'm going to try it on a different PC with a fresh Ubuntu 22.04 install and will let you know if it works properly.

@isaac-racine (Author)

Update

I tried running the training on a new PC and the results are the same. Without changing the code, GPU memory starts to increase and the training crashes a little after 500 iterations. Training seems to go well before that, so it must be the contacts.

@yolo01826

(image attached)
I encountered the same issue in the cup placement task without modifying the code.
(Ubuntu 20.04, RTX 4090)

@huy-ha (Member) commented Nov 13, 2024

Ah! When using cup-in-the-wild trajectories, you should add some z height to the trajectory. For instance, adding 'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]' to the command should randomly add 40 cm to 50 cm to the trajectory's z coordinate, which is more realistic for this particular task. This also prevents the robot from crawling on the ground the entire time, which should keep the contact buffers from taking up so much memory.
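
For reference, the full command would then look something like the following; the dataset path below is a placeholder for your own cup-in-the-wild trajectory file:

    python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 \
        env.tasks.reaching.sequence_sampler.file_path=<path to cup-in-the-wild trajectories> \
        'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]'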

@yolo01826

Thanks a lot, it works 🌹
