
Issues with Resets and Memory Leak in Default Training #8

Open
isaac-racine opened this issue Nov 8, 2024 · 6 comments

@isaac-racine

Hello,

I'm testing the default training configuration "combo_go2ARX5_pickle_reaching_extreme" and ran into some issues that I could use help with.

Expected Training Outcome: Without modifying the code, should the robot be able to follow the tossing end-effector (EE) trajectory? For me, at around the 200-iteration mark, the robots start resetting instantly, seemingly due to a termination criterion. This behavior continues until the end of the full 20,000 iterations, so the results are not good.

GPU Memory Leak: Also starting at around the 200-iteration mark, GPU memory usage steadily increases over several hundred iterations until the training crashes. I made a modification in env.py to address this:

Original:

    self.obs_history = torch.cat((self.obs_history[:, 1:, :], obs.unsqueeze(1)), dim=1)

Modified:

    self.obs_history[:, :-1, :] = self.obs_history[:, 1:, :]
    self.obs_history[:, -1, :] = obs
This change seems to prevent the memory increase, but training results remain the same. Do you have any insights on this issue?
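
For reference, here is a standalone version of the same in-place update; the shapes, names, and device below are illustrative, not the repo's actual config:

    import torch

    # Illustrative buffer of per-env observation histories:
    # (num_envs, history_len, obs_dim)
    num_envs, history_len, obs_dim = 8, 10, 64
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    obs_history = torch.zeros(num_envs, history_len, obs_dim, device=device)

    def push_obs(obs_history: torch.Tensor, obs: torch.Tensor) -> None:
        # Shift the history window left by one step and overwrite the last
        # slot, reusing the existing buffer instead of building a new tensor
        # with torch.cat every step.
        obs_history[:, :-1, :] = obs_history[:, 1:, :]
        obs_history[:, -1, :] = obs

    for step in range(5):
        push_obs(obs_history, torch.randn(num_envs, obs_dim, device=device))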

System Specs:

OS: Ubuntu 22.04
GPU: NVIDIA GeForce RTX 4090 (24 GB memory)
Environment: Miniconda3 with IsaacGym_Preview_4

Thank you very much for your help!

@huy-ha (Member) commented Nov 8, 2024

I don't expect the instant resets. Just to clarify: you are using the exact code version on master right now with no modifications, the configuration in combo_go2ARX5_pickle_reaching_extreme without overriding any hyperparameters other than the ones included in the default command, and our task trajectory dataset?

For clarity, this is the default command with the default overrides I provided in the README:

python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 env.tasks.reaching.sequence_sampler.file_path=data/tossing.pkl

For the GPU memory leak, I've observed that Isaac Gym leaks memory due to contacts. It can happen much later in training (say, 20k iterations), but 200 is way too early.
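
One way to narrow it down is to check whether the growth shows up in PyTorch's allocator or only in nvidia-smi (which would point at Isaac Gym's own buffers). A rough logging sketch, not part of the repo:

    import torch

    def log_torch_gpu_memory(iteration: int, device: str = "cuda:0") -> None:
        # Memory managed by PyTorch's caching allocator. Buffers Isaac Gym
        # allocates outside PyTorch will not show up here; compare against
        # nvidia-smi to see the full picture.
        allocated_gb = torch.cuda.memory_allocated(device) / 1e9
        reserved_gb = torch.cuda.memory_reserved(device) / 1e9
        print(f"iter {iteration}: allocated {allocated_gb:.2f} GB, reserved {reserved_gb:.2f} GB")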

I have trained on a very similar if not identical system setup before, so I don't believe it's a systems issue.

@isaac-racine (Author)

Thank you for the fast reply!

Yes, I am using the current code on the main branch with no modifications, with your tossing.pkl dataset and the default command you showed. I'm going to try it on a different PC with a fresh Ubuntu 22.04 install and will let you know if it works properly.

@isaac-racine (Author)

Update

I tried running the training on a new PC and the results are the same. Without changing the code, GPU memory starts to increase and the training crashes a little after 500 iterations. Training seems to go well before that, so it must be the contacts.

@yolo01826

(image attached)
I encountered the same issue in the cup placement task without modifying the code.
(Ubuntu 20.04, RTX 4090)

@huy-ha (Member) commented Nov 13, 2024

Ah! When using cup-in-the-wild trajectories, you should add some z height to the trajectory. For instance, adding 'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]' to the command should randomly add 40 cm to 50 cm to the trajectory's z coordinate, which is more realistic for this particular task. This also prevents the robot from crawling on the ground the entire time, which should keep the contact buffers from taking up so much memory.
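
For reference, the full command would then look something like the following; the dataset path below is a placeholder for your own cup-in-the-wild trajectory file:

    python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 \
        env.tasks.reaching.sequence_sampler.file_path=<path to cup-in-the-wild trajectories> \
        'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]'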

@yolo01826

Thanks a lot, it works 🌹
