Unable to Save JIT after training ACT #29

Open
razataiab opened this issue Aug 21, 2024 · 0 comments
After setting up and cross-checking teleop_hand.py, we confirmed that it streams the 3D hands as expected.

Moving on to the Training Guide, we set up the dataset from the provided drive and processed it successfully.

After training ACT, this is the output we got:

python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt


Task name: 00-can-sorting


wandb: Currently logged in as: ayaans1804 (ayaans1804-nottingham-trent-university). Use wandb login --relogin to force relogin
wandb: wandb version 0.17.7 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../data/logs/wandb/run-20240821_202622-8p5d9thq
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run 01-sample-expt
wandb: ⭐️ View project at https://wandb.ai/ayaans1804-nottingham-trent-university/television
wandb: 🚀 View run at https://wandb.ai/ayaans1804-nottingham-trent-university/television/runs/8p5d9thq

Data from: /home/robot/Desktop/TeleVision/data/recordings/00-can-sorting/processed

Train episodes: 9, Val episodes: 1
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Using cache found in /home/robot/.cache/torch/hub/facebookresearch_dinov2_main
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:51: UserWarning: xFormers is not available (SwiGLU)
warnings.warn("xFormers is not available (SwiGLU)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:33: UserWarning: xFormers is not available (Attention)
warnings.warn("xFormers is not available (Attention)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:40: UserWarning: xFormers is not available (Block)
warnings.warn("xFormers is not available (Block)")
number of parameters: 94.75M
KL Weight 10
0%| | 0/50000 [00:00<?, ?it/s]
Epoch 0
Val loss: 83.19358
val/l1: 0.878 val/kl: 8.232 val/loss: 83.194
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
0%| | 0/50000 [00:27<?, ?it/s]
Traceback (most recent call last):
File "imitate_episodes.py", line 367, in <module>
main(args)
File "imitate_episodes.py", line 131, in main
best_ckpt_info = train_bc(train_dataloader, val_dataloader, config)
File "imitate_episodes.py", line 241, in train_bc
forward_dict = forward_pass(data, policy)
File "imitate_episodes.py", line 173, in forward_pass
return policy(qpos_data, image_data, action_data, is_pad) # TODO remove None
File "/home/robot/Desktop/TeleVision/act/policy.py", line 58, in __call__
a_hat, is_pad_hat, (mu, logvar) = self.model(qpos, image, env_state, actions, is_pad)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/robot/Desktop/TeleVision/act/detr/models/detr_vae.py", line 149, in forward
hs = self.transformer(src, None, self.query_embed.weight, pos, latent_input, proprio_input, self.additional_pos_embed.weight)[0]
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 73, in forward
memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 94, in forward
output = layer(output, src_mask=mask,
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 201, in forward
return self.forward_post(src, src_mask, src_key_padding_mask, pos)
File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 176, in forward_post
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/functional.py", line 1500, in relu
result = torch.relu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 388.00 MiB. GPU
wandb: | 0.016 MB of 0.016 MB uploaded
wandb: Run history:
wandb: val/kl ▁
wandb: val/l1 ▁
wandb: val/loss ▁
wandb:
wandb: Run summary:
wandb: val/kl 8.23156
wandb: val/l1 0.87797
wandb: val/loss 83.19358
wandb:
wandb: 🚀 View run 01-sample-expt at: https://wandb.ai/ayaans1804-nottingham-trent-university/television/runs/8p5d9thq
wandb: ⭐️ View project at: https://wandb.ai/ayaans1804-nottingham-trent-university/television
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../data/logs/wandb/run-20240821_202622-8p5d9thq/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
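The crash happens inside the transformer encoder's feed-forward layer, so the OOM is likely dominated by activation memory, which scales with batch size, sequence length, and `--dim_feedforward`. As a rough sanity check (a sketch, not the project's code; the sequence length below is a hypothetical value, the other numbers come from the flags above):

```python
# Rough estimate of one FFN activation tensor in a transformer encoder layer.
# The sequence length is a hypothetical assumption; batch_size=45 and
# dim_feedforward=3200 are taken from the command line above.

def ffn_activation_mib(batch_size, seq_len, dim_feedforward, bytes_per_elem=4):
    """MiB needed for a single fp32 (seq_len, batch, dim_feedforward) activation."""
    return batch_size * seq_len * dim_feedforward * bytes_per_elem / 2**20

big = ffn_activation_mib(45, 660, 3200)    # batch size from the failing run
small = ffn_activation_mib(16, 660, 3200)  # same model, smaller batch

print(f"batch 45: {big:.0f} MiB, batch 16: {small:.0f} MiB")
```

A single intermediate at batch size 45 is already in the same ballpark as the 388 MiB allocation that failed. Reducing `--batch_size` shrinks these intermediates proportionally, which is usually the cheapest fix for this class of OOM; gradient accumulation can recover the effective batch size if needed.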

And this is the output when we try to save JIT:

python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt --save_jit --resume_ckpt 25000


Task name: 00-can-sorting


Data from: /home/robot/Desktop/TeleVision/data/recordings/00-can-sorting/processed

Train episodes: 9, Val episodes: 1
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Using cache found in /home/robot/.cache/torch/hub/facebookresearch_dinov2_main
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:51: UserWarning: xFormers is not available (SwiGLU)
warnings.warn("xFormers is not available (SwiGLU)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:33: UserWarning: xFormers is not available (Attention)
warnings.warn("xFormers is not available (Attention)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:40: UserWarning: xFormers is not available (Block)
warnings.warn("xFormers is not available (Block)")
number of parameters: 94.75M
KL Weight 10


Resuming from /home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt/policy_epoch_25000_seed_0.ckpt


Traceback (most recent call last):
File "imitate_episodes.py", line 367, in <module>
main(args)
File "imitate_episodes.py", line 128, in main
save_jit(config)
File "imitate_episodes.py", line 317, in save_jit
policy, ckpt_name, epoch = load_ckpt(policy, exp_dir, config['resume_ckpt'])
File "imitate_episodes.py", line 304, in load_ckpt
policy.load_state_dict(torch.load(resume_path))
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 997, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 444, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 425, in __init__
super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt/policy_epoch_25000_seed_0.ckpt'
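Since the training run above crashed at epoch 0, policy_epoch_25000_seed_0.ckpt was never written, so `--resume_ckpt 25000` points at a checkpoint that does not exist. One way to see which epochs are actually available before picking a value (a sketch; the directory layout and filename pattern are taken from the log above, not from the project's code):

```python
import re
from pathlib import Path

def available_epochs(exp_dir):
    """Return the sorted epoch numbers of checkpoints saved in exp_dir."""
    pattern = re.compile(r"policy_epoch_(\d+)_seed_\d+\.ckpt")
    epochs = []
    for ckpt in Path(exp_dir).glob("policy_epoch_*_seed_*.ckpt"):
        m = pattern.fullmatch(ckpt.name)
        if m:
            epochs.append(int(m.group(1)))
    return sorted(epochs)

exp_dir = "/home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt"
print(available_epochs(exp_dir))  # pass one of these epochs to --resume_ckpt
```

If this prints an empty list, no checkpoint survived the OOM crash and training has to complete (or at least reach a save point) before `--save_jit` can work.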
