
Hardware specifications for IMN model #23

Open
KJGithub2021 opened this issue Sep 26, 2023 · 14 comments

@KJGithub2021

KJGithub2021 commented Sep 26, 2023

Hello team @JasonForJoy,
Is there any document that lists the hardware requirements of the model, i.e. the minimum system specs needed to run it on the original UDC (900k training dialogs) with the default training parameters (i.e. 1,000,000 epochs and a batch_size of 128, with evaluation every 1,000 steps)?

I have tried running it on a Colab Pro account with a premium A100 GPU and high RAM enabled, using only a reduced dataset (around 10,000 training dialogs) for just 4 epochs, and it still took an extremely long time, around 15+ hours for those 4 epochs alone!

Am I missing something here, and can you help me speed up the training time?

Looking forward to your help.
Regards.

@JasonForJoy
Owner

@KJGithub2021
All experiments were run on a single NVIDIA GeForce 1080 (12G) GPU card.
The default training parameters are 10 epochs and a batch_size of 96, with evaluation every 1000 steps. You can check them via the default script at https://github.com/JasonForJoy/IMN/blob/master/Ubuntu_V1/scripts/ubuntu_train.sh
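For reference, these defaults are the kind of values train.py exposes as TF 1.x command-line flags and the shell script passes in. The sketch below is illustrative only; the actual flag names in the IMN repository may differ.

    import tensorflow as tf  # TF 1.x API assumed, matching the repo

    # Illustrative TF 1.x flag definitions; flag names are assumptions,
    # not necessarily the ones used in the IMN code.
    tf.flags.DEFINE_integer("batch_size", 96, "Batch size used during training")
    tf.flags.DEFINE_integer("num_epochs", 10, "Passes over the training set")
    tf.flags.DEFINE_integer("evaluate_every", 1000, "Evaluate on the dev set every N steps")
    FLAGS = tf.flags.FLAGS

    # The same values would typically be supplied by scripts/ubuntu_train.sh, e.g.
    #   python train.py --batch_size 96 --num_epochs 10 --evaluate_every 1000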

@KJGithub2021
Author

@JasonForJoy Thanks for your reply. Yes, I had already checked the .sh files, but the default configuration I was referring to is the one given in the train.py source file. Anyhow, I will use the one you pointed out.

Secondly, can you let me know how long the model took to complete training and evaluation on the specs you mentioned?

@JasonForJoy
Owner

@KJGithub2021 It took about 90h (including evaluation on the dev set every 1000 steps) under the default setting, i.e., 10 epochs and 96 batch_size on a single NVIDIA GeForce 1080 (12G) GPU card.

@KJGithub2021
Author

KJGithub2021 commented Sep 27, 2023

@JasonForJoy And is this time on the original UDC V2 dataset, which consists of 957,101 training dialogs and 19,560 validation dialogs?
Secondly, can this time (i.e. ~4 days of training and evaluation) be reduced further? Is there any room for model optimization?

@JasonForJoy
Owner

@KJGithub2021
About 50h on the Ubuntu V2 dataset.
You might try the following (a rough sketch of where these settings live is given after the list):

  1. Enlarge the batch_size on a more advanced GPU card
  2. Evaluate less frequently, e.g. every 2k steps
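Both knobs usually show up in a TF 1.x-style training loop along the following lines. This is a minimal, self-contained sketch with a toy model standing in for IMN; the variable names are assumptions rather than the actual code.

    import numpy as np
    import tensorflow as tf  # TF 1.x API assumed, matching the repo

    batch_size = 192        # suggestion 1: larger batch on a bigger GPU card
    evaluate_every = 2000   # suggestion 2: evaluate every 2k steps instead of 1k
    num_epochs = 10

    # Toy model in place of IMN, just to make the loop runnable.
    x = tf.placeholder(tf.float32, [None, 8])
    y = tf.placeholder(tf.float32, [None, 1])
    loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

    train_x, train_y = np.random.rand(10000, 8), np.random.rand(10000, 1)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(num_epochs):
            for i in range(0, len(train_x), batch_size):
                feed = {x: train_x[i:i + batch_size], y: train_y[i:i + batch_size]}
                _, step = sess.run([train_op, global_step], feed_dict=feed)
                if step % evaluate_every == 0:
                    pass  # run the dev-set evaluation here, as the original loop does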

@KJGithub2021
Author

Okay, thank you for the information! I will let you know if I come across anything.

@KJGithub2021
Author

@JasonForJoy
Do the model checkpoints that are saved after each evaluation allow training to resume automatically if it disconnects at any point, or does training have to be resumed manually through code?
Thanks.

@KJGithub2021
Author

@JasonForJoy
In continuation of my previous query, can you also shed some light on how to resume model training from a saved checkpoint on Google Colab? There must have been times you had to resume training yourselves because of the long runtimes...
I would really appreciate your help. Thanks.

@KJGithub2021
Author

@JasonForJoy Can you kindly respond to this query and give some direction? I really appreciate your help.

@JasonForJoy
Owner

@KJGithub2021
Sorry, we do not have any experience with resuming model training from a saved checkpoint on Google Colab, so we cannot offer a suggestion.

@KJGithub2021
Author

Okay... but then what was the purpose of your code that saves model checkpoints?

@KJGithub2021
Author

@JasonForJoy Understood. But how did you otherwise plan to resume training from a saved checkpoint through your code?
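For what it may be worth, resuming a TF 1.x run from checkpoints of this kind is usually a matter of rebuilding the same graph and restoring the latest checkpoint before continuing the loop. The sketch below is a minimal illustration under that assumption, with a hypothetical path and a toy variable standing in for the full model; it is not taken from the IMN code.

    import tensorflow as tf  # TF 1.x API assumed, matching the repo

    checkpoint_dir = "runs/checkpoints"  # hypothetical directory; on Colab this
                                         # should live on persistent storage such
                                         # as a mounted Drive folder

    # Rebuild exactly the same graph that was used for training; a single toy
    # variable stands in for the full IMN model here.
    global_step = tf.Variable(0, trainable=False, name="global_step")
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        latest = tf.train.latest_checkpoint(checkpoint_dir)
        if latest is not None:
            saver.restore(sess, latest)  # weights and global_step come back
            print("Resumed from", latest, "at step", sess.run(global_step))
        # ...continue the normal training/evaluation loop from here.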

@KJGithub2021
Author

@JasonForJoy Can you please confirm whether you also used a batch_size of 96 for the test dataset, or 128?

@KJGithub2021
Author

Hi @JasonForJoy, can you please confirm whether reducing the batch size (because only a low-end GPU machine is available) can affect the model's performance numbers?
