Colab problem: continue previous training #396

Open
olaviinha opened this issue Jul 24, 2020 · 4 comments

olaviinha commented Jul 24, 2020

I am using Colab (w/ %tensorflow_version 1.x) to run training and Google Drive to store all the related data.

It starts training from step 0 every time (along with a bunch of warnings), despite seemingly finding and restoring a previous checkpoint correctly at the beginning.

Has anybody had any luck in continuing previous training in Colab?

Trying to restore saved checkpoints from /<logdir_root>/train/2020-07-20T11-44-41/ ...  Checkpoint found: /<logdir_root>/train/2020-07-20T11-44-41/model.ckpt-1396
  Global step was: 1396
  Restoring... Done.
WARNING:tensorflow:From train.py:289: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:`tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
files length: 4
2020-07-21 15:07:46.884203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-21 15:07:47.979702: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
step 0 - loss = 1.931, (20.117 sec/step)
Storing checkpoint to /<logdir_root>/train/2020-07-21T15-07-00 ...WARNING:tensorflow:Issue encountered when serializing variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'filter_bias' has type str, but expected one of: int, long, bool
WARNING:tensorflow:Issue encountered when serializing trainable_variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'filter_bias' has type str, but expected one of: int, long, bool
 Done.
step 1 - loss = 1.902, (0.692 sec/step)
step 2 - loss = 1.954, (0.692 sec/step)
step 3 - loss = 1.895, (0.692 sec/step)
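
For context, the restore messages above match the usual TF1 checkpoint-loading pattern; here is a minimal sketch of it (assumed, not copied from this repo's train.py):

```python
import tensorflow as tf  # TF 1.x API, as with %tensorflow_version 1.x

def load(saver, sess, logdir):
    """Restore the newest checkpoint under logdir; return its global step."""
    ckpt = tf.train.get_checkpoint_state(logdir)
    if ckpt is None:
        return None  # no checkpoint found: training starts at step 0
    saver.restore(sess, ckpt.model_checkpoint_path)
    # Recover the step from the filename, e.g. ".../model.ckpt-1396" -> 1396
    return int(ckpt.model_checkpoint_path.split('-')[-1])
```

Note that in the log the checkpoint is restored from .../2020-07-20T11-44-41/ but the new checkpoint is stored to .../2020-07-21T15-07-00: each run saves into a fresh timestamped folder, which looks related to the counter resetting.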

ileanna commented Dec 13, 2020

@olaviinha Hi, I'm having the same warning, and not only on Colab.

A quick workaround for restoring the model properly is to set the --logdir arg to the directory where your model was saved, and not use --restore_from:
--logdir=logdir/train/model_dir
That worked for me; it restores the global step and continues where it left off, in THAT folder.

However, I cannot figure out the warning! Back in April I trained with tf 1.3 and everything was OK. I suspect that between versions 1.3 and 1.15 (the one that's on Colab) there have been changes to the Saver class, so I'm looking into that.
Did you manage to resolve it?
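
FWIW, the symptom in the original log (weights restored, counter back at 0) would also be explained if train.py resets the step whenever it saves to a different directory than it restored from. A plausible reconstruction of that directory logic (hypothetical names; check train.py for the real code):

```python
import os
from datetime import datetime

def get_default_logdir(logdir_root):
    # Fresh timestamped run folder, e.g. logdir/train/2020-07-21T15-07-00
    return os.path.join(logdir_root, 'train',
                        datetime.now().strftime('%Y-%m-%dT%H-%M-%S'))

def resolve_dirs(logdir, restore_from, logdir_root='logdir'):
    """Decide where checkpoints are saved and where they are restored from."""
    if logdir is None:
        logdir = get_default_logdir(logdir_root)  # new folder on every run
    if restore_from is None:
        restore_from = logdir                     # --logdir alone: same dir
    # Saving somewhere other than where we restored from means the run
    # is treated as fresh and the step counter is reset to 0.
    is_overwritten_training = logdir != restore_from
    return logdir, restore_from, is_overwritten_training
```

If that is what happens, it would be exactly why pointing --logdir at the old folder (so the save dir equals the restore dir) keeps the global step, while --restore_from alone restores the weights into a brand-new folder and restarts the counter.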

@nschmidtg

@ileanna hi! --logdir is an argument of which command?

I am doing

python train.py --data_dir=MY_PATH --logdir=/content/tensorflow-wavenet/logdir/train/2021-01-13T17-47-15/model.ckpt-200

but it won't work...


ileanna commented Jan 28, 2021

@nschmidtg hello! Since --logdir specifies the directory where training logs are stored, you need to point it at a folder.

So in your case it should be
python train.py --data_dir=MY_PATH --logdir=/content/tensorflow-wavenet/logdir/train/2021-01-13T17-47-15/
i.e. without the model.ckpt-200 file, only the folder that holds the checkpoints.

I hope it works like this!

@nschmidtg

@ileanna Thanks! It did work!
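
For later readers on Colab: one way to make this robust across sessions is to pin --logdir to a fixed folder on Google Drive instead of letting a new timestamped folder be created each run. A minimal sketch (the Drive path is just an example):

```python
# In a Colab cell: mount Drive and reuse one run folder across sessions.
from google.colab import drive
drive.mount('/content/drive')

LOGDIR = '/content/drive/MyDrive/wavenet/logdir/train/run1'  # example path

# Then launch training pointed at that fixed folder:
# !python train.py --data_dir=MY_PATH --logdir=$LOGDIR
```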
