My question relates to the "segment size" parameter. You use 8192 samples, which is about 372 ms at a 22050 Hz sampling rate. If I have computed correctly, the receptive field width is roughly 300 ms (?) in the v1/v2 configurations. That would mean that during training, most of the generated audio is affected by padding in the convolutions. How did you choose the "segment size" parameter? Is there a trade-off between reducing the effect of padding and achieving enough speaker variability within each batch in multi-speaker training? Or does the padding even act as a regularizer?
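To make the numbers above concrete, here is a small sketch that converts the segment size to milliseconds and estimates the receptive field of a stack of dilated 1-D convolutions. The kernel sizes and dilations in the example stack are hypothetical placeholders for illustration, not the actual v1/v2 configuration, and the estimate ignores upsampling strides (which rescale the receptive field in output samples):

```python
SAMPLE_RATE = 22050
SEGMENT_SIZE = 8192

# Segment duration in milliseconds: 8192 / 22050 Hz ~= 372 ms
segment_ms = 1000 * SEGMENT_SIZE / SAMPLE_RATE
print(f"segment: {segment_ms:.1f} ms")

def receptive_field(layers):
    """Receptive field in samples of a stack of dilated convolutions.

    Each layer is given as (kernel_size, dilation); every layer widens
    the receptive field by (kernel_size - 1) * dilation samples.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Hypothetical stack: three ResBlock-style groups of dilated convs.
layers = [(3, 1), (3, 3), (3, 5)] * 3
rf = receptive_field(layers)
print(f"receptive field: {rf} samples "
      f"({1000 * rf / SAMPLE_RATE:.1f} ms)")
```

With padded ("same") convolutions, roughly the outer half of the receptive field at each segment edge sees padding instead of real audio, which is why a receptive field close to the segment length matters here.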
Regards