Audio quality improvements #21
Hi, awesome contribution to the TTS community :) I am wondering, did you manage to train a model with higher audio quality than the pretrained checkpoint provided with this repo? The audio samples seem to have lower quality than the ones presented in the paper. Any ideas what might be missing?
I am now training the model from scratch and the audio samples are still very noisy (approx. 12 hours on 2 GPUs, batch size 128). It is getting better, but I am curious about the upper bound on quality achievable with the provided source code.
@janvainer Hey, thanks, man. Yeah, the samples are of a bit lower quality than the ones presented on the paper's demo page. However, the authors used their own proprietary dataset for training, where the female speaker had a much lower pitch than Linda (it is always hard to train on LJ). I also noticed that the fewer iterations you use, the less accurately the model reconstructs the higher frequencies. But there might also be some issues in the diffusion calculations. I would suggest looking at lucidrains' code and reusing the forward and backward DDPM calculations with the improved cosine schedules (maybe this can help): https://github.com/lucidrains/denoising-diffusion-pytorch. His repo follows the paper https://arxiv.org/pdf/2102.09672.pdf. I am going to return to this WaveGrad repo and finally get it to its best quality once all my other projects are finished, but that will probably be delayed until summer. Also, you can check Mozilla's TTS library; I remember some people there were interested in WaveGrad and even added it to their codebase: https://github.com/mozilla/TTS. Hope it helps.
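For reference, a minimal sketch of the cosine noise schedule and the closed-form forward diffusion step from that paper (arXiv:2102.09672), in the spirit of lucidrains' implementation. Function names, shapes, and defaults here are illustrative assumptions, not code from this repo:

```python
import math
import torch


def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule from 'Improved Denoising Diffusion Probabilistic Models'.
    Returns per-step betas of shape (timesteps,)."""
    t = torch.linspace(0, timesteps, timesteps + 1, dtype=torch.float64) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1.0 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(max=0.999).float()


def q_sample(x0, t, alphas_cumprod, noise=None):
    """Forward diffusion q(x_t | x_0) in closed form:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    # Broadcast the per-sample a_bar_t over the remaining dimensions of x0.
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```

Here `alphas_cumprod` would be `torch.cumprod(1 - cosine_beta_schedule(T), dim=0)` and `t` a batch of integer timesteps.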
Thanks for the swift response :) I will check the diffusion calculations. I also tried the Mozilla version, but the quality of the synthesized audio seemed a bit lower to me, at least for the WaveGrad vocoder combined with Tacotron 2; there is this weird high-frequency noise. On a side note, I am getting an increasing L1 test batch loss while the L1 test spectrogram batch loss is going down. Did you experience the same behavior?
@janvainer Yes, actually, I remember from my experiments that the waveform loss was not representative at all; the spectral loss was more informative. I think such behavior is okay, don't pay attention to it.
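For monitoring, a log-magnitude STFT L1 of the following kind is the sort of spectral metric meant here. This is a generic sketch with assumed FFT parameters, not the loss actually used in this repo:

```python
import torch


def spectral_l1(pred, target, n_fft=1024, hop_length=256, eps=1e-7):
    """L1 distance between log-magnitude STFTs of predicted and target audio.
    pred/target: (batch, samples). Sketch for monitoring only; the FFT size,
    hop length, and eps are assumptions."""
    window = torch.hann_window(n_fft, device=pred.device)

    def log_mag(x):
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return torch.log(spec.abs() + eps)

    return (log_mag(pred) - log_mag(target)).abs().mean()
```

A metric like this tends to track perceived quality more closely than a raw waveform L1, which can drift upward even as the audio improves.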
Ok, thanks! :)
Hello, @janvainer! I just trained the model and the audio samples are still very noisy (approx. 12 hours, 25K epochs on a single GPU, batch size 96). Could you show me your training results? And when do the samples start to sound good? Thanks!
Hi, unfortunately I do not have the results with me anymore. But I remember training on 4 GPUs for several days.