Why do we need to fine-tune on Tacotron output? #147

Open
JiachuanDENG opened this issue Aug 1, 2023 · 1 comment

Comments

@JiachuanDENG

May I ask why we need to fine-tune on Tacotron output? Given that we can get the ground-truth mel-spectrogram from the original waveform audio, why bother trying to learn to act like Tacotron? Can anyone give me an intuitive explanation?

@jasonlilley

@JiachuanDENG See section 4.4 of their paper. When they ran the original HiFi-GAN model (not fine-tuned) on the output of Tacotron2, the quality was good but not good enough. When they analyzed the errors, they concluded that most of them came from Tacotron2, not the vocoder. So the idea of fine-tuning on the output of the front-end is that the vocoder learns to correct the front-end's errors. If you train only on the ground truth, the vocoder may never see, and so never learn to correct, those errors. Of course, if you intend to synthesize with a different front-end, you should fine-tune on the output of that front-end, not Tacotron.

Personally, I would have liked to see an experiment where they fine-tuned only on the ground truth of the target speaker, as you suggested, and compared the output to the experiment they ran. But I trust that their conclusion is correct. I'm going to run my own experiments this week and see what happens (using FS2, not Tacotron). A minimal sketch of the data-pairing idea is below.
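To make the recipe concrete, here is a minimal sketch of the fine-tuning setup described above, not the repo's actual training script. The `FrontEnd` and `Vocoder` modules are tiny stand-ins I made up for illustration; in practice you would load a trained Tacotron2 (or FS2) and the HiFi-GAN generator. The key point is in `build_finetune_pair()`: the vocoder's training input is the front-end's *predicted* mel (teacher-forced, so it stays frame-aligned with the audio), while the target is still the real waveform. Real HiFi-GAN fine-tuning uses GAN, feature-matching, and mel-spectrogram losses; a plain L1 waveform loss stands in for them here.

```python
# Hedged sketch: stand-in modules, not the actual Tacotron2/HiFi-GAN models.
import torch
import torch.nn as nn

N_MELS, HOP = 80, 256

class FrontEnd(nn.Module):
    """Stand-in for Tacotron2 / FastSpeech2 (hypothetical, for illustration)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv1d(N_MELS, N_MELS, 3, padding=1)
    def forward(self, gt_mel):
        # Teacher-forced prediction: same length as gt_mel, but carrying the
        # front-end's characteristic errors (over-smoothing, noise, etc.).
        return self.proj(gt_mel)

class Vocoder(nn.Module):
    """Stand-in for the HiFi-GAN generator (upsamples mel frames to audio)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(N_MELS, 32, HOP * 2, stride=HOP, padding=HOP // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(32, 1, 7, padding=3),
            nn.Tanh())
    def forward(self, mel):
        return self.net(mel)

@torch.no_grad()
def build_finetune_pair(front_end, gt_mel, gt_wav):
    """Fine-tuning pair: input = front-end's predicted mel, target = real audio."""
    pred_mel = front_end(gt_mel)  # teacher forcing keeps frames aligned with gt_wav
    return pred_mel, gt_wav

front_end, vocoder = FrontEnd(), Vocoder()
opt = torch.optim.AdamW(vocoder.parameters(), lr=2e-4)

# Dummy batch: (batch, n_mels, frames) mels and the matching waveforms.
gt_mel = torch.randn(4, N_MELS, 64)
gt_wav = torch.randn(4, 1, 64 * HOP)

for step in range(3):
    pred_mel, target = build_finetune_pair(front_end, gt_mel, gt_wav)
    fake_wav = vocoder(pred_mel)
    # L1 waveform loss as a placeholder for the full GAN + mel losses.
    loss = torch.nn.functional.l1_loss(fake_wav, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```

The same pairing works for any front-end: swap `FrontEnd` for FS2 and the vocoder will learn to correct that model's errors instead of Tacotron2's.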
