Why do we need to fine-tune on Tacotron output? #147

Open
JiachuanDENG opened this issue Aug 1, 2023 · 1 comment

Comments

@JiachuanDENG

May I ask why we need to fine-tune on Tacotron output? Given that we can get the ground-truth mel-spectrogram from the original waveform audio, why bother trying to learn to act like Tacotron? Can anyone give me an intuitive explanation?

@jasonlilley

@JiachuanDENG See section 4.4 of their paper. When they ran the original HiFi-GAN model (not fine-tuned) on the output of Tacotron2, the quality was good but not good enough. When they analyzed the errors, they concluded that most of them came from Tacotron2, not the vocoder. So the idea of fine-tuning on the output of the front-end is that the vocoder learns to correct the front-end's errors. If you train only on the ground truth, the vocoder may never see, and so never learn to correct, those errors. Of course, if you intend to synthesize with a different front-end, you should fine-tune on the output of that front-end, not Tacotron.

Personally, I would have liked to see an experiment where they fine-tuned only on the ground truth of the target speaker, as you suggested, and compared the output to the experiment they ran. But I trust that their conclusion is correct. I'm going to run my own experiments this week and see what happens (using FS2, not Tacotron). A minimal sketch of the data-pairing idea is below.
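To make the recipe concrete, here is a minimal sketch of the fine-tuning setup described above, not the repo's actual training script. The `FrontEnd` and `Vocoder` modules are tiny stand-ins I made up for illustration; in practice you would load a trained Tacotron2 (or FS2) and the HiFi-GAN generator. The key point is in `build_finetune_pair()`: the vocoder's training input is the front-end's *predicted* mel (teacher-forced, so it stays frame-aligned with the audio), while the target is still the real waveform. Real HiFi-GAN fine-tuning uses GAN, feature-matching, and mel-spectrogram losses; a plain L1 waveform loss stands in for them here.

```python
# Hedged sketch: stand-in modules, not the actual Tacotron2/HiFi-GAN models.
import torch
import torch.nn as nn

N_MELS, HOP = 80, 256

class FrontEnd(nn.Module):
    """Stand-in for Tacotron2 / FastSpeech2 (hypothetical, for illustration)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv1d(N_MELS, N_MELS, 3, padding=1)
    def forward(self, gt_mel):
        # Teacher-forced prediction: same length as gt_mel, but carrying the
        # front-end's characteristic errors (over-smoothing, noise, etc.).
        return self.proj(gt_mel)

class Vocoder(nn.Module):
    """Stand-in for the HiFi-GAN generator (upsamples mel frames to audio)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(N_MELS, 32, HOP * 2, stride=HOP, padding=HOP // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(32, 1, 7, padding=3),
            nn.Tanh())
    def forward(self, mel):
        return self.net(mel)

@torch.no_grad()
def build_finetune_pair(front_end, gt_mel, gt_wav):
    """Fine-tuning pair: input = front-end's predicted mel, target = real audio."""
    pred_mel = front_end(gt_mel)  # teacher forcing keeps frames aligned with gt_wav
    return pred_mel, gt_wav

front_end, vocoder = FrontEnd(), Vocoder()
opt = torch.optim.AdamW(vocoder.parameters(), lr=2e-4)

# Dummy batch: (batch, n_mels, frames) mels and the matching waveforms.
gt_mel = torch.randn(4, N_MELS, 64)
gt_wav = torch.randn(4, 1, 64 * HOP)

for step in range(3):
    pred_mel, target = build_finetune_pair(front_end, gt_mel, gt_wav)
    fake_wav = vocoder(pred_mel)
    # L1 waveform loss as a placeholder for the full GAN + mel losses.
    loss = torch.nn.functional.l1_loss(fake_wav, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```

The same pairing works for any front-end: swap `FrontEnd` for FS2 and the vocoder will learn to correct that model's errors instead of Tacotron2's.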
