Hi team
First of all, great job with MetaVoice. Everything in the repository works as expected.
I went through the code to understand the four-stage inference and relate it to the documentation, and I have a few questions about the choice of architecture/models. Please excuse me if these questions are naïve, as I am new to speech synthesis.
You mention that you use GPT for the first stage. Is that GPT pretrained from scratch, or is it fine-tuned on top of a publicly available GPT checkpoint? The same question applies to the second-stage model.
Why can't all 8 hierarchies of EnCodec tokens be predicted together? Why do we need a second-stage model at all?
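
For context, here is how I am picturing the token structure. This is a minimal sketch using Meta's public `encodec` package rather than MetaVoice's own code, and the 24 kHz / 6 kbps setting (which yields 8 codebooks) is my assumption about what the "8 hierarchies" refer to:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Public 24 kHz EnCodec model; at 6 kbps the residual VQ uses 8 codebooks,
# i.e. 8 token "hierarchies" per audio frame (assumption on my side).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")  # any short clip
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)  # list of (codes, scale) per chunk

codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)  # [batch, n_codebooks, n_frames] -> n_codebooks == 8
```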
In the third stage, why is MBD (multi-band diffusion) used when EnCodec's own decoder can already turn those EnCodec tokens back into a waveform? Did MBD turn out to be better in your experiments?
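
To make the question concrete, this is the comparison I have in mind. Again just a sketch: EnCodec decodes its own tokens directly, while for MBD I'm using AudioCraft's `MultiBandDiffusion` as a stand-in, and I'm assuming the 24 kHz / 6 kbps variants line up with what MetaVoice uses:

```python
import torch
from audiocraft.models import MultiBandDiffusion

# Option A: EnCodec's built-in decoder (reusing `model` and `encoded_frames`
# from the snippet above).
with torch.no_grad():
    wav_encodec = model.decode(encoded_frames)  # [batch, channels, samples]

# Option B: multi-band diffusion decoding the same discrete tokens.
mbd = MultiBandDiffusion.get_mbd_24khz(bw=6.0)
with torch.no_grad():
    wav_mbd = mbd.tokens_to_wav(codes)  # codes: [batch, n_codebooks, n_frames]
```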
Thanks for the amazing work.