Hi team
First of all, great job with MetaVoice. Everything in the repository works as expected.
I went through the code to understand the four-stage inference and relate it to the documentation, and I have a few questions about the choice of architecture/models. Please excuse me if these questions are naïve, as I am new to speech synthesis.
You mention that you use GPT for the first stage. Is that GPT pretrained from scratch, or is it fine-tuned on top of a publicly available GPT checkpoint? The same question applies to the second-stage model.
Why can't all 8 hierarchies of EnCodec tokens be predicted together? Why do we need a second-stage model at all?
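
For context, here is how I am picturing the token structure. This is a minimal sketch using Meta's public `encodec` package rather than MetaVoice's own code, and the 24 kHz / 6 kbps setting (which yields 8 codebooks) is my assumption about what the "8 hierarchies" refer to:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Public 24 kHz EnCodec model; at 6 kbps the residual VQ uses 8 codebooks,
# i.e. 8 token "hierarchies" per audio frame (assumption on my side).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")  # any short clip
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)  # list of (codes, scale) per chunk

codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)  # [batch, n_codebooks, n_frames] -> n_codebooks == 8
```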
In the third stage, why is MBD (multi-band diffusion) used when EnCodec's own decoder can already turn those EnCodec tokens back into a waveform? Did MBD turn out to be better in your experiments?
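
To make the question concrete, this is the comparison I have in mind. Again just a sketch: EnCodec decodes its own tokens directly, while for MBD I'm using AudioCraft's `MultiBandDiffusion` as a stand-in, and I'm assuming the 24 kHz / 6 kbps variants line up with what MetaVoice uses:

```python
import torch
from audiocraft.models import MultiBandDiffusion

# Option A: EnCodec's built-in decoder (reusing `model` and `encoded_frames`
# from the snippet above).
with torch.no_grad():
    wav_encodec = model.decode(encoded_frames)  # [batch, channels, samples]

# Option B: multi-band diffusion decoding the same discrete tokens.
mbd = MultiBandDiffusion.get_mbd_24khz(bw=6.0)
with torch.no_grad():
    wav_mbd = mbd.tokens_to_wav(codes)  # codes: [batch, n_codebooks, n_frames]
```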
Thanks for the amazing work.