HF Gradio demo: sudden gender flip for slider #202

Open
Pendrokar opened this issue Oct 24, 2024 · 3 comments

Comments

@Pendrokar

I've added Toucan to the TTS Arena fork by using the MassivelyMultilingualTTS space.
Arena: https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
TTS Space: https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS

After some time, the "Gender of artificial Voice" slider values flip. I had assumed that -10 always means the lowest average pitch and +10 the highest, which would make it a male-to-female slider in that order. Yet it sometimes flips into reverse.

Is something in the model reconfiguring?

Right now, a positive value means male gender on the space.

@Flux9665
Collaborator

Hi, thanks for including Toucan in the Arena!

The gender slider is not related to pitch. It specifies a rotation around a principal component axis in the latent space of the speaker embedding generator.

If no voice reference is given, the system uses an artificial speaker embedding that is not linked to any real human, but is instead generated by a GAN that learned to match the distribution of speaker embeddings. This generation process can be manipulated by the rotation. The direction of the rotation is not always the same, since a generated artificial speaker embedding might be flipped upside-down through a rotation on another axis. So the slider does not have a static direction: we can never know whether masculine or feminine lies to the left or to the right. It is different for every speaker embedding, and a new set of speaker embeddings is generated with every restart of the space. So every day there are new voices.
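To illustrate why the direction is not fixed, here is a toy sketch in plain Python. It is not Toucan's actual code: the 3-D vectors, the choice of the z axis as the "slider axis", the y component as the "perceived attribute", and the function names are all assumptions made up for this illustration. It only shows that the same slider value can push an attribute in opposite directions once the embedding has been flipped by a rotation about a different axis.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3-D vector around the z axis by `angle` radians."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rotate_y_180(v):
    """Flip a 3-D vector by a 180 degree rotation around the y axis."""
    x, y, z = v
    return (-x, y, -z)

def slider_effect(embedding, slider, max_angle=math.pi / 4):
    """Apply the slider (-10..10) as a rotation around z and return the
    y component, which stands in for the perceived attribute in this toy."""
    angle = slider / 10 * max_angle
    return rotate_z(embedding, angle)[1]

# Hypothetical embedding and its flipped counterpart.
base = (1.0, 0.0, 0.5)
flipped = rotate_y_180(base)

for s in (-10, 10):
    print(s, round(slider_effect(base, s), 3), round(slider_effect(flipped, s), 3))
```

In this toy, the sign of the attribute for a given slider value is opposite for `base` and `flipped`, which mirrors the behaviour reported in this issue.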

For the arena, it's probably a good idea to keep the speaker always the same, right? I can make the random seed static, then we always have the same voices. Or, since the arena only supports English, I can make a separate space from which you can use the API that uses the real default embedding and not a generated artificial one.
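As a rough sketch of what a static seed buys us (again a toy in plain Python, not the space's real generator; `generate_speaker_embeddings`, `STATIC_SEED`, and the Gaussian sampling are hypothetical stand-ins for the GAN), a fixed seed yields the same set of embeddings on every restart, while an unseeded generator does not:

```python
import random

def generate_speaker_embeddings(n_voices, dim, seed=None):
    """Toy stand-in for the embedding generator: draws Gaussian vectors.
    With seed=None the generator is seeded from OS entropy, so every
    'restart' produces a different set of voices."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_voices)]

STATIC_SEED = 1234  # arbitrary value chosen for this illustration

# Two simulated restarts with the same seed give identical voices.
first = generate_speaker_embeddings(3, 4, seed=STATIC_SEED)
second = generate_speaker_embeddings(3, 4, seed=STATIC_SEED)
print(first == second)
```

With the seed fixed, the slider direction for a given voice would also stay the same across restarts, since the underlying embedding no longer changes.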

@Pendrokar
Author

Pendrokar commented Oct 25, 2024

Ok, so I am not going crazy.

Also, cloning never works for me. It still seems to take the generated artificial speaker.

I am thinking of using multiple voices and languages for the arena in the future. But for now it is a single female American-English voice.

So I would still need a more deterministic outcome.

[edit]
This matters because Toucan is being rejected even in favor of the lowest-ranked models, such as OpenVoice2 and WhisperSpeech:
https://huggingface.co/datasets/Pendrokar/TTS_Arena/viewer/default/train?f[rejected][value]=%27Flux9665/MassivelyMultilingualTTS%27

@Flux9665
Collaborator

Flux9665 commented Oct 25, 2024

I made a space that you can use for this. It features just a female American English voice, and the inputs are greatly simplified: just the text and nothing else.

https://huggingface.co/spaces/Flux9665/EnglishToucan

Without the artificial speaker embeddings, I expect much better and much more consistent results that more accurately reflect what the model is capable of.
