
Some common questions about our model. #5

Open
superFilicos opened this issue Sep 2, 2024 · 2 comments

Comments

@superFilicos

Q: "I have four questions I would like to confirm or discuss:

  1. Does this model support streaming TTS? I only saw streaming audio tokens mentioned, so is this codec (SNAC) streaming-capable? I recall that EnCodec supports streaming, but the paper doesn't state this explicitly.
  2. From the paper, it appears that understanding tasks such as speech recognition use continuous representations, while generation tasks such as speech synthesis use discrete representations. That seems reasonable to me.
  3. Why in Figure 2 are the text sequence and the audio token sequence the same length? Shouldn't the audio token sequence be much longer than the text sequence? Section 3 says: 'Prior to generating audio tokens, padding with N tokens ensures that the corresponding text tokens are produced first.' Is this alignment at the character level or the sentence level?
  4. Figure 3 seems to have an issue: shouldn't there be a task corresponding to TTS? I don't see the input text for TTS in the model's input layer."
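The N-token padding quoted in point 3 can be sketched as follows. This is only an illustration, not the authors' code: the `PAD` placeholder id and the per-layer offsets are assumptions, chosen to match the stated idea that text tokens are produced before the corresponding audio tokens.

```python
# Hypothetical placeholder id for padding positions (an assumption;
# the paper does not specify the actual pad token value).
PAD = -1

def pad_audio_stream(audio_tokens, n_pad):
    """Front-pad an audio token stream so text tokens come out first."""
    return [PAD] * n_pad + list(audio_tokens)

def delay_layers(layers, base_pad):
    """Apply offsets incrementing by 1 per codebook layer,
    producing a staggered (delayed) audio token pattern."""
    return [pad_audio_stream(layer, base_pad + i)
            for i, layer in enumerate(layers)]
```

With several SNAC codebook layers, incrementing the offset by one per layer yields a delayed pattern in which each layer starts one step after the previous one.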


@superFilicos
Author

A:

  1. We initially tested this and found that SNAC tokens can be decoded in pairs, producing smooth audio without any lag. The more pairs decoded per chunk, the better.
  2. That's right, though in practice we embed them together: the text carries a pad sequence equal in length to the audio, so we felt it was better to represent it as discrete.
  3. No, they're not the same length; there is an ellipsis in the figure. The 'N' is configurable; we set it as a sequence incrementing by 1. The alignment is actually at the character level.
  4. There is no separate TTS task. Figure 3 only illustrates the token sequence, and the loss is computed over this sequence.
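Point 1 (decoding audio tokens in pairs for streaming playback) can be sketched roughly like this. This is a minimal sketch, not the project's code: `decode_frames` is a hypothetical stand-in for a real codec decoder, and the chunk size of 2 mirrors the "decoded in pairs" remark above.

```python
def stream_decode(token_frames, decode_frames, chunk_frames=2):
    """Yield decoded audio chunks as soon as `chunk_frames` token
    frames are available, instead of waiting for the full sequence.

    `decode_frames` is a hypothetical callable mapping a list of
    token frames to PCM audio (an assumption for illustration).
    """
    buf = []
    for frame in token_frames:
        buf.append(frame)
        if len(buf) == chunk_frames:
            yield decode_frames(buf)   # emit a chunk immediately
            buf = []
    if buf:                            # flush any trailing frames
        yield decode_frames(buf)
```

A larger `chunk_frames` trades first-chunk latency for smoother audio, which matches the observation that decoding more pairs at a time works better.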
