Q: "I have four questions that I would like to confirm or discuss:
--- Does this model have the capability for streaming TTS? I only saw streaming audio tokens mentioned, so is this Encodec (SNAC) streaming capable? I remember Encodec supports streaming, but the paper doesn't explicitly state this.
--- From the paper, it seems that tasks like speech recognition (understanding) use continuous representations, while tasks like speech synthesis (generation) use discrete representations. I think this is reasonable.
--- Why in Figure 2 are the text sequence length and audio token sequence length the same? Shouldn't the audio tokens be much longer than the text sequence? In Section 3, there's a sentence: 'Prior to generating audio tokens, padding with N tokens ensures that the corresponding text tokens are produced first.' Is this alignment at the character level or the sentence level?
--- Figure 3 seems to have an issue; there should be a task corresponding to TTS, right? But I don't see the input text for TTS in the model's input layer."
We tested this initially and found that SNAC tokens can be decoded in pairs, producing smooth audio without any lag; the more pairs decoded at once, the better.
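The pair-wise decoding described above can be sketched as a simple chunked loop. This is a minimal sketch, not the repository's actual code: `fake_decode` is a stand-in assumption for the real SNAC decoder, and `pair_size` is an illustrative parameter.

```python
# Hedged sketch of streaming playback by decoding SNAC tokens in pairs.
# `fake_decode` is a placeholder for a real SNAC decoder (assumption:
# here each code expands to 2 audio samples, purely for illustration).

def fake_decode(codes):
    """Stand-in decoder: each code becomes two float samples."""
    return [c * 0.1 for c in codes for _ in range(2)]

def stream_decode(codes, pair_size=2):
    """Decode audio tokens in small windows (pairs) as they arrive,
    yielding audio chunks that can be played back immediately."""
    for i in range(0, len(codes), pair_size):
        window = codes[i:i + pair_size]
        yield fake_decode(window)

chunks = list(stream_decode([1, 2, 3, 4, 5], pair_size=2))
# 3 chunks: [1,2], [3,4], [5] decoded independently.
```

Because each window is decoded as soon as its codes arrive, playback latency is bounded by the window size rather than the full utterance length, which matches the "decoded in pairs, smooth audio" observation above.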
That's right, but we actually embed them together: the text has a pad sequence equal in length to the audio, so I thought it would be better to represent it as discrete.
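"Embedding them together" can be pictured as two lookup tables whose outputs are summed position-wise into one input sequence. This is a hedged sketch, assuming separate text and audio embedding tables and a position-wise sum; the vocabulary sizes and dimension are illustrative, not the model's actual configuration.

```python
# Sketch: discrete audio tokens and pad-extended text tokens are looked up
# in separate embedding tables and summed into one fused input sequence.
# Vocab sizes and d_model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
text_emb = rng.normal(size=(100, d_model))   # text vocabulary (PAD = 0)
audio_emb = rng.normal(size=(200, d_model))  # audio-token vocabulary

def embed_together(text_ids, audio_ids):
    """Position-wise sum of text and audio embeddings; the text sequence
    is padded out so both sequences have equal length."""
    assert len(text_ids) == len(audio_ids)
    return text_emb[text_ids] + audio_emb[audio_ids]

x = embed_together([5, 3, 0, 0], [11, 12, 13, 14])
# x has one fused vector per position: shape (4, d_model).
```

Treating the padded text as discrete tokens keeps both streams in the same lookup-and-sum pipeline, which is why equal-length sequences are convenient here.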
No, they're not the same length; Figure 2 contains an ellipsis. The 'N' is configurable; we set it as a sequence incrementing by 1. The alignment is actually at the character level.
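The "padding with N tokens, incrementing by 1" idea can be sketched as right-shifting each audio layer by a growing offset so the corresponding text tokens are produced first. This is a minimal illustration under assumptions: `PAD = 0`, a base delay of 1, and two codebook layers are all placeholders, not the paper's exact values.

```python
# Hedged sketch of the Section 3 padding scheme: layer k of audio tokens is
# prefixed with (base_delay + k) pad tokens, i.e. delays 1, 2, 3, ...
# PAD value and layer count are illustrative assumptions.

PAD = 0

def delay_layers(audio_layers, base_delay=1):
    """Right-shift layer k by (base_delay + k) pads, then pad all layers
    on the right so they share one sequence length."""
    shifted = []
    for k, layer in enumerate(audio_layers):
        shifted.append([PAD] * (base_delay + k) + list(layer))
    total = max(len(s) for s in shifted)
    return [s + [PAD] * (total - len(s)) for s in shifted]

layers = delay_layers([[7, 8, 9], [4, 5, 6]])
# layer 0: [0, 7, 8, 9, 0]   (delayed by 1)
# layer 1: [0, 0, 4, 5, 6]   (delayed by 2)
```

With these staggered delays, the positions before the first real audio token are free for text tokens, so the text comes out ahead of the audio it corresponds to.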
There is no TTS task. Figure 3 only illustrates the token sequence, and the loss is calculated over this sequence.