
Some common questions about our model. #5

Open
superFilicos opened this issue Sep 2, 2024 · 2 comments

Comments

@superFilicos

Q: "I have four questions I would like to confirm or discuss:

  1. Does this model support streaming TTS? I only saw streaming audio tokens mentioned, so is this codec (SNAC) streaming-capable? I recall that EnCodec supports streaming, but the paper doesn't state this explicitly.
  2. From the paper, it appears that understanding tasks such as speech recognition use continuous representations, while generation tasks such as speech synthesis use discrete representations. That seems reasonable to me.
  3. Why in Figure 2 are the text sequence and the audio token sequence the same length? Shouldn't the audio token sequence be much longer than the text sequence? Section 3 says: 'Prior to generating audio tokens, padding with N tokens ensures that the corresponding text tokens are produced first.' Is this alignment at the character level or the sentence level?
  4. Figure 3 seems to have an issue: shouldn't there be a task corresponding to TTS? I don't see the input text for TTS in the model's input layer."
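The N-token padding quoted in point 3 can be sketched as follows. This is only an illustration, not the authors' code: the `PAD` placeholder id and the per-layer offsets are assumptions, chosen to match the stated idea that text tokens are produced before the corresponding audio tokens.

```python
# Hypothetical placeholder id for padding positions (an assumption;
# the paper does not specify the actual pad token value).
PAD = -1

def pad_audio_stream(audio_tokens, n_pad):
    """Front-pad an audio token stream so text tokens come out first."""
    return [PAD] * n_pad + list(audio_tokens)

def delay_layers(layers, base_pad):
    """Apply offsets incrementing by 1 per codebook layer,
    producing a staggered (delayed) audio token pattern."""
    return [pad_audio_stream(layer, base_pad + i)
            for i, layer in enumerate(layers)]
```

With several SNAC codebook layers, incrementing the offset by one per layer yields a delayed pattern in which each layer starts one step after the previous one.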


@superFilicos
Author

A:

  1. We initially tested this and found that SNAC tokens can be decoded in pairs, producing smooth audio without any lag. The more pairs decoded per chunk, the better.
  2. That's right, though in practice we embed them together: the text carries a pad sequence equal in length to the audio, so we felt it was better to represent it as discrete.
  3. No, they're not the same length; there is an ellipsis in the figure. The 'N' is configurable; we set it as a sequence incrementing by 1. The alignment is actually at the character level.
  4. There is no separate TTS task. Figure 3 only illustrates the token sequence, and the loss is computed over this sequence.
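Point 1 (decoding audio tokens in pairs for streaming playback) can be sketched roughly like this. This is a minimal sketch, not the project's code: `decode_frames` is a hypothetical stand-in for a real codec decoder, and the chunk size of 2 mirrors the "decoded in pairs" remark above.

```python
def stream_decode(token_frames, decode_frames, chunk_frames=2):
    """Yield decoded audio chunks as soon as `chunk_frames` token
    frames are available, instead of waiting for the full sequence.

    `decode_frames` is a hypothetical callable mapping a list of
    token frames to PCM audio (an assumption for illustration).
    """
    buf = []
    for frame in token_frames:
        buf.append(frame)
        if len(buf) == chunk_frames:
            yield decode_frames(buf)   # emit a chunk immediately
            buf = []
    if buf:                            # flush any trailing frames
        yield decode_frames(buf)
```

A larger `chunk_frames` trades first-chunk latency for smoother audio, which matches the observation that decoding more pairs at a time works better.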
