Author: Songting.
This is the official codebase for running open-sourced VALL-E X.
The following is additional information about the models released here.
VALL-E X is a series of two transformer models that turn text into audio.
- Input: IPAs converted from input text by a rule-based G2P tool.
- Output: tokens from the first codebook of the EnCodec Codec from facebook
- Input: IPAs converted from input text by a rule-based G2P tool & the first codebook from EnCodec
- Output: 8 codebooks from EnCodec
Model | Parameters | Attention | Output Vocab size |
---|---|---|---|
G2P tool | - | - | 69 |
Phoneme to coarse tokens | 150 M | Causal | 1x 1,024 |
Coarse to fine tokens | 150 M | Non-causal | 7x 1,024 |
August 2023
We anticipate that this model's text to audio capabilities can be used to improve accessbility tools in a variety of languages. Straightforward improvements will allow models to run faster than realtime, rendering them useful for applications such as virtual assistants.