Model Card: VALL-E X

Author: Songting.

This is the official codebase for running open-sourced VALL-E X.

The following is additional information about the models released here.

Model Details

VALL-E X is a series of two transformer models that turn text into audio.

Phoneme to acoustic tokens

Input: IPAs converted from input text by a rule-based G2P tool.
Output: tokens from the first codebook of the EnCodec Codec from facebook

Coarse to fine tokens

Input: IPAs converted from input text by a rule-based G2P tool & the first codebook from EnCodec
Output: 8 codebooks from EnCodec

Architecture

Model	Parameters	Attention	Output Vocab size
G2P tool	-	-	69
Phoneme to coarse tokens	150 M	Causal	1x 1,024
Coarse to fine tokens	150 M	Non-causal	7x 1,024

Release date

August 2023

Broader Implications

We anticipate that this model's text to audio capabilities can be used to improve accessbility tools in a variety of languages. Straightforward improvements will allow models to run faster than realtime, rendering them useful for applications such as virtual assistants.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

model-card.md

model-card.md

Model Card: VALL-E X

Model Details

Phoneme to acoustic tokens

Coarse to fine tokens

Architecture

Release date

Broader Implications

Files

model-card.md

Latest commit

History

model-card.md

File metadata and controls

Model Card: VALL-E X

Model Details

Phoneme to acoustic tokens

Coarse to fine tokens

Architecture

Release date

Broader Implications