modelscope/3D-Speaker: A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are available on ModelScope. We also present a large-scale speech corpus, likewise named 3D-Speaker, to facilitate research on speech representation disentanglement.

Quickstart

Install 3D-Speaker

git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt

Running experiments

# Speaker verification: ERes2Net on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker dataset
cd egs/3dspeaker/sv-rdino/
bash run.sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio and multimodal speaker diarization
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
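
Each `cd` path above is relative to the repository root, so pasting the commands back to back would fail at the second `cd`. The following is a minimal sketch of one way to run several recipes in sequence. The helper name `run_recipe` and the `REPO_ROOT` variable are hypothetical and not part of the toolkit; the sketch only assumes that each recipe directory contains the script named above.

```bash
# Assumed to point at the 3D-Speaker checkout; defaults to the current directory.
REPO_ROOT="${REPO_ROOT:-$PWD}"

# Hypothetical helper: run one recipe script from inside its own directory.
# The function body is a subshell "( ... )", so the cd never leaks into
# the calling shell and "exit" only leaves the subshell.
run_recipe() (
  recipe_dir="$1"
  script="${2:-run.sh}"
  if [ ! -f "$REPO_ROOT/$recipe_dir/$script" ]; then
    echo "skip: $recipe_dir/$script not found" >&2
    exit 1
  fi
  cd "$REPO_ROOT/$recipe_dir" && bash "$script"
)

# Example usage (uncomment to launch the recipes):
# run_recipe egs/3dspeaker/sv-cam++
# run_recipe egs/3dspeaker/speaker-diarization run_audio.sh
```

Because the working directory is restored after every call, the recipes can be listed in any order.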

Inference using pretrained models from ModelScope

All pretrained models are released on ModelScope.

# Install modelscope
pip install modelscope
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run inference with the model_id chosen above (CAM++, ERes2Net or ERes2NetV2)
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list
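
The batch command expects `$wav_list` to be defined, but the text above does not show how to create it. As a hedged sketch, the helper below assumes the list is a plain-text file with one audio path per line. That format is an assumption, not something stated here, so check `infer_sv_batch.py` before relying on it. The name `make_wav_list` is hypothetical.

```bash
# Hypothetical helper: collect every .wav file under a directory into a
# list file, one absolute path per line, sorted for reproducibility.
# Assumption: the batch script reads a plain-text list in this format.
make_wav_list() (
  wav_dir="$1"
  list_file="$2"
  # Resolve the directory to an absolute path so the list works from anywhere.
  abs_dir="$(cd "$wav_dir" && pwd)" || exit 1
  find "$abs_dir" -type f -name '*.wav' | sort > "$list_file"
)

# Example usage:
# make_wav_list /path/to/wavs wav.list
# wav_list=wav.list
```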

# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id

# Run RDINO inference
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id --yaml egs/voxceleb/sv-rdino/conf/rdino.yaml

# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token
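
The two diarization commands differ only in the overlap flags, which take a Hugging Face access token. The wrapper below is a small sketch that adds those flags only when a token is supplied. It uses only the flags shown above; the function name `run_diarization` and the `DIAR_RUNNER` variable are assumptions of this sketch, not part of the toolkit.

```bash
# Hypothetical wrapper around the diarization command shown above.
# DIAR_RUNNER is an assumption of this sketch: set it to "echo" for a
# dry run that prints the command (including the token) without running it.
run_diarization() (
  wav="$1"          # a single wav path or a wav list, as accepted by --wav
  out_dir="$2"
  token="${3:-}"    # optional Hugging Face access token
  set -- --wav "$wav" --out_dir "$out_dir"
  if [ -n "$token" ]; then
    # Enable overlap detection only when a token is available.
    set -- "$@" --include_overlap --hf_access_token "$token"
  fi
  ${DIAR_RUNNER:-python} speakerlab/bin/infer_diarization.py "$@"
)

# Example usage, run from the repository root:
# run_diarization "$wav_path" "$out_dir"
# run_diarization "$wav_path" "$out_dir" "$hf_access_token"
```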

Overview of Content

Supervised Speaker Verification
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on 3D-Speaker.
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on VoxCeleb.
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on CN-Celeb.
Self-supervised Speaker Verification
- RDINO and SDPN training recipes on VoxCeleb.
- RDINO training recipes on 3D-Speaker.
- RDINO training recipes on CN-Celeb.
Speaker Diarization
- Speaker diarization inference recipes comprising multiple modules: overlap detection (optional), voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering.
Language Identification
- Language identification training recipes on 3D-Speaker.
3D-Speaker Dataset
- Dataset introduction and download address: 3D-Speaker
- Related paper address: 3D-Speaker

What's new 🔥

[2024.12] Update diarization recipes and add results on multiple diarization benchmarks.
[2024.8] Releasing ERes2NetV2 and ERes2NetV2_w24s4ep4 pretrained models trained on 200k-speaker datasets.
[2024.5] Releasing X-vector model on VoxCeleb datasets.
[2024.5] Releasing SDPN model training and inference recipes for VoxCeleb.
[2024.5] Releasing visual module and semantic module training recipes.
[2024.4] Releasing ONNX Runtime support and the relevant inference scripts.
[2024.4] Releasing ERes2NetV2 model with fewer parameters and faster inference speed on VoxCeleb datasets.
[2024.2] Releasing language identification recipes that integrate phonetic information for higher recognition accuracy.
[2024.2] Releasing multimodal diarization recipes that fuse audio and video image input to produce more accurate results.
[2024.1] Releasing ResNet34 and Res2Net model training and inference recipes for 3D-Speaker, VoxCeleb and CN-Celeb datasets.
[2024.1] Releasing large-margin finetune recipes in speaker verification and adding diarization recipes.
[2023.11] ERes2Net-base pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
[2023.10] Releasing ECAPA model training and inference recipes for three datasets.
[2023.9] Releasing RDINO model training and inference recipes for CN-Celeb.
[2023.8] Releasing CAM++, ERes2Net-Base and ERes2Net-Large benchmarks in CN-Celeb.
[2023.8] Releasing ERes2Net and CAM++ in language identification for Mandarin and English.
[2023.7] Releasing CAM++, ERes2Net-Base, ERes2Net-Large pretrained models trained on 3D-Speaker.
[2023.7] Releasing Dialogue Detection and Semantic Speaker Change Detection in speaker diarization.
[2023.7] Releasing CAM++ in language identification for Mandarin and English.
[2023.6] Releasing 3D-Speaker dataset and its corresponding benchmarks including ERes2Net, CAM++ and RDINO.
[2023.5] ERes2Net pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
[2023.4] CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.

Contact

If you have any comments or questions about 3D-Speaker, please contact us by email:

{chenyafeng.cyf, zsq174630, tongmu.wh, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledgements

3D-Speaker contains third-party components and code modified from open-source repositories, including:
Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg, TalkNet-ASD, Ultra-Light-Fast-Generic-Face-Detector-1MB, and pyannote.audio.

Citations

If you find this repository useful, please consider giving it a star ⭐ and citing it 🦖:

@inproceedings{chen2024eres2netv2,
  title={ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={INTERSPEECH},
  year={2024}
}
@article{chen2024sdpn,
  title={Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  url={https://arxiv.org/pdf/2308.02774},
  year={2024}
}
@article{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  url={https://arxiv.org/pdf/2403.19971},
  year={2024}
}
@inproceedings{zheng20233d,
  title={3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement},
  author={Zheng, Siqi and Cheng, Luyao and Chen, Yafeng and Wang, Hui and Chen, Qian},
  url={https://arxiv.org/pdf/2306.15354},
  year={2023}
}
@inproceedings{wang2023cam++,
  title={CAM++: A Fast and Efficient Network For Speaker Verification Using Context-Aware Masking},
  author={Wang, Hui and Zheng, Siqi and Chen, Yafeng and Cheng, Luyao and Chen, Qian},
  booktitle={INTERSPEECH},
  year={2023}
}
@inproceedings{chen2023enhanced,
  title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Qi, Jiajun},
  booktitle={INTERSPEECH},
  year={2023}
}
@inproceedings{chen2023pushing,
  title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
  booktitle={ICASSP},
  year={2023}
}
