Automated deployment: Tue Sep 3 14:44:32 UTC 2024 7ef894a

wenet-e2e · Sep 3, 2024 · 89f1574
1 parent 7acdd99
commit 89f1574

Showing 58 changed files with 821 additions and 15 deletions.
diff --git a/_modules/wespeaker/cli/utils.html b/_modules/wespeaker/cli/utils.html
@@ -126,6 +126,17 @@ <h1>Source code for wespeaker.cli.utils</h1><div class="highlight"><pre>
         <span class="n">action</span><span class="o">=</span><span class="s1">&#39;store_true&#39;</span><span class="p">,</span>
         <span class="n">help</span><span class="o">=</span><span class="s1">&#39;whether to use the damo/speech_eres2net_sv_zh-cn_16k-common model&#39;</span>
     <span class="p">)</span>
+    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span>
+        <span class="s1">&#39;--vblinkp&#39;</span><span class="p">,</span>
+        <span class="n">action</span><span class="o">=</span><span class="s1">&#39;store_true&#39;</span><span class="p">,</span>
+        <span class="n">help</span><span class="o">=</span><span class="s1">&#39;whether to use the samresnet34 model pretrained on voxblink2&#39;</span>
+    <span class="p">)</span>
+    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span>
+        <span class="s1">&#39;--vblinkf&#39;</span><span class="p">,</span>
+        <span class="n">action</span><span class="o">=</span><span class="s1">&#39;store_true&#39;</span><span class="p">,</span>
+        <span class="n">help</span><span class="o">=</span><span class="s2">&quot;whether to use the samresnet34 model pretrained on voxblink2 and &quot;</span>
+             <span class="s2">&quot;finetuned on voxceleb2&quot;</span>
+    <span class="p">)</span>
     <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;-p&#39;</span><span class="p">,</span>
                         <span class="s1">&#39;--pretrain&#39;</span><span class="p">,</span>
                         <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span>
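
The hunk above adds two boolean switches to the CLI parser. As a rough, standalone sketch of how such `store_true` flags behave (not the actual `wespeaker.cli.utils` code), the snippet below re-creates only those two arguments. `build_parser` is a hypothetical helper name, and the help text is written with a trailing space before the implicit string concatenation so that the two literals do not run together.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser holding only the two flags added in this hunk."""
    parser = argparse.ArgumentParser(description="sketch of the new model flags")
    parser.add_argument(
        "--vblinkp",
        action="store_true",
        help="whether to use the samresnet34 model pretrained on voxblink2",
    )
    parser.add_argument(
        "--vblinkf",
        action="store_true",
        # Adjacent string literals are concatenated as-is, so the first one
        # must end with a space to keep "and" and "finetuned" apart.
        help="whether to use the samresnet34 model pretrained on voxblink2 and "
             "finetuned on voxceleb2",
    )
    return parser


def flat_help(parser: argparse.ArgumentParser) -> str:
    """Return the help text with line wrapping collapsed to single spaces."""
    return " ".join(parser.format_help().split())
```

Both flags default to `False` and become `True` only when passed, for example `build_parser().parse_args(["--vblinkf"])`.
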

diff --git a/_sources/papers_using_wespeaker.md.txt b/_sources/papers_using_wespeaker.md.txt
@@ -0,0 +1,213 @@
+# Papers Implemented in WeSpeaker
+
+[TOC]
+
+## Stay Tuned! (Need to add an introduction for each paper)
+
+## Introduction
+
+After the release of the WeSpeaker project, many users from both academia and industry have actively engaged with it in their research. We appreciate all the feedback and contributions from the community and would like to highlight these interesting works.
+
+Besides citing WeSpeaker itself, we highly recommend that you read and cite the corresponding papers listed below.
+
+```bibtex
+@article{wang2024advancing,
+  title={Advancing speaker embedding learning: Wespeaker toolkit for research and production},
+  author={Wang, Shuai and Chen, Zhengyang and Han, Bing and Wang, Hongji and Liang, Chengdong and Zhang, Binbin and Xiang, Xu and Ding, Wen and Rohdin, Johan and Silnova, Anna and others},
+  journal={Speech Communication},
+  volume={162},
+  pages={103104},
+  year={2024},
+  publisher={Elsevier}
+}
+
+@inproceedings{wang2023wespeaker,
+  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
+  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
+  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={1--5},
+  year={2023},
+  organization={IEEE}
+}
+```
+
+## Architecture
+
+### TDNN
+
+```bibtex
+@inproceedings{snyder2018x,
+  title={X-vectors: Robust dnn embeddings for speaker recognition},
+  author={Snyder, David and Garcia-Romero, Daniel and Sell, Gregory and Povey, Daniel and Khudanpur, Sanjeev},
+  booktitle={2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)},
+  pages={5329--5333},
+  year={2018},
+  organization={IEEE}
+}
+```
+
+### ECAPA-TDNN
+
+```bibtex
+@article{desplanques2020ecapa,
+  title={Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification},
+  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
+  journal={arXiv preprint arXiv:2005.07143},
+  year={2020}
+}
+```
+
+### ResNet
+
+The current ResNet implementation is based on our system for VoxSRC2019. It is also the default speaker model in the pyannote.audio diarization pipeline (https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM).
+
+```bibtex
+@article{zeinali2019but,
+  title={But system description to voxceleb speaker recognition challenge 2019},
+  author={Zeinali, Hossein and Wang, Shuai and Silnova, Anna and Mat{\v{e}}jka, Pavel and Plchot, Old{\v{r}}ich},
+  journal={arXiv preprint arXiv:1910.12592},
+  year={2019}
+}
+```
+
+### ReDimNet
+
+```bibtex
+@article{yakovlev2024reshape,
+  title={Reshape Dimensions Network for Speaker Recognition},
+  author={Yakovlev, Ivan and Makarov, Rostislav and Balykin, Andrei and Malov, Pavel and Okhotnikov, Anton and Torgashov, Nikita},
+  journal={arXiv preprint arXiv:2407.18223},
+  year={2024}
+}
+```
+
+### Golden Gemini DF-ResNet
+
+```bibtex
+@article{liu2024golden,
+  title={Golden gemini is all you need: Finding the sweet spots for speaker verification},
+  author={Liu, Tianchi and Lee, Kong Aik and Wang, Qiongqiong and Li, Haizhou},
+  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+  year={2024},
+  publisher={IEEE}
+}
+```
+
+### SimAM-ResNet
+
+```bibtex
+@inproceedings{qin2022simple,
+  title={Simple attention module based speaker verification with iterative noisy label detection},
+  author={Qin, Xiaoyi and Li, Na and Weng, Chao and Su, Dan and Li, Ming},
+  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={6722--6726},
+  year={2022},
+  organization={IEEE}
+}
+```
+
+### Whisper-based Speaker Verification
+
+```bibtex
+@article{zhao2024whisperpmfapartialmultiscalefeature,
+  title={Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models},
+  author={Yiyang Zhao and Shuai Wang and Guangzhi Sun and Zehua Chen and Chao Zhang and Mingxing Xu and Thomas Fang Zheng},
+  year={2024},
+  eprint={2408.15585},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD},
+  url={https://arxiv.org/abs/2408.15585},
+}
+```
+
+### CAM++
+
+```bibtex
+@article{wang2023cam++,
+  title={Cam++: A fast and efficient network for speaker verification using context-aware masking},
+  author={Wang, Hui and Zheng, Siqi and Chen, Yafeng and Cheng, Luyao and Chen, Qian},
+  journal={arXiv preprint arXiv:2303.00332},
+  year={2023}
+}
+```
+
+### ERes2Net
+
+```bibtex
+@article{chen2023enhanced,
+  title={An enhanced res2net with local and global feature fusion for speaker verification},
+  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Qi, Jiajun},
+  journal={arXiv preprint arXiv:2305.12838},
+  year={2023}
+}
+```
+
+## Pipelines
+
+### DINO Pretraining with Large-scale Data
+
+```bibtex
+@inproceedings{wang2024leveraging,
+  title={Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition},
+  author={Wang, Shuai and Bai, Qibing and Liu, Qi and Yu, Jianwei and Chen, Zhengyang and Han, Bing and Qian, Yanmin and Li, Haizhou},
+  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={10901--10905},
+  year={2024},
+  organization={IEEE}
+}
+```
+
+## Dataset
+
+### VoxBlink
+
+```bibtex
+@inproceedings{lin2024voxblink,
+  title={Voxblink: A large scale speaker verification dataset on camera},
+  author={Lin, Yuke and Qin, Xiaoyi and Zhao, Guoqing and Cheng, Ming and Jiang, Ning and Wu, Haiying and Li, Ming},
+  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={10271--10275},
+  year={2024},
+  organization={IEEE}
+}
+
+@article{lin2024voxblink2,
+  title={VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark},
+  author={Lin, Yuke and Cheng, Ming and Zhang, Fulin and Gao, Yingying and Zhang, Shilei and Li, Ming},
+  journal={arXiv preprint arXiv:2407.11510},
+  year={2024}
+}
+```
+
+### VoxCeleb
+
+```bibtex
+@article{nagrani2017voxceleb,
+  title={Voxceleb: a large-scale speaker identification dataset},
+  author={Nagrani, Arsha and Chung, Joon Son and Zisserman, Andrew},
+  journal={arXiv preprint arXiv:1706.08612},
+  year={2017}
+}
+
+@article{chung2018voxceleb2,
+  title={Voxceleb2: Deep speaker recognition},
+  author={Chung, Joon Son and Nagrani, Arsha and Zisserman, Andrew},
+  journal={arXiv preprint arXiv:1806.05622},
+  year={2018}
+}
+```
+
+### CNCeleb
+
+```bibtex
+@inproceedings{fan2020cn,
+  title={Cn-celeb: a challenging chinese speaker recognition dataset},
+  author={Fan, Yue and Kang, JW and Li, LT and Li, KC and Chen, HL and Cheng, ST and Zhang, PY and Zhou, ZY and Cai, YQ and Wang, Dong},
+  booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={7604--7608},
+  year={2020},
+  organization={IEEE}
+}
+```
diff --git a/_sources/pretrained.md.txt b/_sources/pretrained.md.txt
@@ -54,7 +54,10 @@ The model with suffix **LM** means that it is further fine-tuned using large-mar
 | [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ECAPA1024](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_ECAPA1024_LM.onnx) |
 | [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [Gemini_DFResnet114_LM](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_gemini_dfresnet114_LM.zip)| [Gemini_DFResnet114_LM](https://wenet.org.cn/downloads?models=wespeaker&version=voxceleb_gemini_dfresnet114_LM.onnx)  |
 | [CNCeleb](../examples/cnceleb/v2/README.md)   | CN        | [ResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=cnceleb_resnet34.zip) / [ResNet34_LM](https://wenet.org.cn/downloads?models=wespeaker&version=cnceleb_resnet34_LM.zip)      | [ResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=cnceleb_resnet34.onnx) / [ResNet34_LM](https://wenet.org.cn/downloads?models=wespeaker&version=cnceleb_resnet34_LM.onnx)         |
-
+| [VoxBlink2](../examples/voxceleb/v2/README.md) | Multilingual        | [SimAMResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet34.zip)  | [SimAMResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet34.onnx)      |
+| [VoxBlink2 (pretrain) + VoxCeleb2 (finetune)](../examples/voxceleb/v2/README.md) | Multilingual        | [SimAMResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet34_ft.zip)  |[SimAMResNet34](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet34_ft.onnx)  |
+| [VoxBlink2](../examples/voxceleb/v2/README.md) | Multilingual        | [SimAMResNet100](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet100.zip)  |[SimAMResNet100](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet100.onnx)|
+| [VoxBlink2 (pretrain) + VoxCeleb2 (finetune)](../examples/voxceleb/v2/README.md) | Multilingual        | [SimAMResNet100](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet100_ft.zip)  |[SimAMResNet100](https://wenet.org.cn/downloads?models=wespeaker&version=voxblink2_samresnet100_ft.onnx) |
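
The four rows added above all follow the same link pattern: a `version` query parameter made of an archive stem plus `.zip` (checkpoint) or `.onnx` (runtime model). A minimal sketch of that pattern, assuming it holds for every row, is shown below. `model_url` and `NEW_STEMS` are illustrative names, not part of WeSpeaker.

```python
from urllib.parse import urlencode

BASE_URL = "https://wenet.org.cn/downloads"

# Archive stems taken from the four table rows added in this hunk.
NEW_STEMS = (
    "voxblink2_samresnet34",
    "voxblink2_samresnet34_ft",
    "voxblink2_samresnet100",
    "voxblink2_samresnet100_ft",
)


def model_url(stem: str, suffix: str) -> str:
    """Build a download link in the form used by the table rows.

    `suffix` is "zip" for the PyTorch checkpoint or "onnx" for the
    runtime model; anything else is rejected.
    """
    if suffix not in ("zip", "onnx"):
        raise ValueError(f"unexpected suffix: {suffix!r}")
    query = urlencode({"models": "wespeaker", "version": f"{stem}.{suffix}"})
    return f"{BASE_URL}?{query}"
```

For example, `model_url("voxblink2_samresnet34_ft", "onnx")` reproduces the runtime-model link of the second added row. The stems ending in `_ft` are the variants fine-tuned on VoxCeleb2.
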
 ### huggingface
 
 | Datasets                                      | Languages | Checkpoint (pt)                                                                                                                                                                                                                     | Runtime Model (onnx)                                                                                                                                                                                                                  |

diff --git a/_sources/python_api/wespeaker.models.rst.txt b/_sources/python_api/wespeaker.models.rst.txt
@@ -23,6 +23,7 @@ Submodules
    wespeaker.models.repvgg
    wespeaker.models.res2net
    wespeaker.models.resnet
+   wespeaker.models.samresnet
    wespeaker.models.speaker_model
    wespeaker.models.tdnn
    wespeaker.models.whisper_PMFA
diff --git a/_sources/python_api/wespeaker.models.samresnet.rst.txt b/_sources/python_api/wespeaker.models.samresnet.rst.txt
@@ -0,0 +1,7 @@
+wespeaker.models.samresnet module
+=================================
+
+.. automodule:: wespeaker.models.samresnet
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/_sources/python_package.md.txt b/_sources/python_package.md.txt
@@ -40,6 +40,8 @@ You can specify the following parameters. (use `-h` for details)
   use [`campplus_cn_common_200k` of damo](https://www.modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary)
 * `--eres2net`:
   use [`res2net_cn_common_200k` of damo](https://www.modelscope.cn/models/iic/speech_eres2net_sv_zh-cn_16k-common/summary)
+* `--vblinkp`: use the sam_resnet34 model pretrained on VoxBlink2
+* `--vblinkf`: use the sam_resnet34 model pretrained on VoxBlink2 and finetuned on VoxCeleb2
 * `--audio_file`: input audio file path
 * `--audio_file2`: input audio file2 path, specifically for the similarity task
 * `--wav_scp`: input wav.scp file in kaldi format (each line: key wav_path)
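
To illustrate how the new switches combine with the existing options, the sketch below assembles a command line for the similarity task without executing it. The `wespeaker` executable name and `--task similarity` come from the package's README rather than from this hunk, and `similarity_command` is a hypothetical helper.

```python
import shlex
from typing import List

# Model-selection flags added in this commit.
MODEL_FLAGS = ("--vblinkp", "--vblinkf")


def similarity_command(model_flag: str, audio_file: str, audio_file2: str) -> List[str]:
    """Return an argv list that compares two audio files with the chosen model."""
    if model_flag not in MODEL_FLAGS:
        raise ValueError(f"unknown model flag: {model_flag!r}")
    return [
        "wespeaker",              # assumed console-script name
        "--task", "similarity",   # assumed from the package README
        "--audio_file", audio_file,
        "--audio_file2", audio_file2,
        model_flag,
    ]


def as_shell(argv: List[str]) -> str:
    """Render the argv list as a single, safely quoted shell string."""
    return shlex.join(argv)
```

Calling `as_shell(similarity_command("--vblinkf", "a.wav", "b.wav"))` yields a string that can be pasted into a terminal. Passing the list form to `subprocess.run` avoids shell quoting issues altogether.
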

diff --git a/_sources/reference.rst.txt b/_sources/reference.rst.txt
@@ -7,4 +7,5 @@ Reference
 
    ./paper.md
    ./speaker_recognition_papers.md
+   ./papers_using_wespeaker.md
    ./python_api/modules.rst
diff --git a/index.html b/index.html
@@ -111,6 +111,7 @@ <h1>Welcome to Wespeaker’s documentation!<a class="headerlink" href="#welcome-
 <li class="toctree-l1"><a class="reference internal" href="reference.html">Reference</a><ul>
 <li class="toctree-l2"><a class="reference internal" href="paper.html">Wespeaker Papers</a></li>
 <li class="toctree-l2"><a class="reference internal" href="speaker_recognition_papers.html">Speaker Recognition Papers</a></li>
+<li class="toctree-l2"><a class="reference internal" href="papers_using_wespeaker.html">Papers Implemented in WeSpeaker</a></li>
 <li class="toctree-l2"><a class="reference internal" href="python_api/modules.html">Python API Reference</a></li>
 </ul>
 </li>

diff --git a/objects.inv b/objects.inv
diff --git a/paper.html b/paper.html
@@ -54,6 +54,7 @@
 <li class="toctree-l1 current"><a class="reference internal" href="reference.html">Reference</a><ul class="current">
 <li class="toctree-l2 current"><a class="current reference internal" href="#">Wespeaker Papers</a></li>
 <li class="toctree-l2"><a class="reference internal" href="speaker_recognition_papers.html">Speaker Recognition Papers</a></li>
+<li class="toctree-l2"><a class="reference internal" href="papers_using_wespeaker.html">Papers Implemented in WeSpeaker</a></li>
 <li class="toctree-l2"><a class="reference internal" href="python_api/modules.html">Python API Reference</a></li>
 </ul>
 </li>