
Anonymization dataset and model format #10

Open

unilight opened this issue Jan 24, 2024 · 18 comments

@unilight

The README (https://github.com/DigitalPhonetics/VoicePAT?tab=readme-ov-file#anonymization) says that to anonymize my own data I should modify the following fields in the config file:

```yaml
data_dir:    # path to original data in Kaldi-format for anonymization
results_dir: # path to location for all (intermediate) results of the anonymization
models_dir:  # path to models location
```

Just wondering what exactly the Kaldi format is. I guess it refers to a text file where each line has the format `<id> <wav path>`, but I just want to double-check.
The README also says:

> Pretrained models for this anonymization can be found at https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v2.0 and earlier releases.

But the link contains several zip files to download, and it is unclear what should be done with them.

Would appreciate it if some more details could be provided. I totally understand this toolkit is under construction -- just raising my questions here.

@egaznep
Collaborator

egaznep commented Jan 24, 2024 via email

@egaznep
Collaborator

egaznep commented Jan 24, 2024

Regarding the first part of your question (the Kaldi format):

Kaldi is a Swiss-army knife for various speech processing tasks. The Kaldi format is the dataset organization used by this toolkit; for a complete reference, see https://kaldi-asr.org/doc/data_prep.html. Alternatively, you can request access to the VoicePrivacy Challenge datasets; instructions are available at https://github.com/Voice-Privacy-Challenge/Voice-Privacy-Challenge-2022.

This framework primarily uses the following files:

- `wav.scp`: a text file; each line contains a space-separated UtteranceID and file path (or a command that dumps the audio to stdout)
- `enrolls`: a text file; each line contains an UtteranceID to be used as an enrollment utterance for ASV tasks
- `trials`: a text file; each line contains an UtteranceID to be used as a trial utterance for ASV tasks
- `spk2gender`: a text file; each line contains a space-separated SpeakerID and speaker gender (m: male, f: female)
- `utt2spk`: a text file; each line contains a space-separated UtteranceID and SpeakerID

and a few others that are similar to the last two.

The function `read_kaldi_format` in `utils.dataio` gives some hints about how these files are parsed.
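
For illustration, a minimal sketch of how such key/value files can be parsed might look like this (names and behavior here are assumptions, not the actual `read_kaldi_format` implementation):

```python
# Minimal illustrative sketch: parse a Kaldi-style key/value file
# (e.g. wav.scp or utt2spk) into a dict, assuming one
# "<key> <value...>" pair per non-empty line.
from pathlib import Path

def read_kaldi_style_file(path):
    mapping = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        mapping[key] = value
    return mapping

# Example: wav.scp lines look like "utt0001 /path/to/utt0001.wav"
# wavs = read_kaldi_style_file("data/libri_dev_enrolls/wav.scp")
```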

@egaznep egaznep self-assigned this Jan 24, 2024
@egaznep
Collaborator

egaznep commented Jan 24, 2024

Regarding the second part of your question (installation): currently the Bash script 01_download_data_model.sh is provided. I am working on consolidating these steps into a Makefile and plan to submit a PR to be approved by the project maintainers. This set of changes will also include better documentation :)

@xiaoxiaomiao323
Collaborator

Hi @unilight, thanks for the question. Yes, the toolkit is still under construction and will be updated soon.

If you would like to use your own anonymized data, simply use the VoicePAT pretrained ASV and ASR models for evaluation:

1. Please prepare your own anonymized data folder `anon_data_dir`, which includes:
   [two screenshots listing the expected contents of `anon_data_dir`]
2. Run `00_install.sh` and `01_download_data_model.sh`.
3. Modify the `anon_data_dir` field in `configs/eval_pre_from_anon_datadir.yaml`.
4. Run `python run_evaluation.py --config eval_pre_from_anon_datadir.yaml`.

@unilight
Author

Thank you for the replies! So for now it would be better to use the vpc branch, right? (I guess all your instructions assume that I am on that branch)

> please prepare your own anonymized data folder anon_data_dir, which includes...

Do you mean that I should prepare one or several folders each containing the wav files I want to anonymize?

@xiaoxiaomiao323
Collaborator

> So for now it would be better to use the vpc branch, right? (I guess all your instructions assume that I am on that branch)

Yes.

> Do you mean that I should prepare one or several folders each containing the wav files I want to anonymize?

Yes, but currently the toolkit only supports the 12 dev+test datasets provided by the VoicePrivacy Challenge. These datasets include wav.scp/trials/utt2spk/spk2utt files, and they are also listed in the config file: https://github.com/DigitalPhonetics/VoicePAT/blob/vpc/configs/eval_pre_from_anon_datadir.yaml#L7-L27

The first step of the evaluation script is to prepare wav.scp/trials/utt2spk/spk2utt for the anonymized data and the evaluation subdatasets. You can find the implementation details here: https://github.com/DigitalPhonetics/VoicePAT/blob/vpc/run_evaluation.py#L154-L175

We strictly follow the VPC data design; I understand this is complicated at the beginning. Sorry for any confusion.

If you would like to use your own anonymized data instead of the VPC datasets, you will need to prepare your own data (including wav.scp/trials/utt2spk/spk2utt) and modify (or skip) https://github.com/DigitalPhonetics/VoicePAT/blob/vpc/run_evaluation.py#L154-L175.
We plan to support this function at a later time.
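
For anyone preparing their own data in the meantime, a rough sketch of generating `wav.scp` and `utt2spk` for a folder of wav files might look like the following (the `<speaker>-...` filename convention here is a hypothetical example; adapt it to your data):

```python
# Hypothetical sketch: build wav.scp and utt2spk for a folder of wavs,
# assuming filenames follow a "<speaker>-<chapter>-<utterance>.wav"
# convention (as in LibriSpeech).
from pathlib import Path

data_dir = Path("my_anon_data/libri_dev_enrolls")

with open(data_dir / "wav.scp", "w") as wav_scp, \
     open(data_dir / "utt2spk", "w") as utt2spk:
    for wav in sorted(data_dir.glob("*.wav")):
        utt_id = wav.stem                # e.g. "84-121123-0001"
        spk_id = utt_id.split("-")[0]    # e.g. "84"
        wav_scp.write(f"{utt_id} {wav.resolve()}\n")
        utt2spk.write(f"{utt_id} {spk_id}\n")
```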

@unilight
Author

@xiaoxiaomiao323 Thank you for the kind replies (and sorry for getting back so late).

So I tried running the VPC2022 baseline repo, and I now have a better understanding of how the directories should look.

I have another question. Say I have developed an anonymization system by myself and I want to use this toolkit to do evaluation on the VPC2022 dataset. After anonymizing the official 12 dev/test sets, is there anything else I should do other than just putting them in separate folders? Do I need to prepare the wav.scp/utt2spk/etc. files? Also, can you suggest what modifications I should make to the config files?

@xiaoxiaomiao323
Collaborator

> I have another question. Say I have developed an anonymization system by myself and I want to use this toolkit to do evaluation on the VPC2022 dataset. After anonymizing the official 12 dev/test sets, is there anything else I should do other than just putting them in separate folders? Do I need to prepare the wav.scp/utt2spk/etc. files? Also, can you suggest what modifications I should make to the config files?

Hi, glad to hear! We updated the README.
No need to prepare wav.scp/utt2spk/etc.; run_evaluation.py will create them automatically.

1. Prepare anonymized folders, each containing the anonymized wav files (which you have done):

   ```
   libri_dev_enrolls/*wav
   libri_dev_trials_m/*wav
   libri_dev_trials_f/*wav

   libri_test_enrolls/*wav
   libri_test_trials_m/*wav
   libri_test_trials_f/*wav

   train-clean-360/*wav
   ```

   Ignore `train-clean-360` if you don't train $ASR_{eval}^{anon}$, $ASV_{eval}^{anon}$.

2. Modify the entries in `configs/eval_pre_from_anon_datadir.yaml` and `configs/eval_post_scratch_from_anon_datadir.yaml` (ignore the latter if you don't train $ASR_{eval}^{anon}$, $ASV_{eval}^{anon}$):

   ```yaml
   anon_data_dir: !PLACEHOLDER # TODO path to anonymized data (raw audios), e.g. <anon_data_dir>/libri_test_enrolls/*wav etc.
   anon_data_suffix: !PLACEHOLDER  # suffix for dataset to signal that it is anonymized, e.g. b2, b1b, or gan
   ```

Note: VPC2024 plans to remove the VCTK dev/test datasets (the challenge plan will be released soon), so the `datasets` entry in `configs/eval_pre_from_anon_datadir.yaml` is:

```yaml
datasets:
  - name: libri_dev
    data: libri
    set: dev
    enrolls: [enrolls]
    trials: [trials_f, trials_m]
  - name: libri_test
    data: libri
    set: test
    enrolls: [enrolls]
    trials: [trials_f, trials_m]
```

If you still want to include VCTK, please modify the `datasets` entry in `configs/eval_pre_from_anon_datadir.yaml` to add the vctk datasets:

```yaml
datasets:
  - name: libri_dev
    data: libri
    set: dev
    enrolls: [enrolls]
    trials: [trials_f, trials_m]
  - name: libri_test
    data: libri
    set: test
    enrolls: [enrolls]
    trials: [trials_f, trials_m]
  - name: vctk_dev
    data: vctk
    set: dev
    enrolls: [enrolls]
    trials: [trials_f_all, trials_m_all]
  - name: vctk_test
    data: vctk
    set: test
    enrolls: [enrolls]
    trials: [trials_f_all, trials_m_all]
```
3. Perform the evaluations (a small sanity check of the folder layout from step 1 is sketched below):

   ```
   python run_evaluation.py --config eval_pre_from_anon_datadir.yaml
   python run_evaluation.py --config eval_post_scratch_from_anon_datadir.yaml
   ```

   Ignore the second command if you don't train $ASR_{eval}^{anon}$, $ASV_{eval}^{anon}$.
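
As that sanity check, a purely illustrative snippet (not part of VoicePAT; the `anon_data_dir` path is a placeholder) could verify that the expected folders exist and contain wav files:

```python
# Illustrative sanity check (not part of VoicePAT): verify that the
# expected anonymized folders exist under anon_data_dir and are non-empty.
from pathlib import Path

anon_data_dir = Path("my_anon_data")  # same path as in the config
expected = [
    "libri_dev_enrolls", "libri_dev_trials_m", "libri_dev_trials_f",
    "libri_test_enrolls", "libri_test_trials_m", "libri_test_trials_f",
]
for name in expected:
    wavs = list((anon_data_dir / name).glob("*.wav"))
    print(f"{name}: {len(wavs)} wav files" if wavs else f"{name}: MISSING or empty")
```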

@unilight
Author

@xiaoxiaomiao323 Thank you for the reply!

I mainly want to align with the evaluation protocol in VPC2022, which, in my understanding, requires training $ASR_{eval}^{anon}$ and $ASV_{eval}^{anon}$. Just wondering if this understanding is correct, and whether the training takes a long time...?

@xiaoxiaomiao323
Collaborator

> Just wondering if this understanding is correct, and whether the training takes a long time...?

No problem! After many test runs, we found that this really depends on the hard drives.
On one V100: if you have an SSD or another high-performance drive, $ASV_{eval}^{anon}$ takes <3h, while $ASR_{eval}^{anon}$ takes ~35h.
But if the drive is old and slow, in the worst case $ASV_{eval}^{anon}$ takes ~10h and $ASR_{eval}^{anon}$ takes ~150h.
Increasing num_workers in the ASV and ASR training configs may help to speed up the processing.

@unilight
Author

@xiaoxiaomiao323 Thank you for the reply!

(I know you probably do not have the official answer but) just wondering whether training $ASR_{eval}^{anon}$ is really necessary. In my understanding, VPC adopted this process because it is claimed that the WER gets greatly improved, but if the anonymized speech cannot yield a good WER on an ASR model trained with natural speech, doesn't it mean that the anonymized speech is not natural enough? Or are there some other reasons?

@xiaoxiaomiao323
Collaborator

> (I know you probably do not have the official answer but) just wondering whether training $ASR_{eval}^{anon}$ is really necessary. In my understanding, VPC adopted this process because it is claimed that the WER gets greatly improved, but if the anonymized speech cannot yield a good WER on an ASR model trained with natural speech, doesn't it mean that the anonymized speech is not natural enough? Or are there some other reasons?

I agree with your opinion. Actually, we already decided not to use $ASR_{eval}^{anon}$ for VPC2024 :)

@unilight
Author

@xiaoxiaomiao323 I see, thank you for the answers! :)

So to understand the evaluation process I synced the latest vpc branch and ran 02_run.sh. Then I encountered the following error:

```
...
100%|█████████▉| 761/762 [02:19<00:00, 27.26it/s]
100%|██████████| 762/762 [02:19<00:00,  5.47it/s]
Done
Processing libri_train_360...
Traceback (most recent call last):
  File "run_anonymization_dsp.py", line 19, in <module>
    pipeline.run_anonymization_pipeline(datasets)
  File "/data/group1/z44476r/Experiments/VoicePAT/anonymization/pipelines/dsp_pipeline.py", line 34, in run_anonymization_pipeline
    process_data(dataset_path=self.libri_360_data_dir,
  File "/data/group1/z44476r/Experiments/VoicePAT/anonymization/modules/dsp/anonymise_dir_mcadams_rand_seed.py", line 69, in process_data
    shutil.copytree(dataset_path, output_path)
  File "/home/z44476r/data/Experiments/VoicePAT/venv/lib/python3.8/shutil.py", line 555, in copytree
    with os.scandir(src) as itr:
FileNotFoundError: [Errno 2] No such file or directory: 'data/train-clean-360'
2024-02-20 16:39:05,022 - __main__- INFO - Preparing datadir according to the Kaldi format.
2024-02-20 16:39:06,516 - root- INFO - Perform ASV evaluation
...
```

which leads to an error in `python run_evaluation.py --config eval_post_scratch_from_anon_datadir.yaml`.
However, I do find the anonymized waveform files in `data/train-clean-360-asv_dsp/wav/*.wav`.
Are there some variables that need to be changed in the configs?

@xiaoxiaomiao323
Collaborator

xiaoxiaomiao323 commented Feb 21, 2024

@unilight, sorry, I think this is just a naming mistake. When did you download "data.zip"? There should be no problem if you download the new version of "data.zip": https://github.com/DigitalPhonetics/VoicePAT/blob/vpc/01_download_data_model.sh#L65
We used to use "train-clean-360-asv", but we have now changed to "train-clean-360" without "-asv". For now, please rename

"data/train-clean-360-asv_dsp" -> "data/train-clean-360_dsp"

"data/train-clean-360-asv" -> "data/train-clean-360"

and rerun.
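
If it helps, the renames can be done with `mv` or a short Python snippet like this (paths exactly as above):

```python
# Convenience sketch for the renames suggested above.
import os

os.rename("data/train-clean-360-asv_dsp", "data/train-clean-360_dsp")
os.rename("data/train-clean-360-asv", "data/train-clean-360")
```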

@unilight
Author

@xiaoxiaomiao323 Thank you for the reply! Yes, I indeed downloaded the files earlier (in late December). I'll rename and rerun, and let you know if I have more problems. (Sorry for running the scripts while the work is still in progress...)

@xiaoxiaomiao323
Collaborator

@unilight , no problem, thanks for your patience. Feel free to post any questions.

@unilight
Author

@xiaoxiaomiao323 I have finished running 02_run.sh. (FYI, training the ASV model for 10 epochs on my server's single Tesla V100 (32GB) took 18 hours. Maybe we have a not-so-efficient file system here...)

I am trying to interpret the numbers and compare them with those in the VoicePAT paper, but I am having a hard time. I have the following results:

```
   dataset split gender enrollment     trial     EER
0    libri   dev      f   original  original  16.760
1    libri   dev      f   original      anon  13.744
2    libri   dev      f       anon      anon  13.070
3    libri   dev      m   original  original   1.710
4    libri   dev      m   original      anon   2.010
5    libri   dev      m       anon      anon   1.087
6    libri  test      f   original  original   8.074
7    libri  test      f   original      anon   8.210
8    libri  test      f       anon      anon   8.576
9    libri  test      m   original  original   1.800
10   libri  test      m   original      anon   2.197
11   libri  test      m       anon      anon   0.916
root - --- EER computation time: 10.313560 min ---
```

Can you kindly tell me which numbers to look at?

Also, I am very interested in GVD, and am wondering which variables in the config I should modify to get the GVD numbers.

@xiaoxiaomiao323
Collaborator

xiaoxiaomiao323 commented Feb 22, 2024

> (FYI, training the ASV model for 10 epochs on my server's single Tesla V100 (32GB) took 18 hours. Maybe we have a not-so-efficient file system here...)

Yeah, other people have also pointed this out. It really depends on the I/O speed of the machine. Try increasing num_workers next time; it helps.

For the EER results:

1) $ASV_{eval}$: the provided model, trained on original LibriSpeech-360

```
enrollment   trial      scenario
original     original   Unprotected (OO)
original     anon       Ignorant (OA)
anon         anon       Lazy-Informed (AA-lazy)
```

2) $ASV_{eval}^{anon}$: the model you trained on anonymized LibriSpeech-360

```
enrollment   trial      scenario
anon         anon       Semi-Informed (AA-semi)
```

The results you listed were obtained from $ASV_{eval}^{anon}$, so you just need to look at the rows with

enrollment=anon trial=anon
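
For example, assuming the printed table is loaded into a pandas DataFrame (a convenience for illustration, not toolkit API), the relevant rows can be picked out like this:

```python
# Illustrative only: filter the printed results for the Semi-Informed
# condition (anon enrollment vs. anon trial).
import pandas as pd

# A few rows from the table printed above.
results = pd.DataFrame(
    [
        ["libri", "dev", "f", "original", "original", 16.760],
        ["libri", "dev", "f", "original", "anon", 13.744],
        ["libri", "dev", "f", "anon", "anon", 13.070],
    ],
    columns=["dataset", "split", "gender", "enrollment", "trial", "EER"],
)

aa_semi = results[(results["enrollment"] == "anon") & (results["trial"] == "anon")]
print(aa_semi)
```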

And we didn't test the DSP system when we wrote the VoicePAT paper.

> which variables in the config should I modify to get those GVD numbers?

GVD is computed by $ASV_{eval}$, so change the config https://github.com/DigitalPhonetics/VoicePAT/blob/vpc/configs/eval_pre_from_anon_datadir.yaml:

```yaml
utility:
  - asr
  - gvd
```

Comment out `asr` if you want to skip it:

```yaml
utility:
  # - asr
  - gvd
```
