Can't reproduce the result of the paper #12
Comments
Hi @zz12375, I am also not able to reproduce the results. Could you please let me know what PER you are getting on fr alone?
Evaluated with the model pretrained for 300 epochs, the PER tuned on 5-hour fr is 46.49% (frozen features) and 40.05% (fine-tuned features).
Thanks. Here are some PERs on the French test set:
This model:
Wav2Vec [Paper]:
ConvDMM [Paper]:
Note: I did not use this toolkit to compute the PER, but I wanted to be consistent with the CTC phone classifier that this repo uses, so I copied the class at CPC_audio/cpc/eval/common_voices_eval.py, line 128 (commit b98a1bd), into my codebase.
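For reference, my PER computation is just the conventional one, sketched below: Levenshtein edit distance between predicted and reference phone sequences, normalized by the total number of reference phones. It is my own implementation, not necessarily identical to this repo's evaluation code.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences of phone IDs."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (h != r))  # substitution
    return d[-1]

def per(hyps, refs):
    """Corpus-level phone error rate, in percent."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total
```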
In your case, maybe increase the number of epochs.
Thank you very much for sharing your results. Do you mean increasing the number of epochs during the downstream fine-tuning stage (CTC loss), or during the pretraining stage?
I meant during downstream CTC phone classifier training.
Hi, team.
I am very grateful that you provide the code and data splits for your CPC audio paper (https://arxiv.org/abs/2002.02848).
First I pretrained Mod. CPC on libri-100 and froze the features for the Common Voice 1-hour ASR task. I got an average PER of 45.2% over 5 languages (es, fr, it, ru, tt), versus the 43.9% reported in your paper (Table 3). My result is close (1.3% absolute) to what you reported, which seemed reasonable.
But when I tested the pretrained features on the 5-hour Common Voice ASR tasks (es, fr, it, ru, tt), I only got an average PER of 42.5% with frozen features, a big gap (3.7% absolute) from the reported PER (38.8%, Table 5 in the paper). When fine-tuning the features, the gap was even bigger (6.2% absolute): my average PER was 37.2%, while the paper reports 31.0%.
Unfortunately, the 5-hour Common Voice ASR experiments also performed poorly when training from scratch: an average PER of 43.2%, far behind the 38.3% reported in your paper.
I would be very thankful if you could kindly provide more detailed hyperparameters to help me reproduce your results.
In particular, I noticed you have an optional argument --LSTM in ./eval/common_voices_eval.py that adds an LSTM layer before the linear softmax layer. I think it would significantly increase the model capacity and might lead to better performance. Did you use it?
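For reference, here is my reading of the two classifier heads that flag switches between: frame-level features, optionally passed through an LSTM, then a linear layer and log-softmax, trained with CTC. This is a sketch of my understanding, not the repo's exact class; the feature dimension and phone-inventory size below are placeholders.

```python
import torch.nn as nn

class PhoneClassifier(nn.Module):
    def __init__(self, feature_dim=256, n_phones=50, use_lstm=False):
        super().__init__()
        # With --LSTM, an LSTM layer sits before the linear softmax layer.
        self.lstm = (nn.LSTM(feature_dim, feature_dim, batch_first=True)
                     if use_lstm else None)
        self.linear = nn.Linear(feature_dim, n_phones + 1)  # +1 for the CTC blank

    def forward(self, features):  # features: (batch, time, feature_dim)
        if self.lstm is not None:
            features, _ = self.lstm(features)
        # Transpose to (time, batch, classes) before feeding nn.CTCLoss.
        return self.linear(features).log_softmax(dim=-1)
```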
Thank you very much!
For now I used the default hyperparameters for the Common Voice ASR transfer experiments:
--batchSize 8
--lr 2e-4
--nEpoch 30
--kernelSize 8
...
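For concreteness, this is roughly how I wire those defaults into a downstream CTC training step, using the PhoneClassifier sketch above. The tensors here are synthetic stand-ins; my real inputs are frozen CPC features and Common Voice phone transcripts.

```python
import torch

torch.manual_seed(0)
n_phones = 50                                               # placeholder inventory size
model = PhoneClassifier(feature_dim=256, n_phones=n_phones)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # --lr 2e-4
ctc_loss = torch.nn.CTCLoss(blank=n_phones)                 # blank = last class index

for epoch in range(30):                                     # --nEpoch 30
    feats = torch.randn(8, 128, 256)                        # --batchSize 8 (batch, frames, dim)
    phones = torch.randint(0, n_phones, (8, 30))            # reference phone IDs
    log_probs = model(feats).transpose(0, 1)                # CTC expects (T, N, C)
    loss = ctc_loss(log_probs, phones,
                    torch.full((8,), 128, dtype=torch.long),  # input lengths
                    torch.full((8,), 30, dtype=torch.long))   # target lengths
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```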