This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Can't reproduce the result of the paper #12

Open
zz12375 opened this issue Jul 30, 2020 · 5 comments

Comments


zz12375 commented Jul 30, 2020

Hi, team.
I am very grateful that you provide the code and data splits for your CPC audio paper (https://arxiv.org/abs/2002.02848).

First, I pretrained Mod. CPC on libri-100 and froze the features for the common voice 1-hour ASR task. I got an average PER of 45.2% over 5 languages (es, fr, it, ru, tt), versus the 43.9% reported in your paper (Table 3). My result is close (-1.3%) to yours, which seemed reasonable.

But when I tested the pre-trained features on the 5-hour common voice ASR tasks (es, fr, it, ru, tt), I only got an average PER of 42.5% with frozen features, a big gap (-3.7%) from the reported PER (38.8%, Table 5 in the paper). When finetuning the features, the gap was even bigger: my average PER was 37.2%, while the paper reports 31.0%.
Unfortunately, the 5-hour common voice ASR experiments also perform badly when training from scratch: an average PER of 43.2%, far behind the 38.3% reported in your paper.

I would be very thankful if you could kindly provide more detailed hyper-parameters to help me reproduce your results.
In particular, I noticed there is an optional argument --LSTM in ./eval/common_voices_eval.py that adds an LSTM layer before the linear softmax layer. I think it would significantly increase the model capacity and may lead to better performance; did you use it?
Thank you very much!

For now, I used the default hyper-parameters for the common voice ASR transfer experiments:
--batchSize 8
--lr 2e-4
--nEpoch 30
--kernelSize 8
......
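For reference, the --LSTM option discussed above can be pictured as a small head on top of the frozen features. The sketch below is hypothetical (class and argument names are mine, not the repo's; the actual flag lives in ./eval/common_voices_eval.py), but it shows the structural difference the flag would make: one LSTM layer inserted before the linear softmax layer, with an extra output unit reserved for the CTC blank.

```python
import torch


class PhoneClassifierHead(torch.nn.Module):
    """Hypothetical sketch of the phone classifier head, with an
    optional LSTM layer before the linear softmax layer (as the
    --LSTM flag in ./eval/common_voices_eval.py is described)."""

    def __init__(self, dim_encoder, n_phones, use_lstm=False):
        super().__init__()
        self.use_lstm = use_lstm
        if use_lstm:
            # Extra recurrent layer: significantly more capacity
            # than the linear-only head.
            self.lstm = torch.nn.LSTM(dim_encoder, dim_encoder,
                                      batch_first=True)
        # +1 output unit for the CTC blank symbol
        self.linear = torch.nn.Linear(dim_encoder, n_phones + 1)

    def forward(self, features):
        # features: (batch, time, dim_encoder)
        if self.use_lstm:
            features, _ = self.lstm(features)
        return self.linear(features)
```

With use_lstm=False this reduces to the plain linear softmax classifier; with use_lstm=True the per-frame logits are conditioned on temporal context, which is why it could plausibly change the reported PERs.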

@sameerkhurana10

Hi @zz12375,

I am also not able to reproduce the results.

Could you please let me know what PER you are getting on fr alone?


zz12375 commented Jan 4, 2021

Hi @sameerkhurana10,

Evaluated with the 300-epoch pretrained model, the PER tuned on 5-hour fr is 46.49% (frozen features) and 40.05% (finetuned features).


sameerkhurana10 commented Jan 4, 2021

Thanks.

Here are some PERs on the French test set:

This model:

  • Classifier Training Data: 1 hour French
  • Feature Extractor Frozen: Yes
  • Model Training Data: Librispeech 960 hours
  • PER: 46. %

Wav2Vec [Paper]:

ConvDMM [Paper]:

Note: I did not use this toolkit to get the PER, but I wanted to be consistent with the CTC phone classifier that this repo is using, so I copied the following class into my codebase:

class CTCphone_criterion(torch.nn.Module):
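The body of that class is not quoted above, so for readers following along, here is a minimal sketch of what a CTC phone criterion of this shape typically looks like. This is an assumption-labeled reconstruction, not the repo's actual CTCphone_criterion implementation: a linear projection over the encoder features followed by torch.nn.CTCLoss, with the blank symbol mapped to the last output index.

```python
import torch


class CTCPhoneCriterion(torch.nn.Module):
    """Minimal CTC phone criterion sketch (hypothetical; not the
    repo's actual CTCphone_criterion). A linear layer maps encoder
    features to phone logits, and CTC loss aligns them to the
    unsegmented phone labels."""

    def __init__(self, dim_encoder, n_phones):
        super().__init__()
        # +1 output for the CTC blank symbol (index n_phones)
        self.linear = torch.nn.Linear(dim_encoder, n_phones + 1)
        self.loss = torch.nn.CTCLoss(blank=n_phones, zero_infinity=True)

    def forward(self, features, labels, label_lengths):
        # features: (batch, time, dim_encoder)
        # labels: (batch, max_label_len), label_lengths: (batch,)
        logits = self.linear(features)
        log_probs = torch.nn.functional.log_softmax(logits, dim=2)
        # CTCLoss expects (time, batch, classes)
        log_probs = log_probs.permute(1, 0, 2)
        input_lengths = torch.full((features.size(0),),
                                   features.size(1), dtype=torch.long)
        return self.loss(log_probs, labels, input_lengths, label_lengths)
```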

In your case, maybe try increasing the number of epochs.


zz12375 commented Jan 4, 2021

Thank you very much for sharing your results.

Do you mean increasing the number of epochs during the downstream finetuning stage (CTC loss), or during the pretraining stage?

@sameerkhurana10

I meant during downstream CTC phone classifier training.
