A PyTorch LSTM language model that supports class-based softmax (following the paper) and NCE (noise contrastive estimation, following the paper; thanks to Stonesjtu's amazing project) for speeding up training.
In class-based softmax, each word is assigned to one class, so the probability of a word becomes:
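```
P(w \mid h) = P(c(w) \mid h) \cdot P(w \mid c(w), h)
```

where c(w) is the class assigned to word w and h is the current hidden state (the standard two-step class-based factorization).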
Theoretically, the computational cost of the output layer can be reduced from O(dk) to O(d\sqrt{k}), where d is the size of the last hidden layer and k is the vocabulary size: instead of scoring all k words, the model scores roughly \sqrt{k} classes and then \sqrt{k} words within the chosen class.
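For a rough sense of the saving, assuming a hidden size of d = 512 (an illustrative value, not necessarily the size used in this project) and the 60k-word vocabulary from the experiments below:

```
d k = 512 \times 60000 \approx 3.1 \times 10^{7}
\qquad\text{vs.}\qquad
2 d \sqrt{k} \approx 2 \times 512 \times 245 \approx 2.5 \times 10^{5}
```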
But in practice there is a lot of overhead (especially on GPU).
NCE turns the probability-estimation problem into a binary-classification problem: given a noise distribution, the model is trained to distinguish the target word from noise samples. The biggest trick in NCE is that the normalization term of the probability is treated as a constant, which saves a lot of time during both training and testing.
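A minimal sketch of this loss is shown below. It is not the exact code in decoder.py; the tensor shapes and the helper name `nce_loss` are illustrative assumptions.

```python
import torch

def nce_loss(target_scores, noise_scores, target_noise_prob, sample_noise_prob, k):
    """Binary-classification NCE loss with the normalizer fixed to a constant (1).

    target_scores:     (batch,)    unnormalized log p_model of the true next words
    noise_scores:      (batch, k)  unnormalized log p_model of k sampled noise words
    target_noise_prob: (batch,)    p_noise of the true next words
    sample_noise_prob: (batch, k)  p_noise of the sampled noise words
    """
    # Because the partition function is treated as a constant, exp(score)
    # is used directly as the (unnormalized) model probability.
    p_model_target = target_scores.exp()
    p_model_noise = noise_scores.exp()

    # P(D=1 | w, h) = p_model / (p_model + k * p_noise): the word came from the model.
    p_true = p_model_target / (p_model_target + k * target_noise_prob)
    # P(D=0 | w, h) = k * p_noise / (p_model + k * p_noise): the word came from the noise.
    p_false = (k * sample_noise_prob) / (p_model_noise + k * sample_noise_prob)

    # Maximize the log-likelihood of the correct binary labels.
    loss = -(torch.log(p_true + 1e-10) + torch.log(p_false + 1e-10).sum(dim=1))
    return loss.mean()
```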
Before training the model, please run the following script to build a vocab with classes:
python build_vocab_with_class.py --ncls 30 --min_count 0
The vocab built above is based on word frequency; you can also build your own vocab using other methods (see the example in ./data/penn/vocab.c.txt, and note that the class should be an integer).
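For reference, one common frequency-based scheme (splitting the sorted vocabulary into classes of roughly equal total frequency) looks like the sketch below. The actual rule used by build_vocab_with_class.py may differ; the function name `assign_classes` is only illustrative.

```python
from collections import Counter

def assign_classes(corpus_tokens, ncls=30, min_count=0):
    """Sort words by frequency and split them into ncls classes with roughly
    equal total frequency (a common heuristic, not necessarily the exact rule
    used by build_vocab_with_class.py)."""
    counts = Counter(corpus_tokens)
    words = [(w, c) for w, c in counts.most_common() if c >= min_count]
    total = sum(c for _, c in words)
    per_class = total / ncls

    word2cls = {}
    cls, acc = 0, 0.0
    for w, c in words:
        word2cls[w] = cls
        acc += c
        # Move to the next class once this one holds its share of the mass.
        if acc >= per_class and cls < ncls - 1:
            cls, acc = cls + 1, 0.0
    return word2cls
```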
Run the training script:
python train.py --cuda --data [data_path] --decoder [sm|nce|cls]
data/ : corpus and dictionary
params/ : saved parameters
data.py : custom data iterator and dictionary
model.py : the basic RNN model
decoder.py : the decoder layers (softmax, class-based softmax and NCE)
train.py : the training process
utils.py : utility functions
Experiments on the swb corpus (60k vocab):
Average training time per epoch:
softmax: 1061s
nce: 471s
class-based softmax: 465s