Using a CNN-RNN Merge Architecture with GloVe Embeddings
Example captions:

- man in black jacket is standing in front of large building
- little girl in pink shirt is sitting in front of rainbow painting
- group of people are standing in front of fence

Hyperparameter | Value |
---|---|
CNN | InceptionV3 |
RNN | CuDNNLSTM |
Word Embedding | GloVe |
Loss | sparse_categorical_crossentropy |
Optimizer | Adam |
Embedding Dimension | 300 |
Embedding Trainable | True |
Layer Size | 256 |
Dropout Rate | 0.5 |
Max Epochs | 20 |
Early Stopping | monitor='val_loss', min_delta=0.01, patience=10 |
Model Checkpoint | monitor='val_loss', save_best_only=True |
Batch Size | 2048 |
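
The early stopping and checkpoint settings interact in a way that is easy to misread, so here is a minimal sketch of the two rules replayed over a list of validation losses. It is a simplified re-implementation in plain Python, not the Keras callbacks themselves, and the function name `simulate_training` and the loss values are made up for illustration.

```python
def simulate_training(val_losses, min_delta=0.01, patience=10):
    """Replay per-epoch validation losses through simplified callback rules.

    Returns (epochs_run, checkpoint_epoch), using 1-based epoch numbers.
    """
    best_for_stopping = float("inf")    # baseline used by the early-stopping rule
    best_for_checkpoint = float("inf")  # baseline used by the checkpoint rule
    checkpoint_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # save_best_only=True: overwrite the checkpoint on any strict improvement.
        if loss < best_for_checkpoint:
            best_for_checkpoint = loss
            checkpoint_epoch = epoch
        # Early stopping only counts improvements larger than min_delta.
        if loss < best_for_stopping - min_delta:
            best_for_stopping = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, checkpoint_epoch
    return len(val_losses), checkpoint_epoch


# Made-up validation losses for 20 epochs (the Max Epochs setting).
val_losses = [3.50, 3.30, 3.20, 3.12, 3.08] + [3.075] * 15
stopped_at, saved_at = simulate_training(val_losses)
print(stopped_at, saved_at)
```

With a maximum of 20 epochs and a patience of 10, the stopping rule in this sketch can only fire if the last improvement larger than `min_delta` happens by epoch 10. Otherwise training simply runs all 20 epochs and the checkpoint keeps the best weights seen.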

Loss | With GloVe | Without GloVe |
---|---|---|
Train | 2.6006 | 2.6338 |
Dev | 3.0556 | 3.1157 |
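
To make the loss values easier to interpret, the following sketch computes a sparse categorical cross-entropy by hand: the mean negative natural-log probability that the model assigns to the correct next word. The probabilities and targets are toy numbers, not taken from the trained model, and any padding or masking used in the actual training is not modeled here.

```python
import math


def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability of the target class ids.

    probs: list of per-step probability distributions (each sums to 1).
    targets: list of integer class ids, one per step.
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


# Toy vocabulary of 4 words and 2 prediction steps (made-up numbers).
probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
targets = [1, 2]

loss = sparse_categorical_crossentropy(probs, targets)
# exp(-loss) is the geometric-mean probability given to the correct word.
mean_prob = math.exp(-loss)
print(round(loss, 4), round(mean_prob, 4))
```

Read this way, a lower loss means the model puts more probability mass on the reference word at each step, which is not the same thing as producing captions that score higher on CIDEr or BLEU.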

Embedding | Decoding | CIDEr |
---|---|---|
With GloVe | Greedy Search | 0.44643053 |
With GloVe | Beam Search (B = 3) | 0.48076379 |
Without GloVe | Greedy Search | 0.46058652 |
Without GloVe | Beam Search (B = 3) | 0.49261228 |

Embedding | Decoding | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
---|---|---|---|---|---|
With GloVe | Greedy Search | 0.594145 | 0.373877 | 0.242624 | 0.152859 |
With GloVe | Beam Search (B = 3) | 0.604485 | 0.393628 | 0.267615 | 0.177017 |
Without GloVe | Greedy Search | 0.612248 | 0.391558 | 0.254003 | 0.16078 |
Without GloVe | Beam Search (B = 3) | 0.619149 | 0.405042 | 0.27374 | 0.179306 |
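
The relative gain of beam search over greedy search can be checked directly from the reported scores. The short script below does that arithmetic for CIDEr, BLEU-1, and BLEU-4; the helper name `relative_gain` is just an illustrative choice.

```python
# (greedy, beam with B = 3) score pairs copied from the CIDEr and BLEU results.
scores = {
    ("With GloVe", "CIDEr"): (0.44643053, 0.48076379),
    ("Without GloVe", "CIDEr"): (0.46058652, 0.49261228),
    ("With GloVe", "BLEU-1"): (0.594145, 0.604485),
    ("Without GloVe", "BLEU-1"): (0.612248, 0.619149),
    ("With GloVe", "BLEU-4"): (0.152859, 0.177017),
    ("Without GloVe", "BLEU-4"): (0.16078, 0.179306),
}


def relative_gain(greedy, beam):
    """Relative improvement of beam search over greedy search, in percent."""
    return 100.0 * (beam - greedy) / greedy


gains = {key: relative_gain(*pair) for key, pair in scores.items()}
for (model, metric), gain in gains.items():
    print(f"{model:14s} {metric:7s} {gain:5.1f}%")
```

The gains come out at roughly 7 to 8% for CIDEr, 1 to 2% for BLEU-1, and 12 to 16% for BLEU-4, so the size of the improvement depends strongly on the metric.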
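
- InceptionV3 is used to encode the images.
- Beam search (B = 3) beats greedy search on every metric: relative to greedy search, CIDEr improves by about 7 to 8% and BLEU-4 by about 12 to 16%, while BLEU-1 improves by only 1 to 2%.
- Using a large batch size with early stopping and model checkpoints can achieve a lower cross-entropy loss.
- Fine-tuning a pre-trained embedding does not always give better results: here GloVe lowered the train and dev loss but scored slightly lower on CIDEr and BLEU.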
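
To illustrate why beam search can beat greedy search, here is a minimal sketch on a hypothetical toy model. The `TOY_MODEL` table, its probabilities, and the function names are invented for this example; the decoder actually used for the results above is not shown in this document and may differ (for instance, by normalizing scores by caption length).

```python
import math

END = "<end>"

# Hypothetical toy language model: maps a prefix to next-word probabilities.
TOY_MODEL = {
    (): {"man": 0.6, "dog": 0.4},
    ("man",): {"stands": 0.4, "sits": 0.3, "walks": 0.3},
    ("dog",): {"runs": 0.9, "sits": 0.1},
}


def next_word_probs(prefix):
    # Any prefix not listed above ends the caption with probability 1.
    return TOY_MODEL.get(tuple(prefix), {END: 1.0})


def beam_search(beam_width, max_len=10):
    """Return (words, log_prob) of the best caption found.

    beam_width=1 reduces to greedy search.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next word.
        candidates = []
        for prefix, score in beams:
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == END:
                finished.append((prefix[:-1], score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished prefixes if nothing reached the end token.
    return max(finished or beams, key=lambda c: c[1])


greedy_words, greedy_logp = beam_search(beam_width=1)
beam_words, beam_logp = beam_search(beam_width=3)
print(greedy_words, round(math.exp(greedy_logp), 2))
print(beam_words, round(math.exp(beam_logp), 2))
```

In this toy case greedy search commits to "man" because it is the most likely first word and ends with a caption of probability 0.24, while a beam of width 3 also keeps "dog" alive and finds "dog runs" with probability 0.36.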