DavidMouse1118/Img2Text

Img2Text

Image Caption Generator

Using a CNN-RNN merge architecture with GloVe embeddings

Example generated captions (for the sample images shown in the repo):

- "man in black jacket is standing in front of large building"
- "little girl in pink shirt is sitting in front of rainbow painting"
- "group of people are standing in front of fence"

Model and Hyperparameters:

CNN: InceptionV3
RNN: CuDNNLSTM
Word Embedding: GloVe
Loss: sparse_categorical_crossentropy
Optimizer: Adam
Embedding Dimension: 300
Embedding Trainable: True
Layer Size: 256
Dropout Rate: 0.5
Max Epochs: 20
Early Stopping: monitor='val_loss', min_delta=0.01, patience=10
Model Checkpoint: monitor='val_loss', save_best_only=True
Batch Size: 2048
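Since Embedding Trainable is True, the pre-trained GloVe vectors initialize the embedding layer (Embedding Dimension 300) and are then fine-tuned with the rest of the model. A minimal sketch of building such an initialization matrix, assuming a standard GloVe text file like glove.6B.300d.txt; the path and function names are illustrative, not taken from the repo:

```python
import numpy as np

EMBED_DIM = 300  # matches the "Embedding Dimension: 300" setting above


def load_glove(path):
    """Parse a GloVe text file (one 'word v1 v2 ...' line per word) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def build_embedding_matrix(word_index, glove, dim=EMBED_DIM):
    """Row i holds the GloVe vector for the word whose integer index is i.

    word_index maps word -> index (1-based, as Keras tokenizers produce);
    row 0 is reserved for padding.
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, i in word_index.items():
        if word in glove:
            matrix[i] = glove[word]  # out-of-vocabulary words keep a zero row
    return matrix
```

Words missing from GloVe keep a zero row; because the embedding layer is trainable, those rows can still be learned during training.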

Performance:

1. Cross-Entropy Loss (lower is better)

       With GloVe  Without GloVe
Train  2.6006      2.6338
Dev    3.0556      3.1157
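The numbers above are sparse categorical cross-entropy: the mean negative log-probability the model assigns to the correct next-token id ("sparse" because targets are integer ids rather than one-hot vectors). A minimal numpy sketch of the metric (function name hypothetical):

```python
import numpy as np


def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-likelihood of integer target token ids.

    probs:   (n_tokens, vocab_size) softmax outputs
    targets: (n_tokens,) integer token ids
    """
    eps = 1e-12  # guard against log(0)
    picked = probs[np.arange(len(targets)), targets]  # prob of each true token
    return float(-np.log(picked + eps).mean())
```

If the loss is averaged per token, the dev loss of about 3.06 corresponds to a next-word perplexity of roughly e^3.06 ≈ 21.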

2. CIDEr Score (Test Set)

                                     CIDEr
With GloVe     Greedy Search         0.44643053
               Beam Search (B = 3)   0.48076379
Without GloVe  Greedy Search         0.46058652
               Beam Search (B = 3)   0.49261228

3. BLEU Score (Test Set)

                                     BLEU-1    BLEU-2    BLEU-3    BLEU-4
With GloVe     Greedy Search         0.594145  0.373877  0.242624  0.152859
               Beam Search (B = 3)   0.604485  0.393628  0.267615  0.177017
Without GloVe  Greedy Search         0.612248  0.391558  0.254003  0.160780
               Beam Search (B = 3)   0.619149  0.405042  0.273740  0.179306
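The gap between Greedy Search and Beam Search comes from beam search keeping the B highest-scoring partial captions alive instead of committing to the single most likely next word at each step. A self-contained sketch of the decoding loop, where toy_model stands in for the trained decoder's next-word distribution (its tokens and probabilities are made up for illustration):

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Decode a sequence by expanding the beam_width best partial captions.

    next_log_probs(seq) -> {token: log-probability of token given seq}.
    Returns the highest-scoring (sequence, cumulative log-prob) pair.
    """
    beams = [([start_token], 0.0)]  # (partial sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                completed.append((seq, score))  # finished caption
            else:
                beams.append((seq, score))      # keep expanding
        if not beams:
            break
    completed.extend(beams)  # fall back to unfinished beams at max_len
    return max(completed, key=lambda c: c[1])


def greedy_search(next_log_probs, start_token, end_token, max_len=10):
    """Greedy decoding is exactly beam search with a beam of 1."""
    return beam_search(next_log_probs, start_token, end_token,
                       beam_width=1, max_len=max_len)


def toy_model(seq):
    """Hypothetical decoder with a greedy trap: "a" looks best at step one,
    but the caption through "the" has higher total probability."""
    if seq == ["<start>"]:
        return {"a": math.log(0.6), "the": math.log(0.4)}
    if seq[-1] == "a":
        return {"cat": math.log(0.5), "<end>": math.log(0.5)}
    if seq[-1] == "the":
        return {"dog": math.log(0.9), "<end>": math.log(0.1)}
    return {"<end>": math.log(1.0)}
```

On the toy model, greedy decoding commits to "a" and ends with total probability 0.6 x 0.5 = 0.30, while the B = 3 beam recovers "the dog" with 0.4 x 0.9 = 0.36, mirroring how beam search lifts the CIDEr and BLEU scores above.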

Conclusion:

  1. InceptionV3 is used to encode the images.
  2. Beam Search is better than Greedy Search by roughly 10% on both CIDEr and BLEU scores.
  3. Using a large batch size with early stopping and model checkpointing achieves lower cross-entropy loss.
  4. Fine-tuning a pre-trained embedding may not always give better results: here the model without GloVe scores slightly higher on both CIDEr and BLEU.
