update documentation and example scripts
Showing 15 changed files with 261 additions and 25 deletions.
@@ -0,0 +1,77 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1

<<com
Example Slurm evaluation script.
Notes:
- VQAv2 test-dev and test-std annotations are not publicly available.
  To evaluate on these splits, please follow the VQAv2 instructions and submit to EvalAI.
  This script evaluates on the val split instead.
- VizWiz test-dev annotations are also not publicly available; please go through EvalAI.
com

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

echo go $COUNT_NODE
echo $HOSTNAMES

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/eval/evaluate.py \
--vision_encoder_path ViT-L-14 \
--vision_encoder_pretrained openai \
--lm_path anas-awadalla/mpt-1b-redpajama-200b \
--tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
--cross_attn_every_n_layers 1 \
--checkpoint_path "openflamingo/OpenFlamingo-3B-vitl-mpt1b/checkpoint.pt" \
--results_file "results.json" \
--precision fp32 \
--batch_size 8 \
--deepspeed \
--eval_coco \
--eval_vqav2 \
--eval_flickr30 \
--eval_ok_vqa \
--eval_textvqa \
--eval_vizwiz \
--eval_hateful_memes \
--coco_train_image_dir_path "/path/to/mscoco_karpathy/train2014" \
--coco_val_image_dir_path "/path/to/mscoco_karpathy/val2014" \
--coco_karpathy_json_path "/path/to/mscoco_karpathy/dataset_coco.json" \
--coco_annotations_json_path "/path/to/mscoco_karpathy/annotations/captions_val2014.json" \
--vqav2_train_image_dir_path "/path/to/vqav2/train2014" \
--vqav2_train_annotations_json_path "/path/to/vqav2/v2_mscoco_train2014_annotations.json" \
--vqav2_train_questions_json_path "/path/to/vqav2/v2_OpenEnded_mscoco_train2014_questions.json" \
--vqav2_test_image_dir_path "/path/to/vqav2/val2014" \
--vqav2_test_annotations_json_path "/path/to/vqav2/v2_mscoco_val2014_annotations.json" \
--vqav2_test_questions_json_path "/path/to/vqav2/v2_OpenEnded_mscoco_val2014_questions.json" \
--flickr_image_dir_path "/path/to/flickr30k/flickr30k-images" \
--flickr_karpathy_json_path "/path/to/flickr30k/dataset_flickr30k.json" \
--flickr_annotations_json_path "/path/to/flickr30k/dataset_flickr30k_coco_style.json" \
--ok_vqa_train_image_dir_path "/path/to/okvqa/train2014" \
--ok_vqa_train_annotations_json_path "/path/to/okvqa/mscoco_train2014_annotations.json" \
--ok_vqa_train_questions_json_path "/path/to/okvqa/OpenEnded_mscoco_train2014_questions.json" \
--ok_vqa_test_image_dir_path "/path/to/okvqa/val2014" \
--ok_vqa_test_annotations_json_path "/path/to/okvqa/mscoco_val2014_annotations.json" \
--ok_vqa_test_questions_json_path "/path/to/okvqa/OpenEnded_mscoco_val2014_questions.json" \
--textvqa_image_dir_path "/path/to/textvqa/train_images/" \
--textvqa_train_questions_json_path "/path/to/textvqa/train_questions_vqa_format.json" \
--textvqa_train_annotations_json_path "/path/to/textvqa/train_annotations_vqa_format.json" \
--textvqa_test_questions_json_path "/path/to/textvqa/val_questions_vqa_format.json" \
--textvqa_test_annotations_json_path "/path/to/textvqa/val_annotations_vqa_format.json" \
--vizwiz_train_image_dir_path "/path/to/v7w/train" \
--vizwiz_test_image_dir_path "/path/to/v7w/val" \
--vizwiz_train_questions_json_path "/path/to/v7w/train_questions_vqa_format.json" \
--vizwiz_train_annotations_json_path "/path/to/v7w/train_annotations_vqa_format.json" \
--vizwiz_test_questions_json_path "/path/to/v7w/val_questions_vqa_format.json" \
--vizwiz_test_annotations_json_path "/path/to/v7w/val_annotations_vqa_format.json" \
--hateful_memes_image_dir_path "/path/to/hateful_memes/img" \
--hateful_memes_train_annotations_json_path "/path/to/hateful_memes/train.json" \
--hateful_memes_test_annotations_json_path "/path/to/hateful_memes/dev.json"
@@ -0,0 +1,34 @@
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --time=5-00:00:00
#SBATCH --job-name=openflamingo

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/train/train.py \
--lm_path meta-llama/Llama-2-13b \
--tokenizer_path meta-llama/Llama-2-13b \
--model_family flamingo \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--run_name "fsdp" \
--num_epochs 480 \
--warmup_steps 0 \
--mmc4_textsim_threshold 0.0 \
--laion_shards "/path/to/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \
--report_to_wandb
@@ -0,0 +1,40 @@
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --time=5-00:00:00
#SBATCH --job-name=openflamingo

<<com
To use FSDP, please make sure to use a PyTorch nightly build newer than 2.0.1!
com

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/train/train.py \
--lm_path meta-llama/Llama-2-13b \
--tokenizer_path meta-llama/Llama-2-13b \
--model_family flamingo \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--fsdp \
--fsdp_sharding_strategy hybrid \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--run_name "fsdp" \
--num_epochs 480 \
--warmup_steps 0 \
--mmc4_textsim_threshold 0.0 \
--laion_shards "/path/to/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \
--report_to_wandb
@@ -0,0 +1,56 @@
# OpenFlamingo: Modeling
We provide modules that can be mixed and matched to build several vision-language model architectures.

## What is a VLM?
A **vision-language model (VLM)** is a language model capable of processing a sequence of arbitrarily interleaved images/videos with text to output text.

![A VLM takes in a sequence of interleaved images/videos with text and outputs text.](../../docs/signature.png)

The forward signature of a VLM is as follows:
* `vision_x`: The batch of images/videos to process. This is a tensor of shape `(B, T_img, F, C, H, W)`, where `B` is the batch dimension, `T_img` collates the images/videos within one input sequence, `F` collates frames within a video, and `(C, H, W)` are the channel, height, and width dimensions respectively.
* `lang_x`: The batch of input_ids (text) to process. This is a tensor of shape `(B, T_txt)`, where `T_txt` is the number of text tokens within one input sequence.

To tell the model how the images/videos and text are interleaved within a sequence, `lang_x` should include `<image>` tokens ("media tokens") that mark where the images/videos are placed (see the figure and the sketch below).

![Illustration of what the inputs to a VLM look like.](../../docs/inputs.png)
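
As a rough, illustrative sketch (not code from this repository), a single-sequence batch containing two single-frame images could be assembled as follows. The tokenizer is the one used by the example evaluation script above; the `<image>`/`<|endofchunk|>` special tokens and the caption text are assumptions borrowed from the public OpenFlamingo README.

```python
import torch
from transformers import AutoTokenizer

# Tokenizer from the example evaluation script; the media tokens are added here
# for illustration (an OpenFlamingo checkpoint would ship with them already).
tokenizer = AutoTokenizer.from_pretrained("anas-awadalla/mpt-1b-redpajama-200b")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<image>", "<|endofchunk|>"]}
)

# vision_x: (B, T_img, F, C, H, W) = (1, 2, 1, 3, 224, 224)
# One sequence (B=1) with two images (T_img=2), each a single frame (F=1);
# random pixels stand in for preprocessed images.
vision_x = torch.randn(1, 2, 1, 3, 224, 224)

# lang_x: (B, T_txt). Each <image> token marks where an image sits in the text.
text = "<image>An image of two cats.<|endofchunk|><image>An image of"
lang_x = tokenizer(text, return_tensors="pt")["input_ids"]

print(vision_x.shape, lang_x.shape)  # (1, 2, 1, 3, 224, 224) and (1, T_txt)
```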

## VLM modeling with the open_flamingo repository
This repository provides modules for constructing various VLM architectures.

All models inherit from the `VLM` (vision-language model) class defined in `src/vlm.py`. As documented there, a VLM is defined by four component modules:
1. A **vision encoder** that extracts features from pixels (e.g. CLIP). This module should take in vision inputs of shape `(B, T_img, F, C, H, W)` and output features of shape `(B, T_img, F, v, d)`.
2. A **vision tokenizer** that converts features from the vision encoder into token-like embeddings (e.g. PerceiverResampler). This module should take in vision features of shape `(B, T_img, F, v, d)` and output tokens of shape `(B, T_img, n, d)`.
3. A **fusion method** that allows the language model to attend to these tokens, e.g. cross-attention (as done in [Flamingo](https://arxiv.org/abs/2204.14198)) or placing the tokens directly in the language model's input sequence (as done in [Kosmos](https://arxiv.org/abs/2306.14824)).
4. A **language model**.

This repository allows us to construct architectures by mixing and matching options for all four kinds of modules, as the sketch below illustrates.
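
As a sketch of how these pieces come together, the factory from the public OpenFlamingo README builds a Flamingo-style model roughly like this. The keyword names follow that README and may differ slightly on this branch (which, for instance, also exposes a `--model_family` option in the training scripts above); the encoder and checkpoint names match the example evaluation script.

```python
from open_flamingo import create_model_and_transforms

# Mix and match: an OpenCLIP vision encoder, a Perceiver vision tokenizer,
# cross-attention fusion, and an MPT language model (Flamingo-style).
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,  # how often cross-attention layers are interleaved
)
```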

### Supported vision encoders
All CLIP-style encoders from the [OpenCLIP](https://github.com/mlfoundations/open_clip) library are supported. This includes OpenAI's models.
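
For example, the ViT-L-14/openai pairing used by the evaluation script above can be loaded with the standard OpenCLIP API (a sketch; this is not this repository's factory):

```python
import open_clip

# Load a CLIP-style backbone; any (architecture, pretrained) pair reported by
# open_clip.list_pretrained() should work here.
clip_model, _, image_processor = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
vision_encoder = clip_model.visual  # the image tower (the part a VLM typically uses)
```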

### Supported vision tokenizers
* [Perceiver Resampler](https://arxiv.org/abs/2103.03206)
* [Q-former](https://arxiv.org/abs/2301.12597)
* Linear projection (see the sketch below)
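
As a sketch of the simplest option, a linear-projection tokenizer just maps the `(B, T_img, F, v, d)` vision features described above to `(B, T_img, n, d)` tokens. The module below is illustrative only, not the repository's implementation:

```python
import torch
from torch import nn


class LinearVisionTokenizer(nn.Module):
    """Illustrative linear-projection vision tokenizer (not the repo's module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (B, T_img, F, v, d) -> flatten frames and patches into n = F * v tokens.
        b, t_img, f, v, d = vision_features.shape
        tokens = vision_features.view(b, t_img, f * v, d)
        return self.proj(tokens)  # (B, T_img, n, d)


# Example: features for 2 images, 1 frame, 256 patches, feature width 1024.
feats = torch.randn(1, 2, 1, 256, 1024)
print(LinearVisionTokenizer(1024)(feats).shape)  # torch.Size([1, 2, 256, 1024])
```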

### Supported fusion methods
Models are further split into those that inherit from `VLMWithCrossAttention` (dense cross attention to fuse vision + language, Flamingo-style) vs. `VLMWithLanguageStream` (insert vision tokens into the language stream, Kosmos-style).

![A VLM with cross attention and a VLM with language stream represent two methods for fusing the vision and language inputs.](../../docs/xattn_langstream.png)

### Supported language models
All autoregressive language models from [Huggingface Transformers](https://huggingface.co/models) are supported.
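
For illustration, a language model backbone can be loaded in the usual Transformers way (a sketch; the checkpoint is the one used by the example scripts above, and MPT checkpoints need `trust_remote_code`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any autoregressive (causal) language model on the Hub can serve as the backbone.
lang_encoder = AutoModelForCausalLM.from_pretrained(
    "anas-awadalla/mpt-1b-redpajama-200b", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("anas-awadalla/mpt-1b-redpajama-200b")
```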

## Example architectures
Using these modules, the following architectures are implemented as examples.

|Model|Vision tokenizer|Fusion method|Trainable parameters|
|----|------------|------------|------------|
|[Flamingo](https://arxiv.org/abs/2204.14198)|Perceiver|Cross attention|Added language model embeddings, vision tokenizer|
|[Kosmos](https://arxiv.org/abs/2306.14824)|Perceiver|Language stream|Everything except the vision encoder|
|[BLIP](https://arxiv.org/abs/2301.12597)|Q-former|Language stream|Added language model embeddings, vision tokenizer|

We welcome contributions! If you'd like to add additional vision tokenizers, fusion methods, or model types, please open a PR.
@@ -5,9 +5,7 @@ inflection
pycocoevalcap
pycocotools
tqdm

black
mypy
pylint
pytest
requests
requests
@@ -3,3 +3,4 @@ braceexpand
webdataset
tqdm
wandb
deepspeed
@@ -1,7 +1,7 @@
einops
einops-exts
transformers>=4.28.1
torch==2.0.1
torch>=2.0.1
pillow
open_clip_torch>=2.16.0
sentencepiece==0.1.98