
Commit

Merge branch 'main' of https://github.com/rhymes-ai/Aria
Coobiw committed Oct 3, 2024
2 parents 507ed2c + b531137 commit 53c38aa
Showing 22 changed files with 3,292 additions and 29 deletions.
3 changes: 2 additions & 1 deletion .isort.cfg
@@ -1,2 +1,3 @@
[settings]
profile=black
profile=black
skip=datasets
136 changes: 115 additions & 21 deletions README.md
@@ -2,31 +2,34 @@

[😊 Hugging Face](#) |
[📄 Paper](#) |
[📰 Blog](#) |
[📚 Tutorial](#) |
[💻 Demo](#) |
[🌐 Website](#) |
[📚 Blog](#) |
[🌐 WebDemo](#)


## Introduction
Aria is the first open MoE model that is natively multimodal. It features SoTA performance on OCR and video understanding tasks, competitive performance on language and coding tasks, and fast inference speed with merely 3.9B activated parameters per token.

| Category | Benchmark | Aria | Pixtral 12B | Llama3 8B | Llama3-V 8B | GPT-4V | GPT-4o mini | GPT-4o | Gemini-1.5 Flash | Gemini-1.5 Pro |
Aria is a multimodal native MoE model. It features:
- State-of-the-art performance on various multimodal and language tasks, superior in video and document understanding;
- Long multimodal context window of 64K tokens;
- 3.9B activated parameters per token, enabling fast inference speed and low fine-tuning cost.

<!--
| Category | Benchmark | Aria | Pixtral 12B | Llama3.2 11B | Llama3-V 8B | GPT-4V | GPT-4o mini | GPT-4o | Gemini-1.5 Flash | Gemini-1.5 Pro |
|-------------------------------------|-------------------------|-------|-------------|-----------|-------------|--------|-------------|--------|------------------|----------------|
| **Knowledge(Multimodal)** | MMMU | 54.2 | 52.5 | - | 49.6 | 56.4 | 59.4 | 69.1 | 56.1 | 62.2 |
| **Math(Multimodal)** | MathVista | 64.1 | 58.0 | - | - | - | 54.7 | 63.8 | 58.4 | 63.9 |
| **Document** | DocQA | 92.9 | 90.7 | - | 84.4 | 88.4 | - | 92.8 | 89.9 | 93.1 |
| **Chart** | ChartQA | 86.1 | 81.8 | - | 78.7 | 78.4 | - | 85.7 | 85.4 | 87.2 |
| **Knowledge(Multimodal)** | MMMU | 54.9 | 52.5 | - | 49.6 | 56.4 | 59.4 | 69.1 | 56.1 | 62.2 |
| **Math(Multimodal)** | MathVista | 66.1 | 58.0 | - | - | - | 54.7 | 63.8 | 58.4 | 63.9 |
| **Document** | DocQA | 92.6 | 90.7 | - | 84.4 | 88.4 | - | 92.8 | 89.9 | 93.1 |
| **Chart** | ChartQA | 86.4 | 81.8 | - | 78.7 | 78.4 | - | 85.7 | 85.4 | 87.2 |
| **Scene Text** | TextVQA | 81.1 | - | - | 78.2 | 78.0 | - | - | 78.7 | 78.7 |
| **General Visual QA** | MMBench-1.1 | 81.1 | - | - | - | 79.8 | 76.0 | 82.2 | - | 73.9 |
| **Video Understanding** | LongVideoBench | 64.0 | 47.4 | - | - | 60.7 | 58.8 | 66.7 | 62.4 | 64.4 |
| **Knowledge(Language)** | MMLU (5-shot) | 73.6 | 69.2 | 69.4 | - | 86.4 | - | 89.1 | 78.9 | 85.9 |
| **Math(Language)** | MATH | 50.0 | 48.1 | 51.9 | - | - | 70.2 | 76.6 | - | - |
| **General Visual QA** | MMBench-1.1 | 80.3 | - | - | - | 79.8 | 76.0 | 82.2 | - | 73.9 |
| **Video Understanding** | LongVideoBench | 65.3 | 47.4 | - | - | 60.7 | 58.8 | 66.7 | 62.4 | 64.4 |
| **Knowledge(Language)** | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | - | 86.4 | - | 89.1 | 78.9 | 85.9 |
| **Math(Language)** | MATH | 50.8 | 48.1 | 51.9 | - | - | 70.2 | 76.6 | - | - |
| **Reasoning(Language)** | ARC Challenge | 91.0 | - | 83.4 | - | - | 96.4 | 96.7 | - | - |
| **Coding** | HumanEval | 75.6 | 72.0 | 72.6 | - | 67.0 | 87.2 | 90.2 | 74.3 | 84.1 |

| **Coding** | HumanEval | 73.2 | 72.0 | 72.6 | - | 67.0 | 87.2 | 90.2 | 74.3 | 84.1 |
-->

## News
- 2024.10.10: We release Aria!

## Quick Start

@@ -42,9 +45,9 @@ pip install flash-attn --no-build-isolation

### Inference

The total number of parameters in Aria is about 25B; it can be loaded on one A100 (80GB) GPU with bfloat16 precision.
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.

Performing inference is simple with the Hugging Face ecosystem:
Here is a code snippet showing how to use Aria with Hugging Face Transformers.

```python
import requests
# … (the body of this snippet is collapsed in the diff view) …
print(result)
```
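Because the diff view collapses the body of the snippet, here is a minimal, hedged sketch of what typical usage looks like with Transformers. The checkpoint name, chat-message format, and generation arguments are assumptions based on common `trust_remote_code` multimodal models rather than the collapsed original; treat it as an illustration, not the repository's exact code.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any reachable image URL works; this one is only a placeholder.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing an image and a question (format assumed).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)
print(result)
```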

We offer additional inference methods, such as utilizing [VLLM](https://github.com/vllm-project/vllm) for enhanced performance. For comprehensive details, please refer to [docs/inference.md](docs/inference.md).

### Fine-tuning
### Cookbook
Check out these [inference examples](https://github.com/rhymes-ai/Aria/tree/main/inference/notebooks), which demonstrate how to use Aria for applications such as chart understanding, PDF reading, and video understanding.

## Fine-tuning

We offer both LoRA fine-tuning and full parameter tuning, using various dataset types:
- Single-image datasets
- Multi-image datasets
- Video datasets

For a quick try, visit the [examples](./examples) folder and choose one of the fine-tuning examples.

### Prepare dataset
Please refer to [custom_dataset.md](custom_dataset.md) for how to prepare your dataset.

### Fine-tune with LoRA

After preparing your dataset, follow these steps to fine-tune Aria using LoRA:

1. Open the configuration file `recipes/config_lora.yaml`. Locate the `dataset_mixer` section and update it with your dataset paths:

```yaml
dataset_mixer:
"path/to/dataset1": 1
"path/to/dataset2": 0.5
"path/to/dataset3": 2
```
> **Note on dataset mixing:** Aria supports combining multiple datasets with different sampling rates. In the example above:
> - `dataset1` will be used entirely (weight 1)
> - `dataset2` will use 50% of its data (weight 0.5)
> - `dataset3` will be used twice (weight 2)

2. Start the fine-tuning process by running the following command on one A100 (80GB) or H100 (80GB) GPU:

```bash
python aria/train.py --config recipes/config_lora.yaml
```

3. For multi-GPU training, use the [`accelerate`](https://huggingface.co/docs/accelerate/index) library:

```bash
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_lora.yaml --num_processes [number_of_gpus]
```

- Choose from pre-configured accelerate settings in `recipes/accelerate_configs/`
- Adjust the `--num_processes` argument to match your available GPUs
- For custom configurations, refer to the [accelerate documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)

4. Inference with the fine-tuned model:

See [inference with LoRA support](inference.md#2-inference-with-lora-support) for how to run inference with the fine-tuned model.
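If you prefer to attach the adapter directly in Python, the snippet below is a minimal sketch using the `peft` library. The base checkpoint name and the adapter directory are assumptions (use your actual training `output_dir`); `docs/inference.md` remains the authoritative reference.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

base_id = "rhymes-ai/Aria"            # assumed base checkpoint
adapter_dir = "path/to/lora_output"   # hypothetical training output_dir

base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

# Attach the fine-tuned LoRA weights; merging folds them into the base weights
# so generation runs at the speed of the original model.
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()
```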

### Full parameter fine-tuning

The workflow is the same as for LoRA fine-tuning, except that it uses the configuration file `recipes/config_full.yaml`.

Full parameter tuning consumes more GPU memory, so multiple GPUs are required. The following command has been tested on 8 A100 (80GB) GPUs.

```bash
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_full.yaml
```

If you encounter out-of-memory errors, try reducing the `per_device_train_batch_size` in the config file. Adjust the `gradient_accumulation_steps` accordingly to maintain the effective training batch size.

```yaml
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
```
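As a worked example of keeping the effective batch size constant (the GPU count below is an assumption; substitute your own):

```python
# effective batch = per-device batch x gradient accumulation steps x number of GPUs
num_gpus = 8                       # assumed; match your actual setup
original = 8 * 2 * num_gpus        # the settings shown above
reduced = 4 * 4 * num_gpus         # halve the per-device batch, double the accumulation
assert original == reduced == 128  # effective training batch size is unchanged
```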

Memory consumption varies across datasets; multi-image and video datasets generally require more memory. Adjust the `deepspeed_config` parameters to reduce memory consumption, for example by setting `zero_stage` to 3 and offloading parameters and optimizer states to the CPU.

```yaml
deepspeed_config:
gradient_accumulation_steps: auto
gradient_clipping: auto
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero_stage: 3
```

## Citation
If you find our work helpful, please consider citing.
```
@article{aria,
title={},
author={},
year={2024},
journal={}
}
```
Aria supports fine-tuning through methods like LoRA (Low-Rank Adaptation) and full parameter tuning. For detailed instructions and code samples on how to fine-tune Aria, please refer to [docs/finetune.md](docs/finetune.md).
3 changes: 1 addition & 2 deletions aria/data.py
@@ -22,9 +22,8 @@
from typing import Dict, Iterable, List

import torch
from datasets.features import Features, Sequence, Value

from datasets import DatasetDict, concatenate_datasets, load_dataset
from datasets.features import Features, Sequence, Value


def apply_chat_template_and_tokenize(
26 changes: 24 additions & 2 deletions aria/model/vision_processor.py
@@ -210,14 +210,25 @@ def __call__(
return_tensors: Optional[Union[str, TensorType]] = "pt",
split_image: Optional[bool] = False,
split_ratio: Optional[List[List[int]]] = [
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[1, 7],
[1, 8],
[2, 4],
[2, 3],
[2, 2],
[2, 1],
[3, 1],
[3, 2],
[4, 1],
[4, 2],
[5, 1],
[6, 1],
[7, 1],
[8, 1],
],
):
"""
@@ -279,14 +290,25 @@ def preprocess(
return_tensors: Optional[Union[str, TensorType]] = None,
split_image: Optional[bool] = False,
split_ratio: Optional[List[List[int]]] = [
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[1, 7],
[1, 8],
[2, 4],
[2, 3],
[2, 2],
[2, 1],
[3, 1],
[3, 2],
[4, 1],
[4, 2],
[5, 1],
[6, 1],
[7, 1],
[8, 1],
],
):
return self.__call__(
1 change: 1 addition & 0 deletions aria/train.py
@@ -224,6 +224,7 @@ def main():
)

trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
processor.save_pretrained(training_args.output_dir)

trainer.save_model(training_args.output_dir)

6 changes: 3 additions & 3 deletions examples/README.md
@@ -1,17 +1,17 @@
***This document provides examples of fine-tuning Aria on three dataset types: single-image data, multi-image data, and video data.***

# Single-Image SFT
# Fine-tune on single-image dataset
We use a 30k subset of the [RefCOCO dataset](https://arxiv.org/pdf/1608.00272) as an example.
RefCOCO is a visual grounding task: given an image and a description of the reference object, the model is expected to output the corresponding bounding box. For a given bounding box, we normalize its coordinates to `[0,1000)` and transform it into "(x1,y1), (x2,y2)". Please refer to [RefCOCO_Example](./refcoco/README.md) for more details!
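To make the normalization concrete, here is a hedged sketch of mapping a pixel-space box onto the `[0,1000)` grid and formatting it as "(x1,y1), (x2,y2)"; the exact rounding used in the example dataset may differ.

```python
def normalize_box(box, width, height):
    """Map a pixel-space (x1, y1, x2, y2) box onto the [0, 1000) grid."""
    x1, y1, x2, y2 = box
    nx1, nx2 = int(x1 / width * 1000), int(x2 / width * 1000)
    ny1, ny2 = int(y1 / height * 1000), int(y2 / height * 1000)
    return f"({nx1},{ny1}), ({nx2},{ny2})"

# e.g. a 640x480 image with box (64, 48, 320, 240) -> "(100,100), (500,500)"
print(normalize_box((64, 48, 320, 240), 640, 480))
```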



# Multi-Image SFT
# Fine-tune on multi-image dataset
We use the [NLVR2 dataset](https://arxiv.org/abs/1811.00491) as an example.
NLVR2 (Natural Language for Visual Reasoning) is a task where, given two images, the model must determine whether a claim about them is true by answering yes or no. Please refer to [NLVR2_Example](./nlvr2/README.md) for details!


# Video SFT
# Fine-tune on video dataset
We use the [NextQA dataset](https://arxiv.org/abs/2105.08276) as an example.
NextQA requires the model to select an answer from several options based on the video input and the question; the model is expected to output the correct option's letter. Please refer to [NextQA_Example](./nextqa/README.md) for details!

580 changes: 580 additions & 0 deletions inference/notebooks/01_single_image_understanding.ipynb

Large diffs are not rendered by default.

