
Commit

Merge branch 'main' of https://github.com/rhymes-ai/Aria
Coobiw committed Oct 3, 2024
2 parents 507ed2c + b531137 commit 53c38aa
Showing 22 changed files with 3,292 additions and 29 deletions.
3 changes: 2 additions & 1 deletion .isort.cfg
@@ -1,2 +1,3 @@
[settings]
profile=black
profile=black
skip=datasets
136 changes: 115 additions & 21 deletions README.md
@@ -2,31 +2,34 @@

[😊 Hugging Face](#) |
[📄 Paper](#) |
[📰 Blog](#) |
[📚 Tutorial](#) |
[💻 Demo](#) |
[🌐 Website](#) |
[📚 Blog](#) |
[🌐 WebDemo](#)


## Introduction
Aria is the first open MoE model that is natively multimodal. It features SoTA performance on OCR and video understanding tasks, competitive performance on language and coding tasks, and fast inference speed with merely 3.9B activated parameters per token.

| Category | Benchmark | Aria | Pixtral 12B | Llama3 8B | Llama3-V 8B | GPT-4V | GPT-4o mini | GPT-4o | Gemini-1.5 Flash | Gemini-1.5 Pro |
Aria is a multimodal native MoE model. It features:
- State-of-the-art performance on various multimodal and language tasks, superior in video and document understanding;
- Long multimodal context window of 64K tokens;
- 3.9B activated parameters per token, enabling fast inference speed and low fine-tuning cost.

<!--
| Category | Benchmark | Aria | Pixtral 12B | Llama3.2 11B | Llama3-V 8B | GPT-4V | GPT-4o mini | GPT-4o | Gemini-1.5 Flash | Gemini-1.5 Pro |
|-------------------------------------|-------------------------|-------|-------------|-----------|-------------|--------|-------------|--------|------------------|----------------|
| **Knowledge(Multimodal)** | MMMU | 54.2 | 52.5 | - | 49.6 | 56.4 | 59.4 | 69.1 | 56.1 | 62.2 |
| **Math(Multimodal)** | MathVista | 64.1 | 58.0 | - | - | - | 54.7 | 63.8 | 58.4 | 63.9 |
| **Document** | DocQA | 92.9 | 90.7 | - | 84.4 | 88.4 | - | 92.8 | 89.9 | 93.1 |
| **Chart** | ChartQA | 86.1 | 81.8 | - | 78.7 | 78.4 | - | 85.7 | 85.4 | 87.2 |
| **Knowledge(Multimodal)** | MMMU | 54.9 | 52.5 | - | 49.6 | 56.4 | 59.4 | 69.1 | 56.1 | 62.2 |
| **Math(Multimodal)** | MathVista | 66.1 | 58.0 | - | - | - | 54.7 | 63.8 | 58.4 | 63.9 |
| **Document** | DocQA | 92.6 | 90.7 | - | 84.4 | 88.4 | - | 92.8 | 89.9 | 93.1 |
| **Chart** | ChartQA | 86.4 | 81.8 | - | 78.7 | 78.4 | - | 85.7 | 85.4 | 87.2 |
| **Scene Text** | TextVQA | 81.1 | - | - | 78.2 | 78.0 | - | - | 78.7 | 78.7 |
| **General Visual QA** | MMBench-1.1 | 81.1 | - | - | - | 79.8 | 76.0 | 82.2 | - | 73.9 |
| **Video Understanding** | LongVideoBench | 64.0 | 47.4 | - | - | 60.7 | 58.8 | 66.7 | 62.4 | 64.4 |
| **Knowledge(Language)** | MMLU (5-shot) | 73.6 | 69.2 | 69.4 | - | 86.4 | - | 89.1 | 78.9 | 85.9 |
| **Math(Language)** | MATH | 50.0 | 48.1 | 51.9 | - | - | 70.2 | 76.6 | - | - |
| **General Visual QA** | MMBench-1.1 | 80.3 | - | - | - | 79.8 | 76.0 | 82.2 | - | 73.9 |
| **Video Understanding** | LongVideoBench | 65.3 | 47.4 | - | - | 60.7 | 58.8 | 66.7 | 62.4 | 64.4 |
| **Knowledge(Language)** | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | - | 86.4 | - | 89.1 | 78.9 | 85.9 |
| **Math(Language)** | MATH | 50.8 | 48.1 | 51.9 | - | - | 70.2 | 76.6 | - | - |
| **Reasoning(Language)** | ARC Challenge | 91.0 | - | 83.4 | - | - | 96.4 | 96.7 | - | - |
| **Coding** | HumanEval | 75.6 | 72.0 | 72.6 | - | 67.0 | 87.2 | 90.2 | 74.3 | 84.1 |

| **Coding** | HumanEval | 73.2 | 72.0 | 72.6 | - | 67.0 | 87.2 | 90.2 | 74.3 | 84.1 |
-->

## News
- 2024.10.10: We release Aria!

## Quick Start

@@ -42,9 +45,9 @@ pip install flash-attn --no-build-isolation

### Inference

The total number of parameters in Aria is about 25B; it can be loaded on one A100 (80GB) GPU with bfloat16 precision.
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.

Performing inference is simple with the Hugging Face ecosystem:
Here is a code snippet showing how to use Aria with Hugging Face Transformers.

```python
import requests
# … (the body of this snippet is collapsed in the diff view) …
print(result)
```
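Because the diff view collapses the body of the snippet, here is a minimal, hedged sketch of what typical usage looks like with Transformers. The checkpoint name, chat-message format, and generation arguments are assumptions based on common `trust_remote_code` multimodal models rather than the collapsed original; treat it as an illustration, not the repository's exact code.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any reachable image URL works; this one is only a placeholder.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing an image and a question (format assumed).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)
print(result)
```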

We offer additional inference methods, such as utilizing [VLLM](https://github.com/vllm-project/vllm) for enhanced performance. For comprehensive details, please refer to [docs/inference.md](docs/inference.md).

### Fine-tuning
### Cookbook
Check out these [inference examples](https://github.com/rhymes-ai/Aria/tree/main/inference/notebooks), which demonstrate how to use Aria for applications such as chart understanding, PDF reading, and video understanding.

## Fine-tuning

We offer both LoRA fine-tuning and full parameter tuning, using various dataset types:
- Single-image datasets
- Multi-image datasets
- Video datasets

For a quick try, visit the [examples](./examples) folder and choose one of the fine-tuning examples.

### Prepare dataset
Please refer to [custom_dataset.md](custom_dataset.md) for how to prepare your dataset.

### Fine-tune with LoRA

After preparing your dataset, follow these steps to fine-tune Aria using LoRA:

1. Open the configuration file `recipes/config_lora.yaml`. Locate the `dataset_mixer` section and update it with your dataset paths:

```yaml
dataset_mixer:
"path/to/dataset1": 1
"path/to/dataset2": 0.5
"path/to/dataset3": 2
```
> **Note on dataset mixing:** Aria supports combining multiple datasets with different sampling rates. In the example above:
> - `dataset1` will be used entirely (weight 1)
> - `dataset2` will use 50% of its data (weight 0.5)
> - `dataset3` will be used twice (weight 2)

2. Start the fine-tuning process by running the following command on one A100 (80GB) or H100 (80GB) GPU:

```bash
python aria/train.py --config recipes/config_lora.yaml
```

3. For multi-GPU training, use the [`accelerate`](https://huggingface.co/docs/accelerate/index) library:

```bash
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_lora.yaml --num_processes [number_of_gpus]
```

- Choose from pre-configured accelerate settings in `recipes/accelerate_configs/`
- Adjust the `--num_processes` argument to match your available GPUs
- For custom configurations, refer to the [accelerate documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)

4. Inference with the fine-tuned model:

See [inference with LoRA support](inference.md#2-inference-with-lora-support) for how to run inference with the fine-tuned model.
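If you prefer to attach the adapter directly in Python, the snippet below is a minimal sketch using the `peft` library. The base checkpoint name and the adapter directory are assumptions (use your actual training `output_dir`); `docs/inference.md` remains the authoritative reference.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

base_id = "rhymes-ai/Aria"            # assumed base checkpoint
adapter_dir = "path/to/lora_output"   # hypothetical training output_dir

base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

# Attach the fine-tuned LoRA weights; merging folds them into the base weights
# so generation runs at the speed of the original model.
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()
```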

### Full parameter fine-tuning

The workflow is the same as for LoRA fine-tuning, except that it uses the configuration file `recipes/config_full.yaml`.

Full parameter tuning consumes more GPU memory, so multiple GPUs are required. The following command has been tested on 8 A100 (80GB) GPUs.

```bash
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_full.yaml
```

If you encounter out-of-memory errors, try reducing the `per_device_train_batch_size` in the config file. Adjust the `gradient_accumulation_steps` accordingly to maintain the effective training batch size.

```yaml
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
```
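As a worked example of keeping the effective batch size constant (the GPU count below is an assumption; substitute your own):

```python
# effective batch = per-device batch x gradient accumulation steps x number of GPUs
num_gpus = 8                       # assumed; match your actual setup
original = 8 * 2 * num_gpus        # the settings shown above
reduced = 4 * 4 * num_gpus         # halve the per-device batch, double the accumulation
assert original == reduced == 128  # effective training batch size is unchanged
```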

Memory consumption varies across datasets; multi-image and video datasets generally require more memory. Adjust the `deepspeed_config` parameters to reduce memory consumption, for example by setting `zero_stage` to 3 and offloading parameters and optimizer states to the CPU.

```yaml
deepspeed_config:
gradient_accumulation_steps: auto
gradient_clipping: auto
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero_stage: 3
```

## Citation
If you find our work helpful, please consider citing.
```
@article{aria,
title={},
author={},
year={2024},
journal={}
}
```
Aria supports fine-tuning through methods like LoRA (Low-Rank Adaptation) and full parameter tuning. For detailed instructions and code samples on how to fine-tune Aria, please refer to [docs/finetune.md](docs/finetune.md).
3 changes: 1 addition & 2 deletions aria/data.py
@@ -22,9 +22,8 @@
from typing import Dict, Iterable, List

import torch
from datasets.features import Features, Sequence, Value

from datasets import DatasetDict, concatenate_datasets, load_dataset
from datasets.features import Features, Sequence, Value


def apply_chat_template_and_tokenize(
26 changes: 24 additions & 2 deletions aria/model/vision_processor.py
@@ -210,14 +210,25 @@ def __call__(
return_tensors: Optional[Union[str, TensorType]] = "pt",
split_image: Optional[bool] = False,
split_ratio: Optional[List[List[int]]] = [
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[1, 7],
[1, 8],
[2, 4],
[2, 3],
[2, 2],
[2, 1],
[3, 1],
[3, 2],
[4, 1],
[4, 2],
[5, 1],
[6, 1],
[7, 1],
[8, 1],
],
):
"""
@@ -279,14 +290,25 @@ def preprocess(
return_tensors: Optional[Union[str, TensorType]] = None,
split_image: Optional[bool] = False,
split_ratio: Optional[List[List[int]]] = [
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[1, 7],
[1, 8],
[2, 4],
[2, 3],
[2, 2],
[2, 1],
[3, 1],
[3, 2],
[4, 1],
[4, 2],
[5, 1],
[6, 1],
[7, 1],
[8, 1],
],
):
return self.__call__(
1 change: 1 addition & 0 deletions aria/train.py
@@ -224,6 +224,7 @@ def main():
)

trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
processor.save_pretrained(training_args.output_dir)

trainer.save_model(training_args.output_dir)

6 changes: 3 additions & 3 deletions examples/README.md
@@ -1,17 +1,17 @@
***This document provides examples of fine-tuning Aria on three dataset types: single-image data, multi-image data, and video data.***

# Single-Image SFT
# Fine-tune on single-image dataset
We use a 30k subset of the [RefCOCO dataset](https://arxiv.org/pdf/1608.00272) as an example.
RefCOCO is a visual grounding task: given an image and a description of the reference object, the model is expected to output the corresponding bounding box. For a given bounding box, we normalize its coordinates to `[0,1000)` and transform it into "(x1,y1), (x2,y2)". Please refer to [RefCOCO_Example](./refcoco/README.md) for more details!
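To make the normalization concrete, here is a hedged sketch of mapping a pixel-space box onto the `[0,1000)` grid and formatting it as "(x1,y1), (x2,y2)"; the exact rounding used in the example dataset may differ.

```python
def normalize_box(box, width, height):
    """Map a pixel-space (x1, y1, x2, y2) box onto the [0, 1000) grid."""
    x1, y1, x2, y2 = box
    nx1, nx2 = int(x1 / width * 1000), int(x2 / width * 1000)
    ny1, ny2 = int(y1 / height * 1000), int(y2 / height * 1000)
    return f"({nx1},{ny1}), ({nx2},{ny2})"

# e.g. a 640x480 image with box (64, 48, 320, 240) -> "(100,100), (500,500)"
print(normalize_box((64, 48, 320, 240), 640, 480))
```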



# Multi-Image SFT
# Fine-tune on multi-image dataset
We use the [NLVR2 dataset](https://arxiv.org/abs/1811.00491) as an example.
NLVR2 (Natural Language for Visual Reasoning) is a task where, given two images, the model must determine whether a claim about them is true by answering yes or no. Please refer to [NLVR2_Example](./nlvr2/README.md) for details!


# Video SFT
# Fine-tune on video dataset
We use the [NextQA dataset](https://arxiv.org/abs/2105.08276) as an example.
NextQA requires the model to select an answer from several options based on the video input and the question; the model is expected to output the correct option's letter. Please refer to [NextQA_Example](./nextqa/README.md) for details!

580 changes: 580 additions & 0 deletions inference/notebooks/01_single_image_understanding.ipynb

Large diffs are not rendered by default.

