Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. The model is attractive because of its small size (0.2B and 0.7B parameters) and its strong performance across a variety of computer vision and vision-language tasks.
Florence-2 supports captioning, object detection, OCR, and more out of the box. However, your task might not be among them, or you might need tighter control over the model's output for your use case. That's when you need to fine-tune the model.
This repository contains code to fine-tune Florence-2-large-ft on DocVQA. If you want a quick look at the data, visit https://huggingface.co/datasets/zhangfaen/DocumentVQA/
Note: Florence-2-large-ft is already an SFT (supervised fine-tuned) version of Florence-2-large (https://huggingface.co/microsoft/Florence-2-large). In this repo, we continue fine-tuning Florence-2-large-ft to give it new skills.
To get started, run the following commands:
```bash
conda create -n florence2-finetuning python=3.11 -y
conda activate florence2-finetuning
git clone https://github.com/zhangfaen/finetune-Florence-2-large-ft
cd finetune-Florence-2-large-ft
pip install -r requirements.txt
```
If you encounter issues with flash-attn, you can fix it with the following command:
```bash
pip install -U flash-attn --no-build-isolation
```
For this experiment, we use the DocumentVQA dataset.
```python
from datasets import load_dataset

# zhangfaen/DocumentVQA is a snapshot of 'HuggingFaceM4/DocumentVQA' on Hugging Face;
# we keep our own copy in case the original is ever deleted or changed.
data = load_dataset('zhangfaen/DocumentVQA')
print(data)
```
Output:
```text
DatasetDict({
    train: Dataset({
        features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
        num_rows: 39463
    })
    validation: Dataset({
        features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
        num_rows: 5349
    })
    test: Dataset({
        features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
        num_rows: 5188
    })
})
```
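To get a feel for the data before training, you can inspect a single example. Below is a minimal sketch using the fields listed above (the output filename is just an example):

```python
from datasets import load_dataset

data = load_dataset('zhangfaen/DocumentVQA')

sample = data['train'][0]
print(sample['question'])                # a natural-language question about the page
print(sample['answers'])                 # a list of acceptable answer strings
sample['image'].save('sample_page.png')  # the page itself, decoded as a PIL image
```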
We have placed the Florence-2-large-ft modeling files under `model/` in this repo. We still need the Florence-2-large-ft model checkpoint; run the bash commands below to get it:
```bash
# in the root dir of this repo
cd model
wget https://huggingface.co/zhangfaen/Florence-2-large-ft-checkpoint/resolve/main/pytorch_model.bin
mv pytorch_model.bin pytorch_model.by.microsoft.bin
```
Then we can load the Florence-2-large-ft checkpoint with the following Python code:
```python
import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# variant="by.microsoft" makes from_pretrained load ./model/pytorch_model.by.microsoft.bin
model = AutoModelForCausalLM.from_pretrained("./model", trust_remote_code=True, variant="by.microsoft").to(device)
```
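To sanity-check the loaded checkpoint, you can run a single DocVQA query end to end. This is a minimal sketch, assuming the processor files also live under `./model` and that `"<DocVQA>"` is the task prompt used for fine-tuning; the image file and question are placeholders:

```python
from PIL import Image
from transformers import AutoProcessor

# Assumption: processor/tokenizer configs are available under ./model as well.
processor = AutoProcessor.from_pretrained("./model", trust_remote_code=True)

image = Image.open("sample_page.png").convert("RGB")  # e.g. the page saved earlier
prompt = "<DocVQA>" + "What is the date on this document?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
    num_beams=3,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```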
To train with just one GPU, you can simply run:
```bash
python train.py
```
It will automatically train on the DocumentVQA dataset.
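Under the hood, fine-tuning Florence-2 for VQA is a standard seq2seq loop: each question, prefixed with a task token, becomes the input text, the gold answer becomes the label, and the model's built-in loss does the rest. The sketch below is a condensed, hypothetical version of such a loop, not an excerpt from train.py; `collate_fn` and the hyperparameters are illustrative, and it reuses `processor`, `model`, `device`, and `data` from the snippets above:

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Prefix each question with the task token, process text and images together,
    # and tokenize the first gold answer as the target sequence.
    questions = ["<DocVQA>" + ex["question"] for ex in batch]
    answers = [ex["answers"][0] for ex in batch]
    images = [ex["image"].convert("RGB") for ex in batch]
    inputs = processor(text=questions, images=images, return_tensors="pt", padding=True)
    labels = processor.tokenizer(answers, return_tensors="pt", padding=True).input_ids
    return inputs, labels

loader = DataLoader(data["train"], batch_size=4, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    loss = model(input_ids=inputs["input_ids"],
                 pixel_values=inputs["pixel_values"],
                 labels=labels).loss  # the model computes the LM loss itself
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```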
The distributed_train.py script allows you to train the Florence-2 model using distributed data parallelism, which can significantly speed up training when multiple GPUs are available. To use it, run:
```bash
python distributed_train.py
```
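distributed_train.py takes care of process-group setup itself, so the command above is all you need on a multi-GPU machine. For reference, the core DDP recipe generally looks like the sketch below; this is a hedged outline assuming one process per GPU with a `LOCAL_RANK` environment variable (the actual script may spawn and configure its workers differently), reusing `model`, `data`, and `collate_fn` from the snippets above:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")      # join the default process group
local_rank = int(os.environ["LOCAL_RANK"])   # one process per GPU
torch.cuda.set_device(local_rank)

model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])  # average gradients across ranks

sampler = DistributedSampler(data["train"])  # give each rank its own data shard
loader = DataLoader(data["train"], batch_size=4, sampler=sampler, collate_fn=collate_fn)
```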
For interactive experimentation with the model, check out playground.ipynb.
This repo is built on top of:
- https://huggingface.co/datasets/HuggingFaceM4/DocumentVQA
- https://huggingface.co/microsoft/Florence-2-large
- https://github.com/andimarafioti/florence2-finetuning
Many thanks to them for the great model/data/code!