This is a repository for OLAPH: Improving Factuality in Biomedical Long-form Question Answering by Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, and Jaewoo Kang.
MedLFQA | Self-BioRAG (OLAPH) | BioMistral (OLAPH) | Mistral (OLAPH) | Summary | Paper
- MedLFQA is a reconstructed format of long-form question-answering (LFQA) benchmark datasets in the biomedical domain, designed to facilitate automatic evaluation of factuality (e.g., hallucination & comprehensiveness).
- OLAPH is a framework that reduces hallucination and preserves crucial claims by using automatic evaluation to select the best response among sampled predictions and training the model to answer questions in the preferred manner.
[June 28, 2024] We received our first citation today! It is a conformal prediction paper that uses our MedLFQA dataset. Wonderful work from Stanford!
[June 08, 2024] We provide A/B test results from 3 medical experts using the 9 MedPaLM criteria in Human-Evaluation.
[May 31, 2024] Introducing two videos: OLAPH (Korean) & OLAPH (English) on YouTube!
[May 30, 2024] Updated the training and inference code for Gemma-7b, Llama-3-8b, and Llama-3-8b-Instruct.
[May 23, 2024] OLAPH has been released.
- Installation
- Quick Usage
- Datasets
- Training
- Inference
- Iterative Learning
- FactScore
- FAQ
- Citation
- Contact Information
Please create a conda environment by running the commands below. Note that we currently use two different environments for training and inference; we will integrate everything into a single environment in the future.
First, you have to install the alignment-handbook.
We use PyTorch v2.1.2, which is important for reproducibility!
Since this depends on your environment, please install a compatible version of PyTorch from here.
Then, we install the remaining package dependencies as follows:
conda create -n olaph python=3.10
conda activate olaph
cd ./alignment-handbook/
python -m pip install .
This could install the most recent version of torch. However, we use CUDA 11.8 in our experimental settings, so we recommend running the command below to reproduce our results:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
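After installation, a quick sanity check can confirm that the pinned versions took effect (a minimal sketch; the expected values come from the versions noted above):

```python
# Sanity check for the pinned PyTorch/CUDA versions.
import torch

print(torch.__version__)          # expect 2.1.2
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # expect True on a GPU machine
```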
You will need Flash Attention 2 installed:
python -m pip install flash-attn==2.5.6 --no-build-isolation
We need to install further requirements for automatic evaluation, as well as vLLM for boosting inference speed:
pip install -r requirements.txt --no-build-isolation
pip install git+https://github.com/lucadiliello/bleurt-pytorch.git
Also, you will need to log into your Hugging Face account (make sure your account token has WRITE access). Then, install Git LFS to upload your models as follows:
huggingface-cli login
sudo apt-get install git-lfs
You can download 7B models trained with our OLAPH framework from the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "dmis-lab/self-biorag-7b-olaph" # ["mistralai/Mistral-7B-v0.1", "BioMistral/BioMistral-7B", "meta-llama/Llama-2-7b-hf", "dmis-lab/selfbiorag_7b", "epfl-llm/meditron-7b"]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
query = "Can a red eye be serious?"
input_ids = tokenizer.encode(query, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=512, no_repeat_ngram_size=2, do_sample=False)  # greedy decoding; output stays on the model's device
response = tokenizer.decode(output[0], skip_special_tokens=True).strip()
print("Model prediction:", response)
Yes, a Red Eye can be a sign of a serious condition or a complication of another underlying illness or injury. hopefully, this short guide has helped you understand the different causes of red eyes and how to properly identify and treat them. If you ever have persistent or severe redness, it is important to seek medical attention from a healthcare professional.
MedLFQA is a reconstructed format of long-form question-answering (LFQA) benchmark datasets in the biomedical domain to facilitate automatic evaluation. We construct MedLFQA from four biomedical LFQA benchmark datasets: LiveQA, MedicationQA, HealthSearchQA, and K-QA. Each MedLFQA instance comprises four components: a question (Q), a long-form answer (A), Must Have statements (MH), and Nice to Have statements (NH). We provide the reconstructed datasets for automatic evaluation of long-form generated responses.
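For illustration, a single MedLFQA instance might look like the sketch below. The field names are illustrative assumptions, not the exact schema of the released files:

```python
# A hypothetical MedLFQA instance; field names are illustrative only.
instance = {
    "Question": "Can a red eye be serious?",                   # Q
    "Free_form_answer": "A red eye is often benign, but ...",  # A
    "Must_Have": [  # MH: claims a correct answer must contain
        "A red eye can indicate a serious underlying condition.",
        "Persistent or severe redness warrants medical attention.",
    ],
    "Nice_to_Have": [  # NH: claims that improve an answer
        "Common benign causes include dryness and minor irritation.",
    ],
}
```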
- Sampling Predictions (Including Automatic Evaluation)
Note that you must generate predictions for all MedLFQA datasets before proceeding to SFT and DPO training (a loop sketch follows the commands below).
# For the first round of sampling predictions
export DATA_NAME=live_qa
export HUGGINGFACE_MODEL_DIR=dmis-lab/selfbiorag_7b
CUDA_VISIBLE_DEVICES=0 python pdata_collection.py \
--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \
--eval_data ${DATA_NAME}

# Sampling predictions during iterative learning (i.e., after SFT or DPO)
export HUGGINGFACE_MODEL_DIR=your_trained_model
CUDA_VISIBLE_DEVICES=0 python pdata_collection.py \
--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \
--eval_data ${DATA_NAME}

# Make the supervised fine-tuning dataset
export WODATA_NAME=kqa_golden  # must be different from DATA_NAME
python pred_to_sft.py \
--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \
--wodata_name ${WODATA_NAME}
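Since every MedLFQA dataset needs predictions, a small driver loop may save time. This is a convenience sketch, not part of the repository; the dataset identifiers other than live_qa and kqa_golden are assumptions, so check the released data files for the exact names:

```python
# Convenience sketch: run pdata_collection.py over every MedLFQA dataset.
# Dataset identifiers besides live_qa and kqa_golden are assumptions.
import subprocess

DATASETS = ["live_qa", "medication_qa", "healthsearch_qa", "kqa_golden", "kqa_silver"]
MODEL = "dmis-lab/selfbiorag_7b"

for name in DATASETS:
    subprocess.run(
        ["python", "pdata_collection.py",
         "--model_name_or_path", MODEL,
         "--eval_data", name],
        check=True,  # stop early if any run fails
    )
```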
- Supervised Fine-Tuning (SFT)
After obtaining sampled predictions from the previous step, we use SFT so the model learns the question-answering task. Rather than training on human-annotated answers or pseudo-optimal responses generated by GPT-4, we use a self-generated response as the labeled answer to remove the dependency on annotated resources. We use Self-BioRAG as a representative 7B model. If you want to use another model with a different configuration, change the paths in the recipes directory.
cd alignment-handbook
CUDA_VISIBLE_DEVICES=0,1,2,3 ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
--num_processes 4 \
scripts/run_sft.py \
recipes/selfbiorag_7b/sft/config_full.yaml
- Make a synthetic preference set based on the sampled predictions
export HUGGINGFACE_MODEL_DIR=your_trained_model
export DATA_NAME=kqa_golden
export WODATA_NAME=kqa_golden
python pred_to_preference.py \
--model_name ${HUGGINGFACE_MODEL_DIR} \
--wodata_name ${WODATA_NAME} \
--alpha 1.0 \
--beta 1.0 \
--gamma 1.0 \
--threshold 200
python pred_to_preference.py \
--model_name ${HUGGINGFACE_MODEL_DIR} \
--wodata_name ${WODATA_NAME} \
--data_names ${DATA_NAME} \
--alpha 1.0 \
--beta 1.0 \
--gamma 1.0 \
--threshold 200
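For intuition, alpha, beta, and gamma presumably weight the three groups of automatic metrics used in OLAPH (word composition, semantic similarity, and factuality computed from the MH/NH statements), and the threshold gates which responses form preference pairs. The sketch below is our reading of that logic with hypothetical metric fields; pred_to_preference.py holds the actual implementation:

```python
# A minimal sketch of how alpha/beta/gamma and the threshold might combine to
# build chosen/rejected pairs. Metric fields are hypothetical; see
# pred_to_preference.py for the real logic.
def response_score(pred, alpha=1.0, beta=1.0, gamma=1.0):
    words = pred["rouge1"] + pred["rouge2"] + pred["rougeL"]        # word composition
    semantic = pred["bleurt"] + pred["bertscore"]                   # semantic similarity
    factuality = pred["comprehensiveness"] - pred["hallucination"]  # from MH/NH statements
    return alpha * words + beta * semantic + gamma * factuality

def build_pair(sampled_preds, threshold=200):
    ranked = sorted(sampled_preds, key=response_score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if response_score(best) >= threshold:  # keep only pairs whose best response qualifies
        return {"chosen": best["text"], "rejected": worst["text"]}
    return None
```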
- Direct Preference Optimization (DPO)
cd alignment-handbook
CUDA_VISIBLE_DEVICES=0,1,2,3 ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
--num_processes 4 \
scripts/run_dpo.py \
recipes/selfbiorag_7b/dpo/config_full.yaml
Note that you should change two things:
1. Change the held-out dataset (wodata) name in scripts/run_sft.py and scripts/run_dpo.py.
2. Change model_name_or_path in the config file for iterative training.
We know that the iterative learning procedure is cumbersome to follow, and we will streamline it as soon as possible. Training and sampling predictions currently run through separate files, repeated several times; in the future, we will provide a single bash script that executes the whole process.
Our iterative learning consists of the following steps:
1. Sampling predictions (pdata_collection.py)
2. Make the SFT set (pred_to_sft.py)
3. SFT (alignment-handbook/sft.sh)
4. Sampling predictions (pdata_collection.py)
5. Make the preference set (pred_to_preference.py)
6. DPO (alignment-handbook/dpo.sh)
7. Sampling predictions (pdata_collection.py)
8. Make the preference set (pred_to_preference.py)
9. DPO (alignment-handbook/dpo.sh)
10. Sampling predictions (pdata_collection.py)
11. Make the preference set (pred_to_preference.py)
12. DPO (alignment-handbook/dpo.sh)
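Until that single script is available, a rough driver in the same spirit might look like the sketch below (the script paths come from the list above; per-step arguments are elided):

```python
# Rough driver for the iterative loop above; per-step arguments are elided.
import subprocess

def run(*cmd):
    subprocess.run(list(cmd), check=True)

run("python", "pdata_collection.py")          # initial sampling
run("python", "pred_to_sft.py")               # build the SFT set
run("bash", "alignment-handbook/sft.sh")      # SFT
for _ in range(3):                            # three DPO rounds, as listed above
    run("python", "pdata_collection.py")      # re-sample with the latest model
    run("python", "pred_to_preference.py")    # build the preference set
    run("bash", "alignment-handbook/dpo.sh")  # DPO
```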
We provide detailed experimental settings and results in FActScore.
1. Do you provide results for each step of sampling, SFT, and DPO?
A. We provide the sampling results of every 7B model in the alignment-handbook/predictions/ folder.
2. [A/B Testing] Is the human evaluation for the K-QA dataset open-sourced?
A. We provide the GPT-4 evaluation and the evaluations of 3 medical experts for the A/B test in the Human-Evaluation folder.
3. When using Wikipedia as the knowledge source, it seems the topics need to be titles of the Wikipedia pages. I wonder what topics you use for datasets like K-QA?
A. We manually extracted biomedical or medical named entities from the questions in the K-QA dataset, as they were intuitively recognizable. If you want to utilize this in an automatic way, you could combine it with a named entity recognition model to extract the entities, then perform normalization. By doing this, you can construct a knowledge source using retrieved chunks of entities that have corresponding pages on Wikipedia.
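As a sketch of that automatic alternative, one could plug in a biomedical NER model such as scispaCy (the model choice here is illustrative, not what we used; our extraction was manual):

```python
# Hypothetical automatic pipeline replacing manual entity extraction:
# biomedical NER with scispaCy. Normalization to canonical Wikipedia page
# titles is left as a separate step.
import spacy

nlp = spacy.load("en_core_sci_sm")  # a scispaCy biomedical model

def extract_topics(question):
    doc = nlp(question)
    # Each recognized entity is a candidate Wikipedia page title.
    return [ent.text for ent in doc.ents]

print(extract_topics("Can a red eye be serious?"))
```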
4. Is it possible to share the biomedical knowledge source that you built for Factscore?
A. Please refer to the Self-BioRAG repository to use our biomedical knowledge source!
@article{jeong2024olaph,
title={OLAPH: Improving Factuality in Biomedical Long-form Question Answering},
author={Jeong, Minbyul and Hwang, Hyeon and Yoon, Chanwoong and Lee, Taewhoo and Kang, Jaewoo},
journal={arXiv preprint arXiv:2405.12701},
year={2024}
}
For help or issues using MedLFQA & OLAPH, please submit a GitHub issue.
Please contact Minbyul Jeong (minbyuljeong (at) korea.ac.kr) or Hyeon Hwang (hyeon-hwang (at) korea.ac.kr) for communication related to OLAPH.