🔥 Code for the IJCAI 2024 paper "FactCHD: Benchmarking Fact-Conflicting Hallucination Detection". 🔥 The whole dataset is available in FactCHD_Dataset on Google Drive or FactCHD_Dataset on Huggingface.
Large Language Models (LLMs), such as ChatGPT/GPT-4, have garnered widespread attention owing to their myriad practical applications, yet their adoption has been constrained by fact-conflicting hallucinations across web platforms. Assessing the factuality of text produced by LLMs remains inadequately explored, extending not only to the judgment of vanilla facts but also to the evaluation of factual errors emerging in complex inferential tasks such as multi-hop reasoning. In response, we introduce FACTCHD, a fact-conflicting hallucination detection benchmark meticulously designed for LLMs. Functioning as a pivotal tool for evaluating factuality within "Query-Response" contexts, our benchmark assimilates a large-scale dataset encapsulating a broad spectrum of factuality patterns, such as vanilla, multi-hop, comparison, and set-operation patterns. A distinctive feature of our benchmark is its incorporation of fact-based chains of evidence, thereby facilitating comprehensive and conducive factual reasoning throughout the assessment process. We evaluate multiple LLMs, demonstrating the effectiveness of the benchmark and showing that current methods fall short of faithfully detecting factual errors. Furthermore, we present TRUTH-TRIANGULATOR, which synthesizes reflective considerations from tool-enhanced ChatGPT and LoRA-tuned Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence.
Inspired by the saying "to know it and to know the reason why of it" by Zhuzi, FACTCHD incorporates fact-based chains of evidence to provide explanations for its binary predictions. Based on FACTCHD, we aim to explore the application of the fact-conflicting hallucination detection task.
Our benchmark FACTCHD comprises a comprehensive dataset consisting of 51,383 factual/non-factual samples for training and an additional 6,960 samples for LLM analysis and evaluation. It covers a wide range of domains, including health, medicine, climate, science, and more. FACTCHD endeavors to explore the factuality aspect of LLMs by examining four distinct patterns (vanilla, multi-hop, comparison, and set operation) that encompass individual facts and interactions between multiple facts.
Our automated data construction strategy centers around harnessing extensive knowledge graphs (KGs) and covers the data collection process, the generation of "Query-Response" contexts, the construction of fact-based chains of evidence, and human filtering with statistical analysis.
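To make the flow concrete, here is a self-contained toy sketch of KG-driven sample construction. It is illustrative only and not the repository's implementation; the tiny knowledge graph and the verbalization template are invented for this example.

# Toy sketch: sampling a multi-hop path from a tiny KG and verbalizing it
# into a "Query-Response" pair with a fact-based chain of evidence.
# Illustrative only; see data_generate/ for the actual pipeline.
import random

KG = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "spouse", "Emma Thomas"),
]

def sample_multi_hop(kg):
    # Pick a triple whose tail entity is also the head of another triple,
    # then chain the two triples into a two-hop path.
    h1, r1, t1 = random.choice([t for t in kg if any(t[2] == u[0] for u in kg)])
    h2, r2, t2 = next(t for t in kg if t[0] == t1)
    query = f"Who is the {r2} of the {r1} of {h1}?"
    evidence = [(h1, r1, t1), (h2, r2, t2)]
    return query, t2, evidence

query, answer, evidence = sample_multi_hop(KG)
print(query)     # "Who is the spouse of the director of Inception?"
print(answer)    # "Emma Thomas"
print(evidence)  # the fact-based chain of evidence behind the answer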
To run the code, you need to install the requirements:
conda create -n fact python=3.9
conda activate fact
pip install -r requirements.txt
mkdir results
mkdir lora
mkdir data
Here are the baseline models in our evaluation:
We provide a visual representation of the dataset’s domain distribution in the following figures.
We provide several randomly sampled case examples of our FactLLM in the folder /data.
id: the unique identifier of the sample.
instruction: the instruction for the model, describing a fallacy-finder task.
input: includes the #Question# and the #Answer#.
output: includes the label (NON-FACTUAL or FACTUAL) and a justification.
{"id": "common_183152", "instruction": "I want you to act as a 'fallacy finder'. You will be on the lookout for invalid arguments so you can call out any logical errors or inconsistencies that may be present in the #Question# and the #Answer#. Your job is to provide evidence-based feedback and point out any fallacies, faulty reasoning, false assumptions, or incorrect conclusions which may be present in the #Question# and the #Answer#. Begin your response with either FACTUAL or NON-FACTUAL, followed by your justification.", "input": "#Question#: Could you please inform me whether Tata Motors is included in the BSE SENSEX index?\n#Answer#: Tata Motors is excluded from the BSE SENSEX index.", "output": "NON-FACTUAL.The answer that Tata Motors is excluded from the BSE SENSEX index is incorrect. According to the evidence provided, Tata Motors is actually a constituent of the BSE SENSEX index. Therefore, this is a false incorrect conclusion."}
- Taking the high-quality dense subset Wikidata15k from Wikidata as an example, it includes 794 concepts, 16,960 entities, 363 relations, and 846 properties. The generated subgraphs of the knowledge graph contain four types: MULTI_HOP_REASONING, SET_OPERATION, QUANTITATIVE_COMPARISON, and ALL. Other knowledge graph datasets can refer to this code. This code involves datasets like Wikidata and PrimeKG. A toy illustration of one pattern follows the commands below.

cd data_generate
python kg_data_generate.py
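For intuition, the QUANTITATIVE_COMPARISON pattern pairs two entities on a shared numeric property. The sketch below is illustrative only; the entities and elevations are well-known public figures, not taken from Wikidata15k.

# Toy QUANTITATIVE_COMPARISON sample: compare two entities on one numeric property.
# Illustrative only; kg_data_generate.py builds such samples from KG subgraphs.
heights_m = {"Mount Everest": 8849, "K2": 8611}  # elevations in metres

a, b = "Mount Everest", "K2"
query = f"Which is higher, {a} or {b}?"
answer = a if heights_m[a] > heights_m[b] else b
evidence = [(name, "elevation_m", h) for name, h in heights_m.items()]
print(query, "->", answer)  # Which is higher, Mount Everest or K2? -> Mount Everest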
- Taking the FEVER dataset as an example, download the preprocessed dataset FEVER. Other datasets in different domains can refer to the following process. This code involves datasets like FEVER, Climate-Fever, Health-Fever, COVID-FACT, and SCIFACT. A toy claim-conversion sketch follows the commands below.

cd data_generate
python text_data_generate.py
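As a rough illustration of how a claim-verification record can be recast into the "Query-Response" format, consider the hypothetical sketch below. The field names follow the common FEVER layout (a claim plus a SUPPORTS/REFUTES label); text_data_generate.py may process the data differently.

# Hypothetical conversion of a FEVER-style record into a FactCHD-style sample.
fever_record = {
    "claim": "Tata Motors is a constituent of the BSE SENSEX index.",
    "label": "SUPPORTS",  # FEVER labels: SUPPORTS / REFUTES / NOT ENOUGH INFO
}

def to_factchd_sample(record):
    question = f"Is the following claim true? {record['claim']}"
    # A supported claim yields a FACTUAL answer; a refuted one, NON-FACTUAL.
    label = "FACTUAL" if record["label"] == "SUPPORTS" else "NON-FACTUAL"
    return {"input": f"#Question#: {question}", "output": label}

print(to_factchd_sample(fever_record))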
output_dir='lora/alpaca-7b-fact'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2" torchrun --nproc_per_node=3 --master_port=1331 src/finetune.py \
--base_model 'models/alpaca-7b' \
--train_path 'data/fact/train.json' \
--output_dir=${output_dir} \
--batch_size 240 \
--micro_train_batch_size 10 \
--micro_eval_batch_size 10 \
--preprocessing_num_workers 4 \
--num_epochs 10 \
--learning_rate 1e-4 \
--cutoff_len 800 \
--val_set_size 2000 \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--group_by_length \
| tee ${output_dir}/train.log \
2> ${output_dir}/train.err
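As a sanity check on the hyper-parameters above: assuming finetune.py follows the common Alpaca-LoRA convention of deriving gradient-accumulation steps from the global and micro batch sizes, the command accumulates as follows.

# Effective-batch arithmetic for the command above. Assumption: the script
# derives gradient-accumulation steps the way Alpaca-LoRA-style scripts do.
batch_size = 240        # --batch_size (global effective batch)
micro_batch = 10        # --micro_train_batch_size (per-GPU forward pass)
num_gpus = 3            # CUDA_VISIBLE_DEVICES="0,1,2"

grad_accum_steps = batch_size // (micro_batch * num_gpus)
print(grad_accum_steps)  # 8 micro-steps accumulated per optimizer update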
There are some differences between fine-tuning Alpaca and Vicuna; please refer to scripts/run_* for more details.
Please make sure the trained LoRA weights are placed in the path given by lora_weights.
CUDA_VISIBLE_DEVICES="0" python src/inference.py \
--lora_weights 'lora/alpaca-7b-fact' \
--base_model 'models/alpaca-7b' \
--input_file 'data/test.json' \
--output_file 'results/alpaca_7b_fact_test.json'
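Once inference finishes, predictions can be scored with a short sketch like the one below. It assumes the results file keeps one JSON record per line, carrying the gold "output" and a model "prediction" field; the actual field names in results/alpaca_7b_fact_test.json may differ.

import json

def first_label(text):
    # Check NON-FACTUAL first, since "NON-FACTUAL..." also contains "FACTUAL".
    return "NON-FACTUAL" if text.strip().upper().startswith("NON-FACTUAL") else "FACTUAL"

with open("results/alpaca_7b_fact_test.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

correct = sum(first_label(r["prediction"]) == first_label(r["output"]) for r in records)
print(f"Accuracy: {correct / len(records):.3f}")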