Official code for our paper: "Language Models Learn to Mislead Humans via RLHF"

Jiaxin-Wen/MisleadLM

Language Models Learn to Mislead Humans via RLHF

This repository contains data and code for our paper:

Language Models Learn to Mislead Humans via RLHF

1. Installation

conda create -n mislead python=3.10
conda activate mislead
pip install -e .

2. RLHF Training

2.1 Programming

cd src/programming
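python reward_api.py # reward service (assumed to be long-running; keep it up while train.sh runs, e.g. in a separate terminal)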
bash train.sh

2.2 Question Answering

cd src/qa/reward
bash train_judge.sh # task-specific reward training
bash train_preference.sh # general reward training

cd ..
CUDA_VISIBLE_DEVICES=6 python reward_api.py # general reward
CUDA_VISIBLE_DEVICES=7 python judge_api.py # task-specific reward
bash train.sh
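
The commands above start two reward services and then launch training. If `reward_api.py` and `judge_api.py` are long-running servers (an assumption; the steps above do not say so), they need to stay up while `train.sh` runs. Below is a minimal sketch of one way to script that. `run_with_services` and `STARTUP_WAIT` are hypothetical names introduced here for illustration and are not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): start service commands
# in the background, run a main command in the foreground, then stop the services.
# Assumes each service is a long-running process that can be stopped with SIGTERM.
run_with_services() {
  # usage: run_with_services "<main command>" "<service command>"...
  main_cmd=$1
  shift
  pids=""
  for svc in "$@"; do
    sh -c "$svc" &              # start each service in the background
    pids="$pids $!"             # remember its PID so it can be stopped later
  done
  sleep "${STARTUP_WAIT:-0}"    # optional grace period (seconds) for services to load
  status=0
  sh -c "$main_cmd" || status=$?   # run the main command and keep its exit code
  kill $pids 2>/dev/null || true   # stop the services once the main command exits
  wait $pids 2>/dev/null || true   # reap them so no background jobs linger
  return "$status"
}

# Example, run from src/qa (GPU indices taken from the commands above).
# "exec env ..." makes the recorded PID belong to the Python process itself.
# STARTUP_WAIT=60 run_with_services "bash train.sh" \
#   "exec env CUDA_VISIBLE_DEVICES=6 python reward_api.py" \
#   "exec env CUDA_VISIBLE_DEVICES=7 python judge_api.py"
```

The fixed `STARTUP_WAIT` delay is a crude stand-in for a readiness check. If the services expose a port, polling it before training starts would be more reliable than sleeping.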

3. Fine-tuned Checkpoints
