Generate a Cantonese instruction dataset with Gemini Pro, using Stanford's Alpaca prompts, for fine-tuning LLMs. This repo contains a script that generates the dataset, along with seed prompts from the Alpaca repo that were manually translated into Cantonese.
You can find the generated dataset on Hugging Face here.
pip install -r requirements.txt
export GOOGLE_AISTUDIO_API_KEY=YOUR_API_KEY
python generate.py
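The sketch below illustrates the general idea behind the steps above: read the API key from the environment, then assemble a few Cantonese seed tasks into an Alpaca-style few-shot prompt that asks the model for a new instruction. It is an assumption about how such a script could be structured, not the actual contents of `generate.py`. The names `SEED_TASKS`, `read_api_key`, and `build_prompt`, the sample tasks, and the prompt wording are all hypothetical, and the call to Gemini Pro is left out on purpose.

```python
import os
import random

# Hypothetical seed tasks for illustration only; the real seed prompts
# are the manually translated ones shipped with this repo.
SEED_TASKS = [
    {"instruction": "將以下句子翻譯成英文。", "input": "我今日好開心。",
     "output": "I am very happy today."},
    {"instruction": "列出三種香港常見嘅街頭小食。", "input": "",
     "output": "魚蛋、雞蛋仔、燒賣。"},
    {"instruction": "用一句說話解釋咩係機器學習。", "input": "",
     "output": "機器學習係畀電腦由數據入面自己學識規律。"},
]


def read_api_key(env=os.environ):
    """Return the API key from GOOGLE_AISTUDIO_API_KEY, or fail loudly."""
    key = env.get("GOOGLE_AISTUDIO_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_AISTUDIO_API_KEY is not set")
    return key


def build_prompt(seed_tasks, num_examples=2, rng=None):
    """Build a few-shot prompt from randomly sampled seed tasks.

    The numbered Instruction/Input/Output layout is an assumed format,
    loosely modelled on the Alpaca generation prompt.
    """
    if rng is None:
        rng = random.Random()
    examples = rng.sample(seed_tasks, num_examples)
    lines = ["請用廣東話寫出新嘅指令任務,格式同以下例子一樣。", ""]
    for i, task in enumerate(examples, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        # Tasks without an input get an explicit placeholder.
        lines.append(f"{i}. Input: {task['input'] or '<noinput>'}")
        lines.append(f"{i}. Output: {task['output']}")
    # Leave the next slot open for the model to complete.
    lines.append(f"{len(examples) + 1}. Instruction:")
    return "\n".join(lines)


if __name__ == "__main__":
    # The resulting string would be sent to the model by the real script.
    print(build_prompt(SEED_TASKS, num_examples=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the sampled examples reproducible, which helps when comparing prompts between runs.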
@misc{alpaca,
author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
title = {Stanford Alpaca: An Instruction-following LLaMA model},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}