Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset for train to translate language #1303

Open
nichellehouston opened this issue Nov 18, 2024 · 1 comment
Open

Dataset for train to translate language #1303

nichellehouston opened this issue Nov 18, 2024 · 1 comment

Comments

@nichellehouston
Copy link

I want to use https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing for training model to translate language with this dataset https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-fr how can use it in notebook Data Prep?

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
{}

Input:
{}

Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, input, output in zip(instructions, inputs, outputs):
# Must add EOS_TOKEN, otherwise your generation will go on forever!
text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
texts.append(text)
return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

@Erland366
Copy link
Contributor

For me I'd set the instruction as "Translate this from English to French" (hardcoded) and set the inputs as inputs=examples["translation"]["en"] and outputs as outputs=examples["translation["fr"] or you can just do it vice versa etc (just set whatever the source and destination language)

Then on inference, make sure you use the same template as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants