Dataset for train to translate language #1303

nichellehouston · 2024-11-18T04:06:07Z

I want to use the notebook at https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing to train a model to translate between languages, using this dataset: https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-fr. How can I use that dataset in the notebook's Data Prep section?

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
{}

Input:
{}

Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Erland366 · 2024-11-18T21:59:26Z

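For me, I'd hardcode the instruction as "Translate this from English to French", then take the inputs from the "en" field and the outputs from the "fr" field of examples["translation"]. With batched = True that column arrives as a list of dicts, so something like inputs = [t["en"] for t in examples["translation"]] and outputs = [t["fr"] for t in examples["translation"]]. You can also do it the other way around: just set whichever source and destination languages you need.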
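A minimal sketch of that suggestion, assuming the opus-100 en-fr rows keep their text in a "translation" column of {"en": ..., "fr": ...} dicts. The helper name make_formatting_func, the src/tgt parameters, and the "</s>" placeholder EOS token are made up for illustration; in the notebook you would pass tokenizer.eos_token instead.

```python
# Alpaca-style template, copied as it appears in the question above.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
{}

Input:
{}

Response:
{}"""

# Hardcoded instruction, as suggested; change it if you swap the languages.
INSTRUCTION = "Translate this from English to French"

def make_formatting_func(eos_token, src="en", tgt="fr"):
    """Hypothetical helper: builds a batched map function for a `translation` column."""
    def formatting_prompts_func(examples):
        texts = []
        # With batched=True, examples["translation"] is a list of {lang: text} dicts.
        for pair in examples["translation"]:
            # Append the EOS token so generation learns where to stop.
            texts.append(alpaca_prompt.format(INSTRUCTION, pair[src], pair[tgt]) + eos_token)
        return {"text": texts}
    return formatting_prompts_func

# Tiny in-memory batch with the same shape as an opus-100 en-fr batch.
sample_batch = {"translation": [{"en": "Hello.", "fr": "Bonjour."},
                                {"en": "Thank you.", "fr": "Merci."}]}
# "</s>" is only a placeholder EOS token for this demo.
formatted = make_formatting_func(eos_token="</s>")(sample_batch)
print(formatted["text"][0])

# In the notebook (assumes `tokenizer` was loaded by an earlier cell):
# from datasets import load_dataset
# dataset = load_dataset("Helsinki-NLP/opus-100", "en-fr", split = "train")
# dataset = dataset.map(make_formatting_func(tokenizer.eos_token), batched = True)
```

The commented lines at the end show where this would replace the yahma/alpaca-cleaned loading step in the Data Prep cell.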

Then at inference time, make sure you use the same template as well.
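As a rough sketch of what "the same template" could look like at inference: reuse the template and instruction, but leave the response slot empty and do not append the EOS token, so the model fills in the translation. build_inference_prompt is a hypothetical helper name, and the commented generation lines assume `model` and `tokenizer` already exist from the notebook's earlier cells.

```python
# Same Alpaca-style template used for training, copied from the question above.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
{}

Input:
{}

Response:
{}"""

# Must match the instruction used when formatting the training data.
INSTRUCTION = "Translate this from English to French"

def build_inference_prompt(source_text):
    """Hypothetical helper: fill the template but leave the response empty."""
    return alpaca_prompt.format(INSTRUCTION, source_text, "")

prompt = build_inference_prompt("Where is the station?")
print(prompt)

# In the notebook (assumes `model` and `tokenizer` from the earlier cells):
# inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
# outputs = model.generate(**inputs, max_new_tokens = 64)
# print(tokenizer.batch_decode(outputs))
```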
