Hi, welcome!
In this repo, we perform parameter-efficient fine-tuning on Meta's Llama2-7b, BERT base, and XLNet base language models using the Enron email corpus, as a proof of concept for email classification at Deutsche Bahn.
The workflow is as follows:
- The Enron email corpus is an open dataset of more than 500,000 raw emails. We work on a subset of ~1,700 pre-labeled emails available here - https://data.world/brianray/enron-email-dataset - and a further subset of ~12,000 emails labeled with GPT-3.5.
- We narrow the labels down to a single layer, i.e. we only consider 'Cat_1_level_1' and 'Cat_1_level_2' for our purpose. The original (layered) labeling methodology is explained in detail here - https://datascience.stackexchange.com/questions/92341/how-to-read-the-labeled-enron-dataset-categories/92737#92737
- Regular expressions are used to clean the heavily unstructured email bodies in the labeled dataset, and the email subjects and bodies are lemmatized (see the sketch below). The dataset is split into training (~1,400) and testing (~300) subsets. The final training dataset is hosted on Hugging Face here - https://huggingface.co/datasets/neelblabla/enron_labeled_email-llama2-7b_finetuning
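A minimal sketch of this cleaning step, assuming the labeled subset is available as a CSV with `Subject`, `content`, and category columns (these column names and regex patterns are illustrative, not the exact ones used in the repo):

```python
import re

import pandas as pd
from nltk.stem import WordNetLemmatizer  # requires: nltk.download("wordnet")
from sklearn.model_selection import train_test_split

lemmatizer = WordNetLemmatizer()

def clean_email(text: str) -> str:
    """Strip forwarded blocks, header lines, addresses and URLs, then lemmatize."""
    text = re.sub(r"-+ ?Forwarded by.*?-+", " ", text, flags=re.DOTALL)  # forwarded blocks
    text = re.sub(r"(From|To|Sent|Cc|Subject):.*", " ", text)            # header lines
    text = re.sub(r"\S+@\S+", " ", text)                                 # email addresses
    text = re.sub(r"http\S+", " ", text)                                 # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                             # non-alphabetic chars
    return " ".join(lemmatizer.lemmatize(tok.lower()) for tok in text.split())

# Illustrative column names; the actual labeled CSV may differ.
df = pd.read_csv("enron_labeled_subset.csv")
df = df[["Subject", "content", "Cat_1_level_1", "Cat_1_level_2"]].dropna()
df["clean_text"] = (df["Subject"] + " " + df["content"]).apply(clean_email)

# Roughly the ~1,400 / ~300 train-test split described above.
train_df, test_df = train_test_split(df, test_size=0.18, random_state=42)
```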
- The following models and methodologies are then used:
- Prompts are generated from the final training dataset and parameter-efficient fine-tuning of Llama2-7b is performed on them (see the sketch below these bullets). The adapter parameters are merged back into the base model and the final fine-tuned model is hosted in the following Hugging Face repo - https://huggingface.co/neelblabla/email-classification-llama2-7b-peft
- The model is then evaluated on the testing dataset (maximum length of input test prompts restricted to 100) - about 65% of the model-generated responses contain strong indicative signals matching the human labels of the emails.
- We observe that the model's performance degrades if the maximum length of the input test prompts is increased to 180, 200, 300, etc.
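A condensed sketch of a LoRA-based PEFT run of this kind, assuming the Hugging Face `transformers`, `peft` and `trl` libraries (older `trl` API where `SFTTrainer` accepts `dataset_text_field` directly); the hyperparameters, LoRA targets, and the `"text"` column name are illustrative assumptions, so refer to the repo's training code for the exact values:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset(
    "neelblabla/enron_labeled_email-llama2-7b_finetuning", split="train"
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative LoRA configuration; the repo's exact rank/alpha/targets may differ.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",  # assumes the prompt column is named "text"
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="llama2-7b-email-peft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()

# Merge the LoRA adapters back into the base weights and push to the Hub.
merged = trainer.model.merge_and_unload()
merged.push_to_hub("neelblabla/email-classification-llama2-7b-peft")
```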
- We employ a pre-trained BERT base model containing 110 million parameters.
- The email content is tokenized using BERT's pre-trained tokenizer.
- To enhance training efficiency, we freeze the first 8 of the 12 encoder layers and fine-tune only the remaining 4 layers on our training data (see the sketch below these bullets).
- For the regularization and optimization methods and the hyperparameters used, refer to the accompanying code.
- Additionally, we add a classifier layer on top of the BERT model, sized to the number of output classes required for the task.
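A minimal sketch of this layer-freezing setup, using `transformers`' `BertForSequenceClassification` as the classifier head (the repo's own code may construct the head differently, and `NUM_LABELS` here is an illustrative placeholder):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

NUM_LABELS = 6  # illustrative; set to the number of email categories used

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LABELS
)

# Freeze the embeddings and the first 8 of BERT base's 12 encoder layers,
# so only the last 4 layers and the classification head are fine-tuned.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# Tokenize a small batch of cleaned email texts (illustrative inputs).
batch = tokenizer(
    ["please review the attached gas contract", "team meeting moved to friday"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (batch_size, NUM_LABELS)
```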
- For XLNet base, a workflow similar to the BERT setup above is followed.
Links:
- PEFT Fine-Tuned Llama2-7b Model - https://huggingface.co/neelblabla/email-classification-llama2-7b-peft
- Labeled Enron Dataset for Fine-Tuning on Llama2-7b - https://huggingface.co/datasets/neelblabla/enron_labeled_email-llama2-7b_finetuning
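For reference, a sketch of how these hosted artifacts can be loaded, assuming the adapters were merged before upload as described above (the prompt string is a hypothetical placeholder; see the dataset for the actual prompt format):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Labeled Enron prompts used for fine-tuning.
dataset = load_dataset("neelblabla/enron_labeled_email-llama2-7b_finetuning")

# Fine-tuned (merged) Llama2-7b model; device_map="auto" requires `accelerate`.
tokenizer = AutoTokenizer.from_pretrained("neelblabla/email-classification-llama2-7b-peft")
model = AutoModelForCausalLM.from_pretrained(
    "neelblabla/email-classification-llama2-7b-peft",
    torch_dtype="auto", device_map="auto",
)

prompt = "Classify the following email: ..."  # hypothetical; use the dataset's prompt template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```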