
Multi-Strategy Fine-Tuning of LLMs for Sequence Classification: BERT and LLaMA with Soft and Hard Prompting
This project demonstrates how to adapt pre-trained language models such as BERT and LLaMA for sequence classification using standard fine-tuning, soft prompting, and hard prompting, and compares the three approaches on the GLUE MRPC paraphrase-detection task.
Techniques Used
Standard Fine-tuning with BERT:
- Model: bert-base-uncased
- Dataset: GLUE MRPC dataset
- Layers: All BERT layers are frozen except the last two encoder layers, which remain trainable and are fine-tuned (see the sketch after this list).
- Optimizations: Learning rate, batch size, number of epochs, weight decay, and other hyperparameters were tuned for optimal performance.
- Evaluation Metrics: Accuracy, classification report, confusion matrix, and ROC-AUC curve.
- Output: The fine-tuned model and tokenizer are saved for further inference.
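The training code itself is not included in this write-up; the snippet below is a minimal sketch of the partial-freezing setup described above, assuming the Hugging Face transformers API. The layer indices, example sentence pair, and variable names are illustrative, and the hyperparameter tuning (learning rate, batch size, epochs, weight decay) would sit on top of this in a standard Trainer or PyTorch loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze every parameter first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze the last two encoder layers and the classification head.
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

# An MRPC-style sentence pair is tokenized as a single pair input.
batch = tokenizer(
    "The company said quarterly profits rose.",
    "Quarterly profits increased, the company reported.",
    truncation=True, padding=True, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(dim=-1))  # probabilities for [not equivalent, equivalent]
```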
Soft Prompting:
- Instead of adding tokens explicitly as in hard prompting, this method introduces trainable soft prompts (embedding vectors) which are prepended to the input token embeddings.
- The soft prompt model concatenates a learnable soft prompt with the original input embeddings and uses BERT for sequence classification (a sketch follows this list).
- Evaluation: Accuracy, classification report, confusion matrix, and ROC-AUC were used to evaluate model performance.
- Output: The soft-prompting model and tokenizer are saved.
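One way such a soft-prompt wrapper could look is sketched below, again assuming the Hugging Face transformers API. The class name, prompt length, and initialization scale are illustrative and may differ from the project's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class SoftPromptBert(nn.Module):
    """Prepends a trainable soft prompt to BERT's input embeddings."""

    def __init__(self, model_name="bert-base-uncased", prompt_length=20, num_labels=2):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        hidden_size = self.model.config.hidden_size
        # Learnable soft prompt of shape (prompt_length, hidden_size).
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)
        # Freeze the BERT encoder so only the soft prompt and classifier head train.
        for param in self.model.bert.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask, labels=None):
        # Standard token embeddings for the input.
        token_embeds = self.model.bert.embeddings.word_embeddings(input_ids)
        batch_size = token_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Prepend the soft prompt and extend the attention mask to cover it.
        inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
        prompt_mask = torch.ones(
            batch_size, prompt.size(1),
            dtype=attention_mask.dtype, device=attention_mask.device,
        )
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.model(
            inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels
        )
```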
Hard Prompting with LLaMA:
- Hard prompts explicitly structure the task within the input sentences. For instance, the task “Does sentence 1 mean the same as sentence 2?” is embedded into the input directly.
- Model: LLaMA 3.1 was used for inference, with the sampling temperature controlled so that the output stays deterministic (see the sketch after this list).
- Performance was evaluated using classification accuracy and reports.
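The exact LLaMA 3.1 checkpoint and serving setup are not stated in the write-up; the sketch below assumes a gated Hugging Face instruct checkpoint purely for illustration, with the prompt template and greedy decoding standing in for the hard-prompt inference described above.

```python
from transformers import pipeline

def build_hard_prompt(sentence1: str, sentence2: str) -> str:
    # The task instruction is written directly into the input, as described above.
    return (
        "Does sentence 1 mean the same as sentence 2? "
        "Answer with 'equivalent' or 'not equivalent'.\n"
        f"Sentence 1: {sentence1}\n"
        f"Sentence 2: {sentence2}\n"
        "Answer:"
    )

# Assumed checkpoint; the project may have used a different LLaMA 3.1 variant.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

prompt = build_hard_prompt(
    "The company said quarterly profits rose.",
    "Quarterly profits increased, the company reported.",
)
# Greedy decoding (do_sample=False) keeps the answer deterministic, which is the
# effect the temperature control in the description aims for.
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```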
Evaluation and Analysis:
- The final evaluation was performed using multiple metrics, such as classification reports, confusion matrices, and ROC-AUC curves for each model.
- Visualization tools (matplotlib) were used to plot confusion matrices and ROC curves, allowing easy comparison of model performance (a plotting sketch follows this list).
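A compact evaluation helper along these lines covers all the metrics listed above; the function name is illustrative, and y_true, y_pred, and y_score (the predicted probability of the "Equivalent" class) are assumed to come from one of the models described earlier.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
)

def evaluate(y_true, y_pred, y_score, title):
    # Per-class precision, recall, and F1.
    print(classification_report(
        y_true, y_pred, target_names=["Not Equivalent", "Equivalent"]
    ))

    # Confusion matrix.
    cm = confusion_matrix(y_true, y_pred)
    ConfusionMatrixDisplay(cm, display_labels=["Not Equivalent", "Equivalent"]).plot()
    plt.title(f"{title} - Confusion Matrix")
    plt.show()

    # ROC curve and AUC.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"{title} - ROC Curve")
    plt.legend()
    plt.show()
```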
Results
BERT Fine-Tuned Model:
- Accuracy: 83%
- F1-Score (Not Equivalent): 0.72
- F1-Score (Equivalent): 0.88
BERT with Soft Prompting:
- Accuracy: 77%
- F1-Score (Not Equivalent): 0.54
- F1-Score (Equivalent): 0.85
LLaMA 3.1 with Hard Prompting:
- Accuracy: 73%
- F1-Score (Not Equivalent): 0.51
- F1-Score (Equivalent): 0.82
Conclusion
The fine-tuned BERT model performs best overall with the highest accuracy (83%) and balanced precision, recall, and F1-scores across both classes. It outperforms both soft-prompted BERT and hard-prompted LLaMA, particularly in the detection of “Not Equivalent” sentence pairs.