Multi-Strategy Fine-Tuning of LLMs for Sequence Classification: BERT and LLaMA with Soft and Hard Prompting

This project demonstrates how to adapt pre-trained language models, BERT and LLaMA, to a sequence classification task (GLUE MRPC paraphrase detection) using standard fine-tuning, soft prompting, and hard prompting, and compares how efficiently each method adapts the models to the task.

Techniques Used

  • Standard Fine-tuning with BERT:

    • Model: bert-base-uncased
    • Dataset: GLUE MRPC dataset
    • Layers: All BERT layers are initially frozen, except for the last two encoder layers, which are unfrozen and fine-tuned (a minimal setup sketch follows this list).
    • Hyperparameter tuning: Learning rate, batch size, number of epochs, and weight decay were tuned for optimal performance.
    • Evaluation Metrics: Accuracy, classification report, confusion matrix, and ROC-AUC curve.
    • Output: The fine-tuned model and tokenizer are saved for further inference.
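
A minimal sketch of this setup, assuming the Hugging Face transformers and datasets libraries; the hyperparameter values shown are illustrative, not the tuned ones:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the whole BERT encoder, then unfreeze the last two layers.
# The classification head lives outside model.bert and stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# GLUE MRPC: sentence pairs labeled equivalent / not equivalent.
mrpc = load_dataset("glue", "mrpc").map(
    lambda b: tokenizer(b["sentence1"], b["sentence2"], truncation=True,
                        padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mrpc",
                           learning_rate=2e-5,  # illustrative values only
                           per_device_train_batch_size=16,
                           num_train_epochs=3,
                           weight_decay=0.01),
    train_dataset=mrpc["train"],
    eval_dataset=mrpc["validation"],
)
trainer.train()

# Save the fine-tuned model and tokenizer for later inference.
trainer.save_model("bert-mrpc-finetuned")
tokenizer.save_pretrained("bert-mrpc-finetuned")
```
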
  • Soft Prompting:

    • Instead of adding natural-language tokens to the input as in hard prompting, this method introduces trainable soft prompts (continuous embedding vectors) that are prepended to the input token embeddings.
    • The soft-prompt model concatenates a learnable soft prompt with the original input embeddings and feeds the result through BERT for sequence classification (sketched in the code after this list).
    • Evaluation: Accuracy, classification report, confusion matrix, and ROC-AUC were used to evaluate model performance.
    • Output: The soft-prompting model and tokenizer are saved.
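
A sketch of the soft-prompt wrapper described above. The prompt length (20 vectors) is an assumed value, and the BERT backbone is frozen here so that only the soft prompt and the classification head train; the project's exact choices may differ:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SoftPromptBert(nn.Module):
    """BERT classifier with n_prompt trainable vectors prepended to the input."""

    def __init__(self, n_prompt=20, num_labels=2, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        # Trainable soft prompt, initialized near the embedding scale.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)
        self.classifier = nn.Linear(hidden, num_labels)
        # Assumption: the backbone is frozen; only the prompt and head train.
        for p in self.bert.parameters():
            p.requires_grad = False

    def forward(self, input_ids, attention_mask):
        embeds = self.bert.embeddings.word_embeddings(input_ids)
        batch, n = embeds.size(0), self.soft_prompt.size(0)
        # Prepend the soft prompt to every sequence in the batch.
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)
        pad = torch.ones(batch, n, dtype=attention_mask.dtype,
                         device=attention_mask.device)
        mask = torch.cat([pad, attention_mask], dim=1)
        out = self.bert(inputs_embeds=embeds, attention_mask=mask)
        # Classify from the hidden state at the original [CLS] position.
        return self.classifier(out.last_hidden_state[:, n])
```
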
  • Hard Prompting with LLaMA:

    • Hard prompts explicitly structure the task within the input itself. For instance, the question “Does sentence 1 mean the same as sentence 2?” is embedded directly into the prompt.
    • Model: LLaMA 3.1 was used for inference, with temperature-controlled (greedy) decoding so that outputs are deterministic (see the sketch after this list).
    • Performance was evaluated using classification accuracy and reports.
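
A sketch of the hard-prompting loop, assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint served through the transformers text-generation pipeline; the exact model id, prompt wording, and answer parsing are illustrative:

```python
from transformers import pipeline

# Model id is an assumption; any LLaMA 3.1 instruct checkpoint works similarly.
generator = pipeline("text-generation",
                     model="meta-llama/Llama-3.1-8B-Instruct")

def classify_pair(sentence1: str, sentence2: str) -> int:
    # The task is embedded directly in the input as a hard prompt.
    prompt = (f'Sentence 1: "{sentence1}"\n'
              f'Sentence 2: "{sentence2}"\n'
              "Does sentence 1 mean the same as sentence 2? Answer Yes or No.\n"
              "Answer:")
    # Greedy decoding (do_sample=False) makes the output deterministic,
    # playing the same role as a near-zero temperature.
    out = generator(prompt, max_new_tokens=3, do_sample=False)
    answer = out[0]["generated_text"][len(prompt):].strip().lower()
    return 1 if answer.startswith("yes") else 0  # 1 = Equivalent

print(classify_pair("The cat sat on the mat.", "A cat was sitting on the mat."))
```
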
  • Evaluation and Analysis:

    • The final evaluation was performed using multiple metrics, such as classification reports, confusion matrices, and ROC-AUC curves for each model.
    • Visualization tools (matplotlib) were used to plot confusion matrices and ROC curves, allowing easy comparison of model performance (example below).
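
For example, a per-model report and the two plots can be produced along these lines; the label arrays below are placeholders for real model outputs:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, auc,
                             classification_report, confusion_matrix,
                             roc_curve)

# Placeholder outputs; replace with a model's gold labels, predicted
# labels, and positive-class probabilities.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

print(classification_report(
    y_true, y_pred, target_names=["Not Equivalent", "Equivalent"]))

# Confusion matrix plot.
ConfusionMatrixDisplay(confusion_matrix(y_true, y_pred),
                       display_labels=["Not Equivalent", "Equivalent"]).plot()
plt.title("Confusion Matrix")

# ROC curve with its AUC.
fpr, tpr, _ = roc_curve(y_true, y_score)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```
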

Results

BERT Fine-Tuned Model:

  • Accuracy: 83%
  • F1-Score (Not Equivalent): 0.72
  • F1-Score (Equivalent): 0.88

BERT with Soft Prompting:

  • Accuracy: 77%
  • F1-Score (Not Equivalent): 0.54
  • F1-Score (Equivalent): 0.85

LLaMA 3.1 with Hard Prompting:

  • Accuracy: 73%
  • F1-Score (Not Equivalent): 0.51
  • F1-Score (Equivalent): 0.82

Conclusion

The fine-tuned BERT model performs best overall, with the highest accuracy (83%) and the most balanced precision, recall, and F1-scores across both classes. It outperforms both soft-prompted BERT and hard-prompted LLaMA, particularly at detecting “Not Equivalent” sentence pairs.