Multi-Strategy Fine-Tuning of LLMs for Sequence Classification: BERT and LLaMA with Soft and Hard Prompting

This project demonstrates how to adapt pre-trained language models, BERT and LLaMA, to a sequence classification task (GLUE MRPC paraphrase detection) using standard fine-tuning, soft prompting, and hard prompting, and compares how efficiently each method adapts the models to the task.

Techniques Used

  • Standard Fine-tuning with BERT:

    • Model: bert-base-uncased
    • Dataset: GLUE MRPC dataset
    • Layers: All BERT layers are initially frozen, except for the last two encoder layers, which are unfrozen and fine-tuned (a minimal setup sketch follows this list).
    • Hyperparameter tuning: Learning rate, batch size, number of epochs, and weight decay were tuned for optimal performance.
    • Evaluation Metrics: Accuracy, classification report, confusion matrix, and ROC-AUC curve.
    • Output: The fine-tuned model and tokenizer are saved for further inference.
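
A minimal sketch of this setup, assuming the Hugging Face transformers and datasets libraries; the hyperparameter values shown are illustrative, not the tuned ones:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the whole BERT encoder, then unfreeze the last two layers.
# The classification head lives outside model.bert and stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# GLUE MRPC: sentence pairs labeled equivalent / not equivalent.
mrpc = load_dataset("glue", "mrpc").map(
    lambda b: tokenizer(b["sentence1"], b["sentence2"], truncation=True,
                        padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mrpc",
                           learning_rate=2e-5,  # illustrative values only
                           per_device_train_batch_size=16,
                           num_train_epochs=3,
                           weight_decay=0.01),
    train_dataset=mrpc["train"],
    eval_dataset=mrpc["validation"],
)
trainer.train()

# Save the fine-tuned model and tokenizer for later inference.
trainer.save_model("bert-mrpc-finetuned")
tokenizer.save_pretrained("bert-mrpc-finetuned")
```
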
  • Soft Prompting:

    • Instead of adding natural-language tokens to the input as in hard prompting, this method introduces trainable soft prompts (continuous embedding vectors) that are prepended to the input token embeddings.
    • The soft-prompt model concatenates a learnable soft prompt with the original input embeddings and feeds the result through BERT for sequence classification (sketched in the code after this list).
    • Evaluation: Accuracy, classification report, confusion matrix, and ROC-AUC were used to evaluate model performance.
    • Output: The soft-prompting model and tokenizer are saved.
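
A sketch of the soft-prompt wrapper described above. The prompt length (20 vectors) is an assumed value, and the BERT backbone is frozen here so that only the soft prompt and the classification head train; the project's exact choices may differ:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SoftPromptBert(nn.Module):
    """BERT classifier with n_prompt trainable vectors prepended to the input."""

    def __init__(self, n_prompt=20, num_labels=2, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        # Trainable soft prompt, initialized near the embedding scale.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)
        self.classifier = nn.Linear(hidden, num_labels)
        # Assumption: the backbone is frozen; only the prompt and head train.
        for p in self.bert.parameters():
            p.requires_grad = False

    def forward(self, input_ids, attention_mask):
        embeds = self.bert.embeddings.word_embeddings(input_ids)
        batch, n = embeds.size(0), self.soft_prompt.size(0)
        # Prepend the soft prompt to every sequence in the batch.
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)
        pad = torch.ones(batch, n, dtype=attention_mask.dtype,
                         device=attention_mask.device)
        mask = torch.cat([pad, attention_mask], dim=1)
        out = self.bert(inputs_embeds=embeds, attention_mask=mask)
        # Classify from the hidden state at the original [CLS] position.
        return self.classifier(out.last_hidden_state[:, n])
```
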
  • Hard Prompting with LLaMA:

    • Hard prompts explicitly structure the task within the input itself. For instance, the question “Does sentence 1 mean the same as sentence 2?” is embedded directly into the prompt.
    • Model: LLaMA 3.1 was used for inference, with temperature-controlled (greedy) decoding so that outputs are deterministic (see the sketch after this list).
    • Performance was evaluated using classification accuracy and reports.
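
A sketch of the hard-prompting loop, assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint served through the transformers text-generation pipeline; the exact model id, prompt wording, and answer parsing are illustrative:

```python
from transformers import pipeline

# Model id is an assumption; any LLaMA 3.1 instruct checkpoint works similarly.
generator = pipeline("text-generation",
                     model="meta-llama/Llama-3.1-8B-Instruct")

def classify_pair(sentence1: str, sentence2: str) -> int:
    # The task is embedded directly in the input as a hard prompt.
    prompt = (f'Sentence 1: "{sentence1}"\n'
              f'Sentence 2: "{sentence2}"\n'
              "Does sentence 1 mean the same as sentence 2? Answer Yes or No.\n"
              "Answer:")
    # Greedy decoding (do_sample=False) makes the output deterministic,
    # playing the same role as a near-zero temperature.
    out = generator(prompt, max_new_tokens=3, do_sample=False)
    answer = out[0]["generated_text"][len(prompt):].strip().lower()
    return 1 if answer.startswith("yes") else 0  # 1 = Equivalent

print(classify_pair("The cat sat on the mat.", "A cat was sitting on the mat."))
```
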
  • Evaluation and Analysis:

    • The final evaluation was performed using multiple metrics, such as classification reports, confusion matrices, and ROC-AUC curves for each model.
    • Visualization tools (matplotlib) were used to plot confusion matrices and ROC curves, allowing easy comparison of model performance (example below).
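
For example, a per-model report and the two plots can be produced along these lines; the label arrays below are placeholders for real model outputs:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, auc,
                             classification_report, confusion_matrix,
                             roc_curve)

# Placeholder outputs; replace with a model's gold labels, predicted
# labels, and positive-class probabilities.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

print(classification_report(
    y_true, y_pred, target_names=["Not Equivalent", "Equivalent"]))

# Confusion matrix plot.
ConfusionMatrixDisplay(confusion_matrix(y_true, y_pred),
                       display_labels=["Not Equivalent", "Equivalent"]).plot()
plt.title("Confusion Matrix")

# ROC curve with its AUC.
fpr, tpr, _ = roc_curve(y_true, y_score)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```
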

Results

BERT Fine-Tuned Model:

  • Accuracy: 83%
  • F1-Score (Not Equivalent): 0.72
  • F1-Score (Equivalent): 0.88

BERT with Soft Prompting:

  • Accuracy: 77%
  • F1-Score (Not Equivalent): 0.54
  • F1-Score (Equivalent): 0.85

LLaMA 3.1 with Hard Prompting:

  • Accuracy: 73%
  • F1-Score (Not Equivalent): 0.51
  • F1-Score (Equivalent): 0.82

Conclusion

The fine-tuned BERT model performs best overall, with the highest accuracy (83%) and the most balanced precision, recall, and F1-scores across both classes. It outperforms both soft-prompted BERT and hard-prompted LLaMA, particularly at detecting “Not Equivalent” sentence pairs.