Fine-tuning DistilBERT for Sentiment Analysis
!pip install datasets transformers huggingface_hub evaluate
!pip install "accelerate>=0.20.1"  # quote the spec so the shell doesn't treat ">" as a redirect
import torch
from datasets import load_dataset
imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(100))
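As a quick sanity check (a minimal sketch; the exact review text you see depends on the shuffle), each example is a dict with a text field and a label field, where 0 is negative and 1 is positive:

print(imdb)                             # DatasetDict with train/test/unsupervised splits
print(small_train_dataset[0]["label"])  # 0 = negative, 1 = positive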
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
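After mapping, each example keeps its original text and label columns and gains the tokenizer outputs. DistilBERT's tokenizer produces input_ids and attention_mask (it has no token_type_ids):

print(tokenized_train.column_names)  # ['text', 'label', 'input_ids', 'attention_mask']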
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
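DataCollatorWithPadding pads each batch to the length of its longest member rather than to a fixed maximum, which saves compute on short reviews. A small illustration, assuming the two tokenized examples from above:

features = [
    {k: tokenized_train[i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, length of the longer of the two sequences)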
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
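Optionally (a variant not used in this tutorial), you can attach human-readable label names at load time; if you adopt it, the pipeline at the end will report "positive"/"negative" instead of the generic LABEL_1/LABEL_0:

# Hypothetical variant of the call above: same model, but with named labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)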
import numpy as np
import evaluate  # datasets.load_metric is deprecated/removed; the evaluate library replaces it

# Load the metrics once, rather than on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
from huggingface_hub import notebook_login
notebook_login()
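notebook_login only works inside a notebook; in a plain script (an alternative, not part of the original workflow), you can authenticate with login and a token from your Hugging Face account settings:

import os
from huggingface_hub import login

# Alternative for scripts: read a token exported as the HF_TOKEN environment variable.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)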
from transformers import TrainingArguments, Trainer
repo_name = "finetuning-sentiment-model-3000-samples"
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
trainer.push_to_hub()
from transformers import pipeline
sentiment_model = pipeline(model="Saahil1801/finetuning-sentiment-model-3000-samples")
sentiment_model(["This movie is awesome", "This movie is bad!"])
# LABEL_1 means positive and LABEL_0 means negative.
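Each input yields a dict with a label and a confidence score; a small sketch for printing them side by side (the actual scores depend on your training run):

reviews = ["This movie is awesome", "This movie is bad!"]
for review, result in zip(reviews, sentiment_model(reviews)):
    print(f"{review!r} -> {result['label']} (score={result['score']:.3f})")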
Code Explanation
Below is the code workflow without the installation commands.
1. Import necessary libraries: import torch and the required modules from datasets, transformers, and huggingface_hub.
2. Load and preprocess the dataset: load the IMDB dataset with the load_dataset function from datasets, then shuffle and select a small subset of the training and test splits.
3. Initialize the tokenizer: load a pre-trained tokenizer (distilbert-base-uncased) from transformers.
4. Define a preprocessing function: create a function that tokenizes the text data with the loaded tokenizer.
5. Tokenize the datasets: apply the preprocessing function to the training and test datasets with the map method.
6. Prepare the data collator: initialize DataCollatorWithPadding with the tokenizer.
7. Load the pre-trained model: load a pre-trained sequence-classification model (distilbert-base-uncased) from transformers.
8. Define the metrics function: create a function that computes accuracy and F1 with the evaluate library.
9. Log into the Hugging Face Hub: use notebook_login to authenticate.
10. Set up the training arguments: configure the learning rate, batch size, number of epochs, weight decay, and saving strategy with TrainingArguments.
11. Initialize the Trainer: create a Trainer object with the model, training arguments, datasets, tokenizer, data collator, and metrics function.
12. Train the model: call the train method on the Trainer object to start training.
13. Evaluate the model: call the evaluate method on the Trainer object to score the model on the test dataset.
14. Push the model to the Hugging Face Hub: upload the trained model with the push_to_hub method.
15. Load and use the model pipeline: initialize a sentiment-analysis pipeline with the trained model and use it to make predictions on new text.