Fine-tuning DistilBERT for Sentiment Analysis

				
!pip install datasets transformers huggingface_hub

import torch

# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "accelerate>=0.20.1"

from datasets import load_dataset
imdb = load_dataset("imdb")

# Shuffle with a fixed seed, then keep 1,000 training and 100 test examples
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(100))

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    # Tokenize the reviews, truncating any that exceed the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


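import numpy as np
# Note: load_metric is deprecated and has been removed from recent releases of
# `datasets`; on those versions, use evaluate.load("accuracy") and
# evaluate.load("f1") from the separate `evaluate` package instead.
from datasets import load_metric

def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    # eval_pred holds the model's raw logits and the true labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}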


from huggingface_hub import notebook_login
notebook_login()

from transformers import TrainingArguments, Trainer

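# Note: the name says 3000 samples, but only 1,000 training examples are selected above.
repo_name = "finetuning-sentiment-model-3000-samples"

training_args = TrainingArguments(
    output_dir=repo_name,  # local checkpoint folder, also used as the Hub repository name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    push_to_hub=True,  # upload the model to the Hugging Face Hub
)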

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


trainer.train()

trainer.evaluate()

trainer.push_to_hub()

from transformers import pipeline

sentiment_model = pipeline(model="Saahil1801/finetuning-sentiment-model-3000-samples")

sentiment_model(["This movie is awesome", "This movie is bad!"])

# Output:
# [{'label': 'LABEL_1', 'score': 0.8584421277046204},
#  {'label': 'LABEL_0', 'score': 0.8439799547195435}]

# LABEL_1 means positive and LABEL_0 means negative.
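
The pipeline reports the generic names LABEL_0 and LABEL_1 because no human-readable label names were configured for the fine-tuned model. A minimal sketch of how the raw output could be post-processed is shown below. The LABEL_NAMES mapping and the readable helper are hypothetical names introduced for illustration, and the sample predictions are the ones printed above.

```python
# Hypothetical mapping, based on the note above: LABEL_1 = positive, LABEL_0 = negative.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable(predictions):
    """Replace generic pipeline labels with human-readable names."""
    return [
        {
            # Fall back to the original label if it is not in the mapping
            "label": LABEL_NAMES.get(p["label"], p["label"]),
            "score": round(p["score"], 4),
        }
        for p in predictions
    ]

# Output copied from the pipeline call above.
raw = [
    {"label": "LABEL_1", "score": 0.8584421277046204},
    {"label": "LABEL_0", "score": 0.8439799547195435},
]

result = readable(raw)
print(result)
```

An alternative that is not part of the original notebook would be to pass id2label and label2id mappings when loading the model with from_pretrained, so that the pipeline should emit the readable names directly.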
				
			

Code Explanation

Below is the code workflow without the installation commands.

  1. Import necessary libraries: Import torch, then the required modules from datasets, transformers, and huggingface_hub.

  2. Load and preprocess the dataset: Load the IMDB dataset with the load_dataset function from datasets, then shuffle it and select a small subset (1,000 training and 100 test examples).

  3. Initialize the tokenizer: Load the pre-trained distilbert-base-uncased tokenizer from transformers.

  4. Define a preprocessing function: Create a function that tokenizes the text with the loaded tokenizer, truncating inputs that are too long for the model.

  5. Tokenize the datasets: Apply the preprocessing function to the training and test subsets in batches using the map method.

  6. Prepare the data collator: Initialize DataCollatorWithPadding with the tokenizer so that each batch is padded to the length of its longest sequence.

  7. Load the pre-trained model: Load distilbert-base-uncased from transformers with a sequence classification head for two labels.

  8. Define the metrics function: Create a function that computes accuracy and F1 score using load_metric from datasets (deprecated in newer releases in favor of the evaluate library).

  9. Log in to the Hugging Face Hub: Use notebook_login to authenticate.

  10. Set up the training arguments: Configure the learning rate, batch size, number of epochs, weight decay, and saving strategy using TrainingArguments.

  11. Initialize the Trainer: Create a Trainer object with the model, training arguments, datasets, tokenizer, data collator, and metrics function.

  12. Train the model: Call the train method on the Trainer object to start training.

  13. Evaluate the model: Call the evaluate method on the Trainer object to score the model on the test subset.

  14. Push the model to the Hugging Face Hub: Use the push_to_hub method to upload the trained model.

  15. Load and use the model pipeline: Initialize a pipeline with the trained model from the Hub and use it to make predictions on new text.
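
Step 6 relies on DataCollatorWithPadding, which by default pads each batch only up to its longest sequence instead of a global maximum. A simplified pure-Python sketch of that idea follows. It is not the library's implementation: pad_batch is a hypothetical helper, the token IDs are invented, and a padding ID of 0 is assumed for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized reviews of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short reviews from being stretched to the length of the longest review in the whole dataset, which saves computation.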
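
Step 8 computes accuracy and F1 from the model's logits. To make explicit what those two numbers mean, here is a small pure-Python sketch that reproduces the calculation for binary labels without a metric library. The logits and labels are made-up values, and argmax, accuracy, and f1_binary are illustrative helpers rather than functions from datasets or transformers. Label 1 is assumed to be the positive class.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def f1_binary(predictions, references, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented logits for five reviews: one [negative, positive] pair per review
logits = [[0.2, 1.5], [2.0, -1.0], [0.3, 0.9], [1.2, 0.1], [-0.5, 0.4]]
labels = [1, 0, 0, 0, 1]

predictions = [argmax(row) for row in logits]
acc = accuracy(predictions, labels)
f1 = f1_binary(predictions, labels)
print(predictions, acc, f1)
```

In this toy batch, the third review is a false positive, so four of the five predictions are correct, precision is 2/3, and recall is 1.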
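
Step 10 fixes the batch size and the number of epochs, which together with the dataset size determine how many optimizer steps training runs for. The arithmetic below is a sketch assuming a single device and no gradient accumulation (neither is configured in the code above). The function name total_training_steps is hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, so round up
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Values taken from the code above: 1,000 training examples, batch size 16, 2 epochs
steps = total_training_steps(num_examples=1000, batch_size=16, num_epochs=2)
print(steps)  # 63 steps per epoch * 2 epochs = 126
```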