Fine-tuning DistilBERT for Sentiment Analysis
!pip install datasets transformers huggingface_hub evaluate
!pip install "accelerate>=0.20.1"  # quote the spec so the shell doesn't treat ">" as a redirect
import torch
from datasets import load_dataset
imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(100))
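As a quick sanity check (a minimal sketch; the exact review text you see depends on the shuffle), each example is a dict with a text field and a label field, where 0 is negative and 1 is positive:

print(imdb)                             # DatasetDict with train/test/unsupervised splits
print(small_train_dataset[0]["label"])  # 0 = negative, 1 = positive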
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
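After mapping, each example keeps its original text and label columns and gains the tokenizer outputs. DistilBERT's tokenizer produces input_ids and attention_mask (it has no token_type_ids):

print(tokenized_train.column_names)  # ['text', 'label', 'input_ids', 'attention_mask']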
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
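DataCollatorWithPadding pads each batch to the length of its longest member rather than to a fixed maximum, which saves compute on short reviews. A small illustration, assuming the two tokenized examples from above:

features = [
    {k: tokenized_train[i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, length of the longer of the two sequences)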
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
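Optionally (a variant not used in this tutorial), you can attach human-readable label names at load time; if you adopt it, the pipeline at the end will report "positive"/"negative" instead of the generic LABEL_1/LABEL_0:

# Hypothetical variant of the call above: same model, but with named labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)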
import numpy as np
import evaluate  # datasets.load_metric is deprecated/removed; the evaluate library replaces it

# Load the metrics once, rather than on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
from huggingface_hub import notebook_login
notebook_login()
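notebook_login only works inside a notebook; in a plain script (an alternative, not part of the original workflow), you can authenticate with login and a token from your Hugging Face account settings:

import os
from huggingface_hub import login

# Alternative for scripts: read a token exported as the HF_TOKEN environment variable.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)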
from transformers import TrainingArguments, Trainer
repo_name = "finetuning-sentiment-model-3000-samples"
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
trainer.push_to_hub()
from transformers import pipeline
sentiment_model = pipeline(model="Saahil1801/finetuning-sentiment-model-3000-samples")
sentiment_model(["This movie is awesome", "This movie is bad!"])
# LABEL_1 means positive and LABEL_0 means negative.
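Each input yields a dict with a label and a confidence score; a small sketch for printing them side by side (the actual scores depend on your training run):

reviews = ["This movie is awesome", "This movie is bad!"]
for review, result in zip(reviews, sentiment_model(reviews)):
    print(f"{review!r} -> {result['label']} (score={result['score']:.3f})")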
Code Explanation
Below is the code workflow without the installation commands.
1. Import necessary libraries: import torch and the required modules from datasets, transformers, and huggingface_hub.
2. Load and preprocess the dataset: load the IMDB dataset with the load_dataset function from datasets, then shuffle and select a small subset of the training and test splits.
3. Initialize the tokenizer: load a pre-trained tokenizer (distilbert-base-uncased) from transformers.
4. Define a preprocessing function: create a function that tokenizes the text data with the loaded tokenizer.
5. Tokenize the datasets: apply the preprocessing function to the training and test datasets with the map method.
6. Prepare the data collator: initialize DataCollatorWithPadding with the tokenizer.
7. Load the pre-trained model: load a pre-trained sequence-classification model (distilbert-base-uncased) from transformers.
8. Define the metrics function: create a function that computes accuracy and F1 with the evaluate library.
9. Log into the Hugging Face Hub: use notebook_login to authenticate.
10. Set up the training arguments: configure the learning rate, batch size, number of epochs, weight decay, and saving strategy with TrainingArguments.
11. Initialize the Trainer: create a Trainer object with the model, training arguments, datasets, tokenizer, data collator, and metrics function.
12. Train the model: call the train method on the Trainer object to start training.
13. Evaluate the model: call the evaluate method on the Trainer object to score the model on the test dataset.
14. Push the model to the Hugging Face Hub: upload the trained model with the push_to_hub method.
15. Load and use the model pipeline: initialize a sentiment-analysis pipeline with the trained model and use it to make predictions on new text.