Advanced DeBERTa-Powered Language Identifier using Gradio

The goal of this project is a robust and efficient language identification system that accurately detects the language of a given text input. A pretrained DeBERTa model is fine-tuned for the language-identification task, enabling it to distinguish between multiple languages with high accuracy. The system is designed to be easily deployable: users interact with it through a simple web interface, entering text and receiving the predicted language as output. The project demonstrates state-of-the-art natural language processing techniques while also providing a practical tool for real-world language detection needs.

Here’s a breakdown of everything I did:

Imported Libraries: I imported libraries like pandas, numpy, matplotlib, seaborn, torch, and transformers for data manipulation, visualization, and model training.

Loaded and Inspected Data: I loaded the dataset from dataset.csv, checked for duplicates, and handled missing values. Then, I visualized the distribution of languages using charts.
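The loading and cleaning step above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the column names `Text` and `Language` and the file name `dataset.csv` are assumptions (the file name comes from the write-up, the columns do not).

```python
import pandas as pd

def load_and_clean(path="dataset.csv"):
    # Column names "Text"/"Language" are assumed, not confirmed by the write-up.
    df = pd.read_csv(path)
    df = df.drop_duplicates()                     # remove exact duplicate rows
    df = df.dropna(subset=["Text", "Language"])   # drop rows with missing values
    return df.reset_index(drop=True)

def language_counts(df):
    # Per-language row counts; these feed the distribution charts
    # (e.g. a seaborn/matplotlib bar plot).
    return df["Language"].value_counts()
```

A quick `language_counts(df).plot(kind="bar")` is one way to visualize the class balance before training.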

Prepared Data: I created the LanguageIdentificationDataset class to preprocess text data using the DeBERTa tokenizer. I also defined the CustomDebertaModel class, which builds on DeBERTa with additional layers.

Split Data and Created DataLoaders: I encoded language labels, split the data into training and validation sets, and created DataLoader instances for batch processing.
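A sketch of the dataset class and the split/DataLoader step might look like the following. The class name mirrors the write-up; the `max_length` of 128, batch size, 80/20 split, and column names are all assumptions, and the tokenizer is any Hugging Face-style callable (e.g. a DeBERTa tokenizer from `transformers`).

```python
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

class LanguageIdentificationDataset(Dataset):
    """Wraps raw texts + integer labels; tokenizes lazily per item."""
    def __init__(self, texts, labels, tokenizer, max_length=128):  # max_length assumed
        self.texts, self.labels = list(texts), list(labels)
        self.tokenizer, self.max_length = tokenizer, max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True,
                             padding="max_length", max_length=self.max_length,
                             return_tensors="pt")
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": torch.tensor(self.labels[idx], dtype=torch.long)}

def make_loaders(df, tokenizer, batch_size=16):  # batch size assumed
    encoder = LabelEncoder()
    labels = encoder.fit_transform(df["Language"])     # column name assumed
    X_tr, X_va, y_tr, y_va = train_test_split(
        df["Text"], labels, test_size=0.2, stratify=labels, random_state=42)
    train_ds = LanguageIdentificationDataset(X_tr, y_tr, tokenizer)
    val_ds = LanguageIdentificationDataset(X_va, y_va, tokenizer)
    return (DataLoader(train_ds, batch_size=batch_size, shuffle=True),
            DataLoader(val_ds, batch_size=batch_size),
            encoder)
```

Keeping the `LabelEncoder` around matters: its `classes_` array is what maps predicted indices back to language names at inference time.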

Trained the Model: I set up the train_model function to train the CustomDebertaModel using the Trainer from the transformers library. I specified the training arguments and defined a metrics-computation function.
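The training setup can be sketched like this. The hyperparameters (epochs, batch sizes, logging interval) are assumptions, and the `TrainingArguments` field names follow a recent `transformers` release; the `transformers` import is kept inside the function so the metric helper stands alone.

```python
import numpy as np

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels); accuracy is the headline metric here.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

def train_model(model, train_ds, val_ds, output_dir="./results"):
    # Imported lazily so the sketch above is usable without transformers installed.
    from transformers import Trainer, TrainingArguments
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # assumed hyperparameters
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        evaluation_strategy="epoch",
        logging_steps=50,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      compute_metrics=compute_metrics)
    trainer.train()
    return trainer
```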

Ran Main Function: I ran the main function to load the dataset, initialize the tokenizer, prepare the data, train the model, and save the model state and tokenizer. I also saved the label encoder classes.
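The save step at the end of main can be sketched as below. The output directory layout and file names (`model_state.pt`, `label_classes.npy`) are hypothetical; the write-up only says that the model state, tokenizer, and label encoder classes were saved.

```python
import os
import numpy as np
import torch

def save_artifacts(model, tokenizer, label_encoder, out_dir="saved_model"):
    # Hypothetical layout: weights, tokenizer files, and label classes side by side.
    os.makedirs(out_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(out_dir, "model_state.pt"))
    tokenizer.save_pretrained(out_dir)   # writes vocab/config files
    np.save(os.path.join(out_dir, "label_classes.npy"), label_encoder.classes_)
```

Saving `classes_` alongside the weights is what lets the evaluation script turn argmax indices back into language names.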

Evaluated the Model: I loaded the saved model and tokenizer, then ran predictions on sample texts. I decoded the predictions to determine the language of each text.
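The prediction-and-decoding step can be sketched as follows. This assumes the custom model's forward pass returns logits directly; a stock `transformers` classification model would instead return an output object whose `.logits` field you'd read.

```python
import torch

def predict_language(texts, model, tokenizer, label_classes, max_length=128):
    # Batch-tokenize, run the model without gradients, and decode
    # argmax indices back to language names via the saved label classes.
    model.eval()
    enc = tokenizer(list(texts), truncation=True, padding=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc)   # assumed to be raw logits (see note above)
    preds = torch.argmax(logits, dim=-1).tolist()
    return [label_classes[i] for i in preds]
```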

Created Gradio Interface: I set up a Gradio interface to make the model accessible. The interface allows users to input text, which the model then processes to predict the language.
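A minimal version of that interface might look like this. The interface title, the `"text"` input/output shortcuts, and the single-text prediction wrapper are assumptions; Gradio is imported lazily inside the launcher.

```python
def make_predict_fn(model, tokenizer, label_classes, max_length=128):
    # Builds the single-text function that the Gradio Interface wraps.
    import torch
    def predict(text):
        enc = tokenizer([text], truncation=True, padding=True,
                        max_length=max_length, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc)   # assumed to return raw logits
        return label_classes[int(torch.argmax(logits, dim=-1))]
    return predict

def launch_demo(predict_fn):
    # A minimal text-in / label-out Gradio app around the prediction function.
    import gradio as gr
    demo = gr.Interface(fn=predict_fn, inputs="text", outputs="text",
                        title="DeBERTa Language Identifier")
    demo.launch()
```

Calling `launch_demo(make_predict_fn(model, tokenizer, label_classes))` starts a local web UI where a user types a sentence and gets back the predicted language.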