Audio Classification on Environmental Sounds by Analysing Mel-Spectrograms with a CNN
This project builds an audio classification model using a convolutional neural network (CNN) to classify sound clips from the ESC-50 dataset. The dataset contains 50 environmental sound categories, such as animal noises, natural sounds, and human activities. The objective is to classify these sounds accurately by combining audio processing techniques with deep learning.
How It Works
I installed necessary packages like PyTorch for model building, librosa for audio processing, and matplotlib for visualization.
A random seed was set to make the results reproducible, keeping data splitting, shuffling, and model initialization consistent across runs.
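Seeding every random number generator in play is the usual way to achieve this. A minimal sketch (the `set_seed` helper name and the default seed value are assumptions, not the project's exact code):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so data splits, shuffling,
    and weight initialization are the same on every run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable


set_seed(42)
```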
I loaded the ESC-50 metadata, which maps each audio file to its category, and explored the dataset by visualizing the distribution of samples per category and plotting waveforms, spectrograms, and Mel-spectrograms of selected audio samples.
To enhance the diversity of the dataset and improve model robustness, I implemented several data augmentation techniques:
- Adding noise to the audio samples.
- Time-shifting the audio clips to simulate temporal variations.
- Time-stretching to modify the playback speed of the audio.
- Pitch-shifting to alter the pitch of the audio samples.
These augmentations were applied to the original audio samples to create variations that mimic real-world conditions.
For each audio clip, I extracted Mel-spectrogram features, which represent the power of a sound signal in different frequency bands over time.
The audio features were reshaped into a format compatible with CNN input, with the Mel frequency bands and time frames forming the two spatial dimensions of a single-channel image. I then built a CNN architecture designed for audio classification: multiple convolutional layers followed by max-pooling and dropout layers to reduce overfitting, with final fully connected layers outputting the probabilities of each class.
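The architecture described above might look like the following sketch in PyTorch (the layer widths, kernel sizes, and dropout rates are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """Small CNN over single-channel Mel-spectrogram 'images'."""

    def __init__(self, n_classes: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse to (B, 64, 1, 1) regardless of input size
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                   # (B, 64, 1, 1)
        return self.classifier(x.flatten(1))   # (B, n_classes) raw logits
```

The adaptive pooling lets the same network accept spectrograms of varying lengths; the logits are turned into class probabilities by the softmax inside the loss function during training.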
The training process involved feeding the CNN batches of Mel-spectrograms and their corresponding labels.
I used the Cross-Entropy loss function for classification and the Adam optimizer to minimize the loss during training. The model was trained over multiple epochs, with periodic evaluation on the validation set to monitor performance.
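One epoch of this loop can be sketched as follows (the `train_one_epoch` helper and its signature are assumptions for illustration, not the project's exact code):

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one pass over the training set; return the mean cross-entropy loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss = 0.0
    for specs, labels in loader:
        specs, labels = specs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(specs), labels)  # logits vs. integer class labels
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * specs.size(0)
    return total_loss / len(loader.dataset)
```

The optimizer would be constructed once before the epoch loop, e.g. `torch.optim.Adam(model.parameters(), lr=1e-3)`, and a similar loop without gradient updates evaluates the validation set each epoch.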
Once the training process was completed and the model achieved satisfactory performance, the trained model was saved for future use. This allows for easy loading and inference on new audio samples.
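Saving and restoring typically follows the standard PyTorch state-dict pattern, sketched below (the tiny stand-in model and the `esc50_cnn.pt` filename are assumptions for illustration):

```python
import torch
import torch.nn as nn

# hypothetical minimal model standing in for the trained CNN
model = nn.Sequential(nn.Flatten(), nn.Linear(10, 50))

# save only the learned weights (the state_dict), not the whole object
torch.save(model.state_dict(), "esc50_cnn.pt")

# later: rebuild the same architecture, then load the weights for inference
restored = nn.Sequential(nn.Flatten(), nn.Linear(10, 50))
restored.load_state_dict(torch.load("esc50_cnn.pt", map_location="cpu"))
restored.eval()  # disable dropout / batch-norm updates before inference
```

Saving the state dict rather than the full pickled model keeps the checkpoint portable across code refactors, as long as the architecture can be reconstructed.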