Music-Genre-Classification

Classifying audio files using ML algorithms.

This project was made as part of the Machine Learning course at IIIT-Delhi. Link to blog

Dataset Description

We use the GTZAN dataset, which contains a total of 1000 audio files in .wav format divided into 10 genres, with 100 songs of 30-second duration per genre. Along with the audio files come two CSV files containing features of the audio: one holds the mean and variance of multiple features computed over each 30-second track, and the other has the same composition but with each song split into 3-second clips.
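As a rough sketch of how these feature files can be loaded (assuming pandas; the file and column names below are assumptions based on the usual GTZAN feature CSVs, not taken from the repository):

```python
# Minimal sketch: load the two GTZAN feature CSVs with pandas.
# File names and the 'label' column are assumptions.
import pandas as pd

df_30s = pd.read_csv("features_30_sec.csv")   # one row per 30-second track
df_3s = pd.read_csv("features_3_sec.csv")     # one row per 3-second clip

print(df_30s.shape, df_3s.shape)
print(df_30s["label"].value_counts())         # genre distribution (100 tracks per genre)
```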

Genres

The 10 genres are blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock.

Audio signal feature extraction:

We convert every audio file into a signal sampled at a fixed sampling rate in order to analyse its characteristics. Every waveform has its features in two forms:

  • Time domain - not much information about music quality can be extracted and explored, apart from visual distinctions between the waveforms
  • Frequency domain, obtained after a Fourier transform, of two types: spectral features and rhythm features

Spectral

MFCC and Spectrogram plots

Rhythm features

The MFCC and rhythm feature plots provide matrix-based information about the unique features. Both features are mapped against the duration of the music file.
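A minimal sketch of this extraction step, assuming librosa is used for loading the audio and computing the spectral and rhythm features (the repository's exact extraction code is not shown here):

```python
# Sketch of per-file feature extraction with librosa (assumed library).
import librosa
import numpy as np

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)                          # time-domain signal at a fixed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)         # spectral: MFCC matrix (20 x frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral centroid per frame
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)             # rhythm: estimated tempo
    # Summarise each feature by its mean and variance, as in the GTZAN feature CSVs.
    return np.hstack([mfcc.mean(axis=1), mfcc.var(axis=1),
                      centroid.mean(), centroid.var(), tempo])
```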

Preprocessing

After feature extraction, no columns contained null values, so no extra values had to be imputed. Why is it important to preprocess the data?

  • The variables are transformed to the same scale,
  • so that all continuous variables contribute equally
  • and no feature biases the results.
  • PCA is very sensitive to the variances of the initial variables: if there is a large difference in range between features, the one with the larger range will dominate.
  • The boxplots of each feature show that some features have very large differences in their variances.
  • PCA with both normalisation (MinMaxScaler) and standardisation (StandardScaler) was performed and the difference noted (see the sketch below).
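A short sketch of that comparison with scikit-learn, reusing the df_3s frame from the loading sketch above (the dropped column names are assumptions):

```python
# Sketch: compare 2-component PCA after standardisation vs. normalisation.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

X = df_3s.drop(columns=["filename", "label"]).values   # assumed non-numeric columns
y = df_3s["label"].values

for scaler in (StandardScaler(), MinMaxScaler()):
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(scaler.fit_transform(X))
    print(type(scaler).__name__, pca.explained_variance_ratio_)
```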

Methodology

Feature extraction -> correlation matrix -> PCA

  • With the 30-second samples
  • With the 3-second samples
  • Fewer outliers / less variance for some classes were found in the principal components:

Inferences up to this step:

  • pca.explained_variance_ratio_ = [0.20054986, 0.13542712], i.e. PC1 explains about 20% of the variance and PC2 about 13.5%.
  • Big clusters of metal, rock, pop, reggae, and classical can be seen.
  • Jazz and country are separable to some extent.
  • Hip-hop, disco, and blues are very dispersed, and no clear clusters can be seen.
  • The majority of the classes are easily separable.
  • We decided to proceed to the modelling phase using the 3-second sampled feature set with standardisation, as it aggregated the genres into more linearly separable clusters than normalisation.

Classification:

Logistic Regression

This model is a predictive analysis algorithm based on the concept of probability. GridSearchCV was used to pass all combinations of hyperparameters one by one into the model and the best parameters were selected.
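A minimal sketch of that search with scikit-learn, reusing X and y from the PCA sketch above (the train/test split and the parameter grid are assumptions):

```python
# Sketch: grid search over logistic regression hyperparameters.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X_std = StandardScaler().fit_transform(X)              # standardised 3-second features, as chosen above
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, stratify=y, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "saga"]}   # assumed grid
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```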

Without Hyperparameter tuning:

Confusion matrix and ROC curve for the logistic regression base model.

Metric          Value
Accuracy score  0.67267
Precision       0.74126
Recall          0.74098

Using Hyperparameter tuning:

Confusion matrix and ROC curve for logistic regression after hyperparameter tuning.

Metric          Value
Accuracy score  0.70504
Precision       0.70324
Recall          0.71873

SGD Classifier

We took an SGD classifier as the baseline model and performed hyperparameter tuning for better performance, though the difference wasn't that great even after tuning.
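A small sketch of the baseline and the tuned version (the parameter grid is an assumption; the train/test split is reused from the logistic regression sketch above):

```python
# Sketch: SGD classifier baseline vs. a small grid search.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

base = SGDClassifier(random_state=0).fit(X_train, y_train)
grid = GridSearchCV(SGDClassifier(random_state=0),
                    {"loss": ["hinge", "log_loss"],     # "log_loss" in scikit-learn >= 1.1
                     "alpha": [1e-4, 1e-3, 1e-2]},
                    cv=5)
grid.fit(X_train, y_train)
print(base.score(X_test, y_test), grid.best_params_, grid.score(X_test, y_test))
```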

Without Hyperparameter tuning:

Metric          Value
Accuracy score  0.6126126126126126
Precision       0.6142479131341332
Recall          0.6172558275062101

With Hyperparameter tuning:

Metric          Value
Accuracy score  0.6441441441441441
Precision       0.6386137102787109
Recall          0.6421140902032518

Gaussian NB

We used a simple Naive Bayes classifier and a one-vs-rest Naive Bayes as baseline models, then used hyperparameter tuning to get better performance.
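A brief sketch of those baselines with scikit-learn (the var_smoothing grid is an assumption; the data split is reused from the earlier sketch):

```python
# Sketch: simple Naive Bayes, one-vs-rest Naive Bayes, and a small grid search.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

nb = GaussianNB().fit(X_train, y_train)                           # simple Naive Bayes
ovr_nb = OneVsRestClassifier(GaussianNB()).fit(X_train, y_train)  # one-vs-rest variant
grid = GridSearchCV(GaussianNB(), {"var_smoothing": np.logspace(-11, -5, 7)}, cv=5)
grid.fit(X_train, y_train)
print(nb.score(X_test, y_test), ovr_nb.score(X_test, y_test), grid.score(X_test, y_test))
```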

Without Hyperparameter tuning:

Metric          Value
Accuracy score  0.48598598598598597
Precision       0.4761542269197442
Recall          0.4902979078811803

With Hyperparameter tuning:

Metric          Value
Accuracy score  0.5155155155155156
Precision       0.49864157768533374
Recall          0.5050696700999591

KNN

This model clearly outperformed the Gaussian NB models. As we can see, after hyperparameter tuning the correlation between the features decreased, and some features even had zero correlation.

Without Hyperparameter tuning:

Metric          Value
Accuracy score  0.8603603603603603
Precision       0.8594536380364758
Recall          0.8583135066852872

Using hyperparameter tuning :

Metric          Value
Accuracy score  0.9059059059059059
Precision       0.9073617032054686
Recall          0.905944266718195

Best params: {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
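A one-line sketch of the tuned model using those best parameters (data split reused from the earlier sketches):

```python
# Sketch: KNN with the reported best parameters.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(metric="manhattan", n_neighbors=1, weights="uniform")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```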

Decision Trees

  • Took a decision tree as the baseline model, which didn't give great results, with accuracy around 64%.

Metric          Value
Accuracy score  0.637758505670447
Precision       0.6396387192624916
Recall          0.6376582879474517
  • Used AdaBoost, which reduced the performance (especially for rock, pop, and disco).

Metric           Value
Best parameters  n_estimators=100
Accuracy score   0.5010006671114076
Precision        0.48730102839842837
Recall           0.4992406459587978
  • Then used gradient boosting, which increased the accuracy substantially.

Metric           Value
Best parameters  n_estimators=100
Accuracy score   0.8238825883922615
Precision        0.8266806080093154
Recall           0.8232200760446549
  • CatBoost had a high AUC for all genres, unlike gradient boosting, which had low accuracy for some genres.

  • CatBoost outperformed the other ensemble methods (see the sketch after the table below). Gradient boosting was close with 82% accuracy, while the rest were in the 50-60% range.
Metric           Value
Best parameters  loss function: "Multiclass"
Accuracy score   0.8972648432288192
Precision        0.8979267969111706
Recall           0.8972734276109252
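A compact sketch of the boosted tree models compared above (CatBoost's loss function as reported; the other arguments and the data split are assumptions carried over from the earlier sketches):

```python
# Sketch: AdaBoost, gradient boosting, and CatBoost on the same split.
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "CatBoost": CatBoostClassifier(loss_function="MultiClass", verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```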

Random Forest

  • As shown here, RF had around 80% accuracy, but XGBoost's random-forest boosting reduced the accuracy to 75%.

Metric           Value
Best parameters  n_estimators=1000, max_depth=10
Accuracy score   0.8038692461641094
Precision        0.805947955999254
Recall           0.8026467091527609
  • Cross gradient boosting on the random forest reduced the accuracy, and it reduced precision and recall to a large extent as well (see the sketch after the table below).

Metric           Value
Best parameters  objective='multi:softmax'
Accuracy score   0.7505003335557038
Precision        0.7593347049139745
Recall           0.7494976488750396
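A minimal sketch of the two models with the reported parameters, assuming xgboost's XGBRFClassifier is what "cross gradient boosting on random forest" refers to (the label encoding and the data split are assumptions):

```python
# Sketch: random forest vs. XGBoost's random-forest variant.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRFClassifier

rf = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)
rf.fit(X_train, y_train)

le = LabelEncoder()                                   # XGBoost expects integer class labels
xgb_rf = XGBRFClassifier(objective="multi:softmax", random_state=0)
xgb_rf.fit(X_train, le.fit_transform(y_train))

print(rf.score(X_test, y_test), xgb_rf.score(X_test, le.transform(y_test)))
```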

XGB Classifier

  • The correlation matrix shows very little correlation among the variables.

  • The best-performing model among all DT and RF models: every genre was classified with at least 85% accuracy.
  • Genres like classical and hip-hop even reached 100% accuracy.
  • XGBoost improves upon the basic gradient boosting framework through systems optimization and algorithmic enhancements.
  • Evaluation (see the sketch after the table):
Metric           Value
Best parameters  learning rate: 0.05, n_estimators: 1000
Accuracy score   0.9072715143428952
Precision        0.9080431364823143
Recall           0.9072401472896423
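A sketch of the tuned XGBoost classifier with the reported best parameters (label encoding and the split carried over from the earlier sketches):

```python
# Sketch: XGBoost classifier with the reported learning rate and tree count.
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

le = LabelEncoder()
xgb = XGBClassifier(learning_rate=0.05, n_estimators=1000)
xgb.fit(X_train, le.fit_transform(y_train))           # integer labels for XGBoost
print(xgb.score(X_test, le.transform(y_test)))
```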

MLP

This model is an artificial neural network with multiple layers, each containing a considerable number of activation neurons. The initial training used random values for the hyperparameters, except for the activation function. This run revealed overfitting in the data for the different activation functions:

Activation  Training Accuracy   Testing Accuracy
relu        0.9887142777442932  0.5206666588783264
sigmoid     0.941428542137146   0.4970000088214874
tanh        0.9997143149375916  0.49266666173934937
softplus    0.9991428852081299  0.5583333373069763

From the following graph, we choose softplus as the best activation function, with softmax fixed for the output layer.
Looking at the graph, we see a very high variance between training and testing accuracy, so the model is overfitting. In fact, the testing loss starts to increase, which indicates a high cross-entropy loss; this will be dealt with later. For now, softplus, relu, and sigmoid all perform similarly on the training and testing sets, so we go with softplus since it gives slightly lower variance than the others.

Hyperparameter tuning has been done manually by manipulating the following metrics:

  • Learning rate
    activation = softmax
    no. of hidden layers = 3; neurons in each = [512, 256, 64]
    activation of output layer is fixed to be softmax
    epochs = 100

Learning Rate  Training Accuracy   Testing Accuracy
0.01           0.4044285714626312  0.335999995470047
0.001          0.9888571500778198  0.5666666626930237
0.0001         0.9684285521507263  0.5513333082199097
0.00001        0.7134285569190979  0.4996666610240936

From the above graphs, we see that 0.01 definitely results in over-convergence and bouncing, as reflected in the accuracy graph. 0.001 has very high variance, and the loss increases marginally with low accuracy, so it isn't appropriate either.

The best choice for alpha is either 0.0001 or 0.00001.
0.00001 has relatively low variance and the loss converges quickly with epochs, but the accuracy on the training and testing sets is pretty low.
0.0001 has better performance, but the variance is very high.

  • No. of hidden layers
    activation = softmax
    learning rate = 0.0001
    activation of output layer is fixed to be softmax
    epochs = 100

Number of layers  Training Accuracy   Testing Accuracy
2                 0.9782857298851013  0.5383333563804626
3                 0.9869999885559082  0.5443333387374878
4                 0.9921428561210632  0.5506666898727417

In conclusion, increasing or decreasing the number of layers has no effect on the variance. This is because we have too many neurons per layer, so we take 3 layers and reduce the number of neurons.

  • Number of neurons
    activation = softmax
    learning rate = 0.0001
    number of layers = 3
    activation of output layer is fixed to be softmax
    epochs = 100
    drop out probability = 0.3
    alpha = 0.001

Number of neurons  Training Accuracy   Testing Accuracy
[512, 256, 128]    0.9984285831451416  0.563666641712188
[256, 128, 64]     0.915142834186554   0.5149999856948853
[180, 90, 30]      0.7991428375244141  0.503000020980835
[128, 64, 32]      0.6991428732872009  0.4900000095367431

Now, for the same neuron sets, we apply regularization and neuron dropout to check for any change in the variance, both for a high number of neurons and while reducing the number of neurons.

  • Regularization and dropout

Number of neurons  Training Accuracy    Testing Accuracy
[512, 256, 128]    0.6759999990463257   0.5830000042915344
[256, 128, 64]     0.5278571248054504   0.5189999938011169
[180, 90, 30]      0.43642857670783997  0.4629999995231628
[128, 64, 32]      0.386428564786911    0.4203333258628845

So, in conclusion, if we have a high number of neurons per layer, then applying regularization techniques increases the accuracy and decreases the overall variance. If we do not apply any regularization techniques, we can use a moderate number of neurons to get a decent accuracy on the training and testing sets with lower variance.

For our purposes, we select a high number of neurons per layer with regularization.

Final MLP model

From all our analysis and extra experimentation, we conclude our model with the following settings:

  • activation : softmax
  • learning rate : 0.0001
  • number of hidden layers = 3
  • number of neurons in each layer = [512,256,128]
  • epochs = 100
  • regularization and dropout enabled

Precision on the model : 0.5774000692307671
Recall on the model : 0.583
F1score on the model : 0.5801865223684216
Accuracy on the model : 0.6130000042915345

Even after hyperparameter tuning, the best accuracy is just above 60%. The reason is overfitting and underperformance due to the inability to pick up each feature: this creates excellent accuracy on the training set but always misses out on the testing set.
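For reference, a minimal sketch of this final MLP, assuming a Keras/TensorFlow implementation (the framework is not named above). Softplus is used for the hidden layers as selected in the activation comparison, softmax for the output, dropout 0.3, and L2 regularization mapped to the alpha = 0.001 value listed earlier; all of these mappings are assumptions.

```python
# Sketch of the final MLP configuration described above (Keras assumed).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_mlp(n_features, n_classes=10):
    reg = regularizers.l2(0.001)                        # assumed to correspond to alpha = 0.001
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(512, activation="softplus", kernel_regularizer=reg),
        layers.Dropout(0.3),
        layers.Dense(256, activation="softplus", kernel_regularizer=reg),
        layers.Dropout(0.3),
        layers.Dense(128, activation="softplus", kernel_regularizer=reg),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),  # output layer fixed to softmax
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```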

SVM

This model outperformed every other model and gave the best accuracy. Manual hyperparameter tuning was done, and linear, polynomial, and RBF kernels were compared using confusion matrices (a code sketch follows the kernel results below).

Best Linear Kernel Model:

Best SVM linear kernel - plot of the classification report.

Metric           Value
Best parameters  C=1.0, kernel='linear', random_state=0
Accuracy score   0.70672342343265456
Precision        0.7180431364823143
Recall           0.71234655872896242

Best Polynomial Kernel Model:

Best SVM polynomial kernel of degree 7 - plot of the classification report.

Metric           Value
Best parameters  C=1.0, kernel='poly', degree=7
Accuracy score   0.88242715143428952
Precision        0.8780431364823143
Recall           0.87035601472896557

Best RBF Kernel Model:

Best SVM RBF kernel (C=200, gamma=4) - plot of the classification report.

Metric           Value
Best parameters  C=200, kernel='rbf', gamma=4
Accuracy score   0.9424715143428952
Precision        0.939297323879391
Recall           0.9372401472896423
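A compact sketch of the three kernels with the best parameters reported above (data split reused from the earlier sketches):

```python
# Sketch: compare the best linear, polynomial, and RBF SVMs.
from sklearn.svm import SVC

kernels = {
    "linear": SVC(C=1.0, kernel="linear", random_state=0),
    "poly (degree 7)": SVC(C=1.0, kernel="poly", degree=7),
    "rbf (C=200, gamma=4)": SVC(C=200, kernel="rbf", gamma=4),
}
for name, model in kernels.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```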

Conclusions:

  • SVMs performed the best among all classifiers, with 94% accuracy.
  • The Gaussian (RBF) kernel outperformed the polynomial kernel in almost all iterations.
  • XGB classifiers were the best among all ensembling methods, with 90% accuracy.
  • Since the genre classes were balanced, the tradeoff between precision and recall was less pronounced.
  • Among the KNN, DT, and ensemble classifiers, precision was higher than recall,
  • while for LR, SGD, NB, MLP, and SVM, recall was higher than precision.
