Built a multiclass classification model to predict the popularity of songs on Spotify. The predictions are intended for the recommendation engine, which recommends songs to listeners based on what is popular and on their own taste.
3. Problem Statement
Spotify is one of the most popular platforms for listening to songs and
podcasts.
Spotify uses a recommendation engine to recommend tracks to the listener in the
Discover Weekly section according to the listener's preferences and track popularity.
Of the two factors, track popularity is the more important one for the
recommendation engine, because it also reflects the general preferences of
listeners, which depend on several variables and on the period in which the
track was at its peak.
4. Dataset Description
114000 rows and 21 columns
Target Variable
Popularity – This measure differs between previously released songs and
newly released songs because Spotify reshuffles it according to monthly
listeners. It is a multiclass variable with 3 categories.
Only the variables that affect the target variable are used. The other
variables are not taken into consideration.
5. Key variables
Independent Variables (continuous and categorical)
•Valence% – The positiveness of the song. Higher values sound cheerful
and euphoric; lower values sound depressing and sad.
•Danceability% – How suitable the song is for dancing.
•Energy% – The amount of energy the song has.
•Acousticness% – Measures whether the song uses natural (acoustic)
instruments or electronically produced sound.
•Key – The musical key of the track, encoded as 0=C, 1=C#,
and so on. There are 12 keys in total.
•Tempo – The speed of the song. The higher the tempo, the
faster the song, and vice versa.
•Duration – The length of the song in seconds.
•Speechiness – The amount of vocals/voices present in the
song.
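As a small illustration of the Key encoding described above, the sketch below maps the integer key to a note name using standard pitch-class notation. The helper name key_to_note is hypothetical, and treating values outside 0-11 as "unknown" is an assumption rather than something stated in the dataset description.

```python
# Standard pitch-class notation: 0 = C, 1 = C#, ..., 11 = B.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def key_to_note(key: int) -> str:
    """Return the note name for an integer key (hypothetical helper)."""
    # Assumption: anything outside 0-11 is treated as an unknown key.
    if 0 <= key < len(NOTE_NAMES):
        return NOTE_NAMES[key]
    return "unknown"
```

For example, key_to_note(1) gives "C#", matching the 1=C# encoding in the list.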
7. EDA Report
• Duplicate values, null values and typos were present in the data.
• The data contains large outliers, which were treated by converting the
values into categories while keeping the classes balanced.
• Feature engineering such as clubbing, binning and rounding was applied
to reduce the number of classes in the data.
• The target variable “Popularity” was originally a score from 0 to 100.
However, the original data description frames it as a classification problem,
so the target was converted from a regression target to a multiclass one.
• The target variable was imbalanced. An oversampling technique was used to
balance the classes.
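The report does not say which oversampling technique was used. As one possible reading, the sketch below shows plain random oversampling: rows from the smaller classes are duplicated at random until every class is as large as the biggest one. The function name, the seed parameter and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original row, then sample extra rows with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data, made up for illustration (not taken from the Spotify dataset).
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
labels = ["low", "low", "low", "low", "high", "zero"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used to measure accuracy.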
8. Conversion of the target variable from a percentage into three categories
In the histogram below, we can see that the target column has a peak at 0, which represents tracks with no
popularity. Because this peak would affect the accuracy of the model, it is assigned its own class.
The new classes are ‘zero popularity’, ‘low popularity’ and ‘high popularity’.
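The conversion can be sketched as a simple mapping from the 0-100 score to the three class names. The report does not state where the boundary between low and high popularity was drawn, so the cut-off of 50 below is purely a placeholder assumption, as is the function name.

```python
# Hypothetical cut-off between "low" and "high"; the report does not give the value used.
HIGH_POPULARITY_CUTOFF = 50


def popularity_class(score: int, cutoff: int = HIGH_POPULARITY_CUTOFF) -> str:
    """Map a 0-100 popularity score to one of the three classes."""
    if not 0 <= score <= 100:
        raise ValueError(f"popularity must be in 0-100, got {score}")
    if score == 0:
        return "zero popularity"  # the peak at 0 gets its own class
    if score < cutoff:
        return "low popularity"
    return "high popularity"


example = [popularity_class(s) for s in (0, 12, 50, 87)]
```

Keeping the cut-off as a parameter makes it easy to try different boundaries and check how balanced the resulting classes are.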
9. Algorithms report
Accuracy does not fluctuate much across the
different algorithms, which indicates
stable predictions.
Highest Accuracy = 85.94%
Lowest Accuracy = 79.7%
Algorithm-wise accuracy:
• Random Forest Classifier = 85.94%
• Decision Tree Classifier = 79.7%
• CatBoost Classifier = 80.3%
• XGBoost Classifier = 82.28%
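To put a number on how little the accuracy varies, the short sketch below takes the four reported accuracies (read as percentages, which is an assumption about the units) and computes the gap between the best and the worst model. The variable names are illustrative only.

```python
# Accuracies as reported in the list above, assumed to be percentages.
accuracies = {
    "Random Forest": 85.94,
    "Decision Tree": 79.7,
    "CatBoost": 80.3,
    "XGBoost": 82.28,
}

best = max(accuracies, key=accuracies.get)    # model with the highest accuracy
worst = min(accuracies, key=accuracies.get)   # model with the lowest accuracy
spread = round(accuracies[best] - accuracies[worst], 2)  # gap in percentage points
```

The gap between the best and the worst model is 6.24 percentage points.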
10. Key Findings
• The data contains 20 independent variables, but only 8 of them
affect the popularity of a song.
• Valence, danceability and energy together account for almost 50% of the effect on popularity.
• Song genre is one of the most important factors in an individual's preference or
taste in music, which the recommendation engine takes into account. The most popular genre is
Country-Specific, which consists of songs in the languages of individual countries; this suggests that people
prefer their mother tongue when it comes to songs. The next most popular genre is EDM
(Electronic Dance Music), because of its high valence, danceability and energy.
• Medium tempo, between 100 and 140 bpm, is twice as popular as any other tempo range.
This tempo is used in EDM, rock and pop music, which are among the most popular genres.
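The tempo finding relies on grouping tempo into ranges. A minimal sketch of such a binning is shown below; only the 100-140 bpm medium range comes from the findings, while the "slow" and "fast" labels, the inclusive boundaries and the function name are assumptions.

```python
def tempo_band(bpm: float) -> str:
    """Bin a tempo in beats per minute into a coarse band."""
    if bpm < 0:
        raise ValueError("tempo cannot be negative")
    if bpm < 100:
        return "slow"    # assumed label for tempos below the medium range
    if bpm <= 140:
        return "medium"  # 100-140 bpm, the range named in the findings
    return "fast"        # assumed label for tempos above the medium range


bands = [tempo_band(t) for t in (72.0, 100.0, 128.0, 140.0, 174.0)]
```

Counting popular tracks per band is then enough to compare the medium range against the others.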
11. The importance of each column in relation to popularity
The figure on the left shows how much each feature
affects the target column.
Valence + danceability + energy
16.7% + 16.0% + 15.5% = 48.2%
Using only the first 8 columns gives the same
accuracy as using all 20 columns.
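The arithmetic for the top three features can be checked with a few lines. The sketch assumes the importance shares are normalised to sum to 100% across all features, which is typical for tree-based feature importances but is not stated in the report.

```python
# Importance shares (in percent) quoted above for the top three features.
top_three = {"valence": 16.7, "danceability": 16.0, "energy": 15.5}

combined = round(sum(top_three.values()), 1)  # share carried by the top three
remaining = round(100 - combined, 1)          # assumed share left for all other features
```

Under that assumption, the other features together carry the remaining 51.8%.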
12. Conclusion
The dataset was somewhat complicated because it was difficult to establish the
relationship between the target variable and the independent variables. However, with
some cleaning and feature engineering, the final model was stable and had high accuracy.
The hardest part was that popularity is affected by the release date of the song,
and the release date was not available in the data, so it looked
like a case of endogeneity. Nonetheless, this was resolved after separating the target
variable into classes.
As a Data Scientist, I conclude that the model trained on this dataset
predicts accurately and is ready for deployment in the Spotify recommendation engine,
where it can predict popularity and help recommend the right tracks to listeners.