Intro

It turns out Spotify has an outstanding API that connects you to its vast database of songs and their features. You can, for example, get visual insights into songs you love or integrate playback into your web application. There is also a powerful song search engine, as well as a recommendation system that helps you listen to more of what you love.

Prerequisites

Let’s start by signing up on the official Spotify website, which is free and takes little effort.

Next, open your application dashboard and hit the “Create an app” button. Input the necessary details and prepare to explore.

Grab your Client ID and Client Secret and start your favourite Python IDE. It’s time to code.

We are going to use Spotipy, a Python wrapper around the Spotify Web API, to make neat one-line requests instead of calling the endpoints explicitly. Let’s install it.

!pip install spotipy

We then initialise the spotipy.Spotify object with the Spotify developer’s credentials, stored in the variables CLIENT_ID and CLIENT_SECRET.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
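
Hard-coded credentials are easy to leak when a notebook is shared. As a minimal sketch, assuming the credentials were exported as the environment variables SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET (the names Spotipy itself looks up when no arguments are passed), a small helper could fill CLIENT_ID and CLIENT_SECRET. The name load_credentials is hypothetical and not part of Spotipy.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    load_credentials is an illustrative helper, not part of Spotipy.
    """
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    # Report every missing or empty variable at once
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Calling CLIENT_ID, CLIENT_SECRET = load_credentials() before creating SpotifyClientCredentials keeps the secret out of the notebook itself.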

Fetch Tracks and Artists

The next step is querying the data. Note that we can only fetch information about 50 or fewer tracks per request, so we page through the results with the offset parameter. The q parameter of the sp.search() method is where you specify what to search for. The reference is here.

artist_name = []
track_name = []
track_popularity = []
artist_id = []
track_id = []
# Page through 1000 results, 50 tracks per request
for offset in range(0, 1000, 50):
    track_results = sp.search(q='year:2021', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        # Only the first listed artist of each track is kept
        artist_name.append(t['artists'][0]['name'])
        artist_id.append(t['artists'][0]['id'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        track_popularity.append(t['popularity'])
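
The paging logic can also be separated from the bookkeeping. The sketch below wraps it in a generator; iter_search_tracks is a hypothetical name, and the client argument is assumed to be any object with a search() method that behaves like sp.search() above.

```python
def iter_search_tracks(client, query, total=1000, page_size=50):
    """Yield track dicts from client.search(), one page at a time.

    Illustrative helper: `client` is assumed to expose search() with the
    same keyword arguments as the sp.search() call above.
    """
    for offset in range(0, total, page_size):
        page = client.search(q=query, type="track", limit=page_size, offset=offset)
        items = page["tracks"]["items"]
        if not items:
            break  # an empty page means there are no more results
        yield from items
```

With this in place, a loop such as for t in iter_search_tracks(sp, 'year:2021') visits the same tracks and stops early if the search runs dry.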

Put the queried data into a pandas DataFrame.

import pandas as pd
track_df = pd.DataFrame({'artist_name': artist_name, 'track_name': track_name, 'track_id': track_id, 'track_popularity': track_popularity, 'artist_id': artist_id})
print(track_df.shape)
track_df.head()
Example output of the previous cell

Let’s add information about the artists who perform each of the 1000 tracks.

artist_popularity = []
artist_genres = []
artist_followers = []
# One request per row of track_df
for a_id in track_df.artist_id:
    artist = sp.artist(a_id)
    artist_popularity.append(artist['popularity'])
    artist_genres.append(artist['genres'])
    artist_followers.append(artist['followers']['total'])

Now add it to the track_df data frame.

track_df = track_df.assign(artist_popularity=artist_popularity, artist_genres=artist_genres, artist_followers=artist_followers)
track_df.head()
Tracks and artists in a single data frame
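
The artist loop above sends one request per track, even though many tracks share the same artist. A minimal sketch of a cache that asks for each artist only once is shown here; fetch_artists_once is a hypothetical helper, and client is assumed to expose artist() like the sp object.

```python
def fetch_artists_once(client, artist_ids):
    """Map each distinct artist id to the dict returned by client.artist().

    Illustrative helper: repeated ids are looked up only once.
    """
    cache = {}
    for a_id in artist_ids:
        if a_id not in cache:  # skip artists that were already requested
            cache[a_id] = client.artist(a_id)
    return cache
```

The three lists can then be rebuilt from the cache, for example artist_popularity = [artists[a]['popularity'] for a in track_df.artist_id] after artists = fetch_artists_once(sp, track_df.artist_id).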

Fetch Tracks’ Numerical Features

We are now going to dive into a numerical analysis of our songs, but first, we need to fetch some data. Luckily, Spotify provides us with thorough insights about 82 million songs, which is just right for our purpose.

First, discover what features contribute to a track’s profile at the Spotify API reference page.

Second, fetch the tracks’ features and add them to the data frame.

track_features = []
for t_id in track_df['track_id']:
    # audio_features() returns a list with one entry per requested track (None if unavailable)
    af = sp.audio_features(t_id)
    track_features.extend(f for f in af if f is not None)
# DataFrame.append() was removed in pandas 2.0, so build the frame from the list of dicts
tf_df = pd.DataFrame(track_features, columns=['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
tf_df.head()

The tracks’ features data frame looks like this:

There are a few redundant columns which we will drop in the following cells
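
Fetching the features one track at a time means 1000 requests. As a sketch, and assuming the audio features endpoint accepts up to 100 track IDs per request (the limit I recall from the Spotify reference, so treat it as an assumption), the IDs can be sent in batches. fetch_audio_features is a hypothetical helper, and client stands for an object with an audio_features() method like sp.

```python
def fetch_audio_features(client, track_ids, batch_size=100):
    """Collect audio-feature dicts, requesting up to batch_size tracks per call.

    Illustrative helper; batch_size=100 is an assumed endpoint limit.
    """
    track_ids = list(track_ids)
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        result = client.audio_features(batch)
        # Tracks without features come back as None, so skip them
        features.extend(f for f in result if f is not None)
    return features
```

The returned list of dicts can be passed straight to pd.DataFrame() to build tf_df.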

Let’s drop a few useless columns and check the structure of our data frames:

cols_to_drop2 = ['key', 'mode', 'type', 'uri', 'track_href', 'analysis_url']
tf_df = tf_df.drop(columns=cols_to_drop2)
print(track_df.info())
print(tf_df.info())

The final step before data exploration and visualisation is setting the column types. This is done manually:

track_df['artist_name'] = track_df['artist_name'].astype("string")
track_df['track_name'] = track_df['track_name'].astype("string")
track_df['track_id'] = track_df['track_id'].astype("string")
track_df['artist_id'] = track_df['artist_id'].astype("string")
tf_df['duration_ms'] = pd.to_numeric(tf_df['duration_ms'])
tf_df['instrumentalness'] = pd.to_numeric(tf_df['instrumentalness'])
tf_df['time_signature'] = tf_df['time_signature'].astype("category")
print(track_df.info())
print(tf_df.info())

The resulting data frames have the following structure:

The boring stuff is behind us. Let’s get to know our data better.

Exploring the Trends of 2021

Looking for the most popular tracks of 2021? There you go:

# The 20 most popular tracks
track_df.sort_values(by=['track_popularity'], ascending=False)[['track_name', 'artist_name']].head(20)
# Artists ordered by follower count; astype(str) makes the genre lists hashable for drop_duplicates()
by_art_fol = pd.DataFrame(track_df.sort_values(by=['artist_followers'], ascending=False)[['artist_followers', 'artist_popularity', 'artist_name', 'artist_genres']])
by_art_fol.astype(str).drop_duplicates().head(20)

Let’s see which genres occur most often in the track_df data frame:

def to_1D(series):
    # Flatten a Series of lists into a single Series of items
    return pd.Series([x for _list in series for x in _list])
to_1D(track_df['artist_genres']).value_counts().head(20)
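
To see what to_1D followed by value_counts() does, here is the same idea in plain Python on a made-up sample: flatten the per-artist genre lists and count each genre. The function name count_genres and the sample data are illustrative only.

```python
from collections import Counter
from itertools import chain


def count_genres(genre_lists):
    """Count how often each genre appears across a sequence of genre lists."""
    return Counter(chain.from_iterable(genre_lists))


# Made-up sample: one list of genres per track's artist
sample = [["pop", "dance pop"], ["pop"], [], ["rap", "pop"]]
print(count_genres(sample).most_common(1))  # [('pop', 3)]
```

An artist with several genres contributes to every one of them, which is why the counts can add up to more than the number of tracks.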

Visualise the results above:

import matplotlib.pyplot as plt
# Count the genres once and plot the ten most frequent
genre_counts = to_1D(track_df['artist_genres']).value_counts()
fig, ax = plt.subplots(figsize=(14, 4))
ax.bar(genre_counts.index[:10], genre_counts.values[:10])
ax.set_ylabel("Frequency", size=12)
ax.set_title("Top genres", size=14)

Find the most-followed artist for each of the 20 most frequent genres (the list is called top_10_genres in the code, although it holds 20 entries):

# Despite its name, this list holds the 20 most frequent genres
top_10_genres = list(to_1D(track_df['artist_genres']).value_counts().index[:20])
top_artists_by_genre = []
for genre in top_10_genres:
    # by_art_fol is sorted by followers, so the first match is the most-followed artist
    for index, row in by_art_fol.iterrows():
        if genre in row['artist_genres']:
            top_artists_by_genre.append({'artist_name': row['artist_name'], 'artist_genre': genre})
            break
pd.json_normalize(top_artists_by_genre)
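
The nested loop relies on one idea: if the rows are already sorted, the first row that contains a genre is the best one for that genre. A plain-Python sketch of that idea follows; first_match_per_genre and the dict-based rows are illustrative stand-ins for the data frame rows.

```python
def first_match_per_genre(rows, genres):
    """For each genre, return the artist of the first row that lists it.

    rows must already be sorted so that the best candidate comes first.
    """
    winners = {}
    for genre in genres:
        match = next((row for row in rows if genre in row["artist_genres"]), None)
        if match is not None:  # genres with no matching row are left out
            winners[genre] = match["artist_name"]
    return winners
```

Because the search stops at the first hit, each genre contributes at most one row to the result.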

Find the most popular track for each of the 20 most frequent genres:

# Tracks ordered by popularity, most popular first
by_track_pop = pd.DataFrame(track_df.sort_values(by=['track_popularity'], ascending=False)[['track_popularity', 'track_name', 'artist_name', 'artist_genres', 'track_id']])
by_track_pop.astype(str).drop_duplicates().head(20)
top_songs_by_genre = []
for genre in top_10_genres:
    # The first match in the sorted frame is the most popular track of the genre
    for index, row in by_track_pop.iterrows():
        if genre in row['artist_genres']:
            top_songs_by_genre.append({'track_name': row['track_name'], 'track_popularity': row['track_popularity'], 'artist_name': row['artist_name'], 'artist_genre': genre})
            break
pd.json_normalize(top_songs_by_genre)

Visualising Tracks’ Features

Here’s a correlation matrix of the tracks’ features:

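import seaborn as sn
sn.set(rc={'figure.figsize': (12, 10)})
# numeric_only=True (pandas 1.5+) skips the non-numeric 'id' and 'time_signature' columns
sn.heatmap(tf_df.corr(numeric_only=True), annot=True)
plt.show()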
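
Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a rough sketch of what that number means, here is the formula written out in plain Python; pearson is an illustrative function, not the pandas implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the common 1/n factor (it cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Values close to 1 or -1 indicate a strong linear relationship between two features, while values near 0 indicate none.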

You could also plot a bivariate KDE for a specific pair of variables:

sn.set(rc={'figure.figsize': (20, 20)})
sn.jointplot(data=tf_df, x="loudness", y="energy", kind="kde")

How are the most popular tracks different from all the tracks in the dataset? Let’s find out by plotting a feature portrait of the corresponding sets, given mean values of selected features.

feat_cols = ['danceability', 'energy', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence']
# Audio features of the 100 most popular tracks
top_100_ids = by_track_pop['track_id'].head(100)
top_100_feat = tf_df[tf_df['id'].isin(top_100_ids)][feat_cols]
# Row 0: means over the top 100 tracks; row 1: means over all tracks
mean_vals = pd.DataFrame([top_100_feat.mean(), tf_df[feat_cols].mean()])
print(mean_vals)
import plotly.graph_objects as go
fig = go.Figure(
    data=[
        go.Scatterpolar(r=mean_vals.iloc[0], theta=feat_cols, fill='toself', name='Top 100'),
        go.Scatterpolar(r=mean_vals.iloc[1], theta=feat_cols, fill='toself', name='All'),
    ],
    layout=go.Layout(
        title=go.layout.Title(text='Feature comparison'),
        polar={'radialaxis': {'visible': True}},
        showlegend=True
    )
)
fig.show()

It looks like the most popular songs are slightly more danceable and have higher valence. They also have virtually zero instrumentalness and low liveness.

Get Recommendations

The last step in our analysis is to get track recommendations given an artist ID, a genre, and a track ID. The output is randomised, so Spotify never runs out of content suggestions.

rec = sp.recommendations(seed_artists=["3PhoLpVuITZKcymswpck5b"], seed_genres=["pop"], seed_tracks=["1r9xUipOqoNwggBpENDsvJ"], limit=100)
# Print the first listed artist and the title of each recommended track
for track in rec['tracks']:
    print(track['artists'][0]['name'], track['name'])
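
The call above uses three seeds and asks for 100 tracks. As far as I recall from the Spotify reference, a request may carry at most five seeds in total and a limit between 1 and 100, so treat both bounds as assumptions. Under those assumptions, a small check before the request could look like this; check_seeds is a hypothetical helper.

```python
def check_seeds(seed_artists=(), seed_genres=(), seed_tracks=(), limit=20):
    """Validate recommendation arguments and return them as keyword arguments.

    Illustrative helper; the bounds (1 to 5 seeds, limit 1 to 100) are assumed.
    """
    total = len(seed_artists) + len(seed_genres) + len(seed_tracks)
    if not 1 <= total <= 5:
        raise ValueError(f"expected 1 to 5 seeds in total, got {total}")
    if not 1 <= limit <= 100:
        raise ValueError(f"limit must be between 1 and 100, got {limit}")
    return {
        "seed_artists": list(seed_artists),
        "seed_genres": list(seed_genres),
        "seed_tracks": list(seed_tracks),
        "limit": limit,
    }
```

The result can be unpacked into the request, as in sp.recommendations(**check_seeds(seed_genres=["pop"], limit=50)).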

Conclusion

This article gives a brief overview of Spotify Web API’s methods and shows how the fetched data might be analysed and plotted.
