What Gives a Country a Diverse Music Taste?

James McGhee, Neha Shijo, Varunika Tewani

Introduction

New styles of music can originate and take hold anywhere, a phenomenon which has led to the creation of hundreds of new genres. People all around the world have different music tastes, some local and others global, but some countries might have more homogeneous preferences than others. In our project we decided to investigate whether certain factors, like the number of incoming immigrants, the median age, or the size of a country's urban population, correlate with more diverse music taste in a country. Is a lower median age an indicator of more diversity in the music of a country, since younger people might have broader music taste than their older counterparts? Do more incoming migrants to a country bring more unique music genres to that country as well? If a country has a larger urban population, will its people be exposed to more types of music, and thus have more diverse taste?

Our project attempts to determine which genres of music are most popular in 34 countries around the world, and to answer the question: what factors might lead a country to have a greater quantity and diversity of musical genres?

Data Collection

To answer this question, we used the following Kaggle dataset. It contains all of the songs from Spotify's Daily Top 200 charts for 35 regions (34 countries plus a "Global" chart) from 2017 to 2020. The dataset contains 170633 rows and 151 fields, including Title, Artist, Album, Genre, and Country.

Link: https://www.kaggle.com/pepepython/spotify-huge-database-daily-charts-over-3-years?select=Final+database.csv

We also used the following Kaggle dataset, which contains information about each country's population and several other characteristics (including the number of incoming immigrants, the median age, and the size of the urban population) over the years 1955-2020.

Link: https://www.kaggle.com/themlphdstudent/countries-population-from-1955-to-2020

First, we import the necessary packages and read in our dataset, which is located in a CSV file.

Since 151 columns can be cumbersome to read, let's keep only the necessary columns for a visually cleaner table.
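As a minimal sketch of these two steps (reading the CSV and trimming it to the columns we need), the snippet below uses a tiny in-memory stand-in for the real file. The column names come from the field list above; the sample rows and the extra "Popularity" column are made up for illustration, and with the real data you would pass the path of the downloaded CSV instead.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle CSV (the real file has 151 columns).
# The rows and the "Popularity" column are hypothetical.
SAMPLE_CSV = """Title,Artist,Album,Genre,Country,Popularity
Song A,Artist 1,Album X,dance pop,Argentina,10
Song B,Artist 2,Album Y,latin,Argentina,20
Song C,Artist 1,Album X,dance pop,Sweden,30
"""

KEEP_COLUMNS = ["Title", "Artist", "Album", "Genre", "Country"]


def load_songs(source, keep=KEEP_COLUMNS):
    """Read the chart data and keep only the columns named in `keep`."""
    # usecols skips the unwanted columns while parsing;
    # the final [keep] puts the columns in the order we asked for.
    return pd.read_csv(source, usecols=keep)[keep]


songs = load_songs(io.StringIO(SAMPLE_CSV))
print(songs.shape)
print(songs.head())
```

Selecting columns at read time keeps the table readable and also avoids holding the other columns in memory.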

Next, let's run some basic preliminary statistics on the dataset. Out of 170633 songs we have 47045 unique songs, 25524 unique artists, 34696 unique albums, 1120 unique genres and 34 unique countries.
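One way to get counts like these is pandas' nunique, sketched here on a few made-up rows. The column names are assumed from the field list above, and the numbers this toy example prints are not the ones reported for the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
songs = pd.DataFrame({
    "Title": ["Song A", "Song B", "Song A", "Song C"],
    "Artist": ["Artist 1", "Artist 2", "Artist 1", "Artist 1"],
    "Genre": ["dance pop", "latin", "dance pop", "pop"],
    "Country": ["Argentina", "Argentina", "Sweden", "Sweden"],
})

total_rows = len(songs)          # every chart entry, repeats included
unique_counts = songs.nunique()  # distinct values in each column

print(f"{total_rows} rows")
print(unique_counts)
```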

Data Visualization and Analysis

Next, let's plot a few summary measures of the data.

First, a bar chart of the distribution of songs among countries. We'll make an auxiliary dataframe to store the number of songs per country, and plot it in decreasing order. After plotting, it looks like the countries with the most "Spotify's Daily Top 200" songs in the dataset are Switzerland, Taiwan, Sweden, Germany, and Finland, with 7686, 7594, 6970, 6942, and 6783 songs respectively. In contrast, Ecuador, Peru, Philippines, Mexico, and Costa Rica are the least represented, with 2660, 2701, 2806, 2833, and 2975 songs respectively.
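The auxiliary dataframe described above could be built roughly as follows. This is a sketch on made-up rows, assuming the column is named "Country"; the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Made-up rows for illustration only.
songs = pd.DataFrame({
    "Title": [f"Song {i}" for i in range(6)],
    "Country": ["Sweden", "Sweden", "Sweden", "Peru", "Taiwan", "Taiwan"],
})

# value_counts() counts rows per country and sorts in decreasing order.
songs_per_country = (
    songs["Country"]
    .value_counts()
    .rename_axis("Country")
    .reset_index(name="Songs")
)
print(songs_per_country)

# With matplotlib installed, the bar chart could then be drawn with:
# songs_per_country.plot.bar(x="Country", y="Songs")
```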

Now we'll make a treemap to show the distribution of songs among genres. To do this, we group the dataset by genre so that each genre is linked to its number of songs. We then drop the genre 'n-a', which is a null value, and sort by the number of songs in each genre. Lastly, we use Squarify to plot the visual. As we can see, the most popular genres are dance pop, latin, pop, k-pop, and german hip hop, with 25351, 7591, 7146, 4053, and 3834 songs respectively.
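The grouping, dropping, and sorting steps might look like the sketch below. The rows are made up and the column name "Genre" is assumed from the field list; the treemap call itself is omitted, since the sorted counts are simply what a treemap library would take as sizes and labels.

```python
import pandas as pd

# Made-up rows; 'n-a' plays the role of the null genre.
songs = pd.DataFrame({
    "Title": ["a", "b", "c", "d", "e", "f"],
    "Genre": ["dance pop", "dance pop", "dance pop", "latin", "latin", "n-a"],
})

# Drop the placeholder genre, count songs per genre, largest first.
known = songs[songs["Genre"] != "n-a"]
genre_counts = known.groupby("Genre").size().sort_values(ascending=False)
print(genre_counts)

# Treemap inputs: one size and one label per genre.
sizes = genre_counts.to_list()
labels = genre_counts.index.to_list()
```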

Now that we know what the most popular genres of music are, let's see how the number of songs per genre is distributed. From describing the dataset and creating a boxplot, it looks like the mean is 148.955317 and the standard deviation is 902.533572, with counts ranging from a minimum of 1 to a maximum of 25351 songs per genre.

Let's move on to the artists appearing in this dataset. First, what is the distribution of followers per artist? Note that this dataset defines "Artist_followers" as the number of followers the artist had on Spotify on the 5th of November 2020. From the description and boxplot, we can see the mean is ~1755574 followers and the standard deviation is ~5412057, with values ranging from a minimum of 0 to a maximum of 71783101 followers per artist. Which artists have the most followers? Ed Sheeran with 71783101 followers, Ariana Grande with 52571724, Drake with 50593376, Rihanna with 39741508, and Justin Bieber with 39214943.
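A sketch of how the follower statistics and the top artists could be computed is shown below, on made-up rows. It assumes one row per charting song, so it keeps a single row per artist first; otherwise artists with many songs would be counted many times. Whether the original analysis deduplicated this way is an assumption.

```python
import pandas as pd

# Made-up rows: "Artist 1" charts twice, so it appears twice.
songs = pd.DataFrame({
    "Artist": ["Artist 1", "Artist 1", "Artist 2", "Artist 3"],
    "Artist_followers": [500, 500, 900, 100],
})

# One row per artist, so each artist is counted once.
artists = songs.drop_duplicates(subset="Artist")

# Mean, standard deviation, min, max, and quartiles of followers.
print(artists["Artist_followers"].describe())

# The artists with the most followers, largest first.
top_artists = artists.nlargest(2, "Artist_followers")
print(top_artists)
```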

Now let's start working with our population dataset. Since the Spotify data ranges from 2017 to 2020, we're going to filter out all other years from our data. The relevant variables we want to test are Urban Population Percentage and Median Age, so we'll keep those columns, along with Migrants (net), since we will need it to calculate a standardized "Immigrant Ratio" score. Instead of directly comparing the number of immigrants each country brings in, we will divide that number by the country's population to get an immigrant-to-citizen population ratio, which we can compare between countries of different sizes. After dividing the net number of migrants by the population, we scale the number up by a factor of 10000 just to make the scores more readable, and create a column of these new "Immigrant Ratio" scores.

Note: The Immigrant Ratio for China, India, Tokelau, and Holy See is 0 because these countries had negative net migrants (more people left than came), so their numbers were artificially set to 0.
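The Immigrant Ratio calculation can be sketched as follows. The countries and numbers are made up, the column names "Year" and "Population" are assumptions about the population dataset, and clipping negative values to 0 is just one way to reproduce the behaviour described in the note.

```python
import pandas as pd

# Made-up rows; "Migrants (net)" is negative when more people leave than arrive.
population = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "Year": [2016, 2017, 2017, 2018],
    "Population": [1_000_000, 1_010_000, 2_000_000, 500_000],
    "Migrants (net)": [5_000, 5_050, -4_000, 1_000],
})

# Keep only the years covered by the Spotify data.
recent = population[population["Year"].between(2017, 2020)].copy()

# Assumption: negative net migration is floored at 0.
net_migrants = recent["Migrants (net)"].clip(lower=0)

# Migrants per 10000 inhabitants, comparable across country sizes.
recent["Immigrant Ratio"] = net_migrants / recent["Population"] * 10_000
print(recent[["Country", "Year", "Immigrant Ratio"]])
```

For country A in 2017 this gives 5050 / 1010000 * 10000 = 50 migrants per 10000 inhabitants, while country B is floored at 0.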

Now that we have both of our datasets and they have been cleaned, we are going to look at which countries are in both datasets.

Notice that in the Spotify dataset, United States and United Kingdom are labeled as US and UK. Before we create our list of countries that are in both datasets, we will update their country names in the population dataset. The Spotify dataset has 35 countries while the population dataset has 235 countries. Also notice that one of the labels in the Spotify dataset is "Global", which we will not be using since it is not a country, so we will only be looking at 34 countries for the rest of the project. After looking through the countries in each dataset, we are able to compile a list of the countries we are focusing on.

Now that we have our condensed list of countries, we will update the Spotify data and population data to only store the countries we are looking at.
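A minimal sketch of this matching and filtering step is below. The rows are made up, and the full names "United States" and "United Kingdom" in the population data are an assumption; the rename map would need to match whatever spellings the two real datasets use.

```python
import pandas as pd

# Made-up stand-ins for the two datasets.
songs = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Country": ["Global", "US", "UK", "Argentina"],
})
population = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "Argentina", "Tokelau"],
    "Median Age": [38.0, 40.0, 31.0, 27.0],
})

# Align the population names with the Spotify labels (assumed spellings).
population["Country"] = population["Country"].replace(
    {"United States": "US", "United Kingdom": "UK"}
)

# Countries present in both datasets, leaving out the "Global" chart.
spotify_countries = set(songs["Country"]) - {"Global"}
common = sorted(spotify_countries & set(population["Country"]))

# Keep only those countries in each dataset.
songs = songs[songs["Country"].isin(common)]
population = population[population["Country"].isin(common)]
print(common)
```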

Now that we have condensed the data to only show the countries that are in both datasets, we are going to take a look at how much of each genre is in each country's top charts. A pie chart is good for this case since we get to see the proportions of each genre in that country's top charts between 2017 and 2020. Since there are so many countries, we chose three to make pie charts for: Argentina, United States, and United Kingdom. The pie charts are a little difficult to read since the countries have a lot of unique genres, but they are a good indicator of which genres are the most popular within each country and which countries have more unique genres than others.

Next, we take a look at the population data. The three main columns we want to focus on are Immigrant Ratio, Median Age, and Urban Population Percentage for each of our 34 countries. We thought it would be good to see how these changed from 2017 to 2020, so we decided to plot a line graph for each of them. For Immigrant Ratio, the general trend for each country is that it decreased slowly or stayed roughly constant over the 4 years. For Median Age, most countries stayed constant until 2019 and then increased in 2020. Lastly, Urban Population Percentage either stayed constant or slowly increased from 2017 to 2020 for most countries.

To get a count of unique genres in each country, we need to aggregate our Spotify dataset. First we create a group for every unique "Country"/"Genre" pair, and then we count the number of times each country appears in that list of groups. This way we count how many unique genres a country has.

Example:
1) Begin with the full dataset.
2) Group by country and genre: Argentina - Pop - 5 songs, Argentina - Rock - 4 songs, Argentina - Rap - 2 songs, Australia - Rap - 1 song, Australia - Funk - 2 songs.
3) Count the groups per country: Argentina - 3 genres, Australia - 2 genres.
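The worked example above can be sketched in pandas as follows. The song counts are the illustrative ones from the example, not real data, and the column names "Country" and "Genre" are assumed from the field list.

```python
import pandas as pd

# One row per song, matching the counts in the example.
songs = pd.DataFrame({
    "Country": ["Argentina"] * 11 + ["Australia"] * 3,
    "Genre": ["Pop"] * 5 + ["Rock"] * 4 + ["Rap"] * 2 + ["Rap"] + ["Funk"] * 2,
})

# Step 2: one group per Country/Genre pair, with its song count.
pair_counts = (
    songs.groupby(["Country", "Genre"]).size().reset_index(name="Songs")
)

# Step 3: the number of pairs per country is its number of unique genres.
genres_per_country = pair_counts.groupby("Country").size()
print(genres_per_country)
```

The same result can be obtained in one step with `songs.groupby("Country")["Genre"].nunique()`; the two-step version simply mirrors the description above.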

Now we have our unique genre dataset compiled, so let's make the other tables we'll need to test our variables' correlation. First, we can isolate each country and its Immigrant Ratio before taking the average of that Immigrant Ratio over the 4-year span. Here are the results:

We can repeat this process of isolating a variable and taking its average over the 4 years for the next 2 variables.

Note: Singapore is in row 26 and does not have data for its urban population, so we need to drop it.
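The averaging and the Singapore fix could be sketched like this. All numbers are made up, and "Urban Pop %" is an assumed column name; dropping by missing value rather than by row position is a design choice that keeps working if the row order changes.

```python
import numpy as np
import pandas as pd

# Made-up yearly values; Singapore has no urban population data.
population = pd.DataFrame({
    "Country": ["Sweden", "Sweden", "Singapore", "Singapore"],
    "Year": [2017, 2018, 2017, 2018],
    "Immigrant Ratio": [40.0, 38.0, 46.0, 47.0],
    "Median Age": [41.0, 41.0, 42.0, 42.0],
    "Urban Pop %": [88.0, 88.0, np.nan, np.nan],
})

variables = ["Immigrant Ratio", "Median Age", "Urban Pop %"]

# One row per country, each variable averaged over the available years.
averages = population.groupby("Country")[variables].mean()

# Drop countries with no urban data before using that variable.
urban = averages["Urban Pop %"].dropna()
print(averages)
print(urban)
```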

Since we have the average of each variable over the 4 years for each of the 34 countries, we made bar graphs for Immigrant Ratio, Median Age, and Urban Population Percentage to compare the countries' numbers with each other.

Testing Regression

Let’s see if there is any correlation between our variables of interest and the number of unique genres a country listens to! We can start with Immigrant Ratio and make a scatter plot where each point is a country whose x-position represents its immigrant ratio and whose y-position represents its number of unique genres.

The scatter plot and regression line don’t inspire too much confidence about a correlation between the two, but let's check the correlation coefficients to quantify our thoughts.

Unfortunately, with an R-squared value of 0.167, the two variables seem to have only a weak correlation. Let’s try with another variable! (Median Age)

Interesting! This set of points seems to follow a linear trend much more closely than the last set did. The scatter plot suggests a positive correlation between Median Age and Number of Unique Genres. Let’s check the statistics on that too.

An R-squared score of .544 is far better than before, and indicates a moderately strong linear relationship between the two variables. Let's look at our last variable. (Urban Population %)

Note: We have to drop the 26th row because Singapore has NaN listed for its Urban Pop %.

This doesn’t look too promising either. Let's look at the summary of the two variables' relationship.

A score of .003 means we are pretty safe to assume that there is essentially no linear correlation between Urban Population % and Number of Unique Genres.
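For reference, an R-squared score like the ones quoted in this section can be computed from a least-squares line as sketched below. This uses NumPy on made-up points and is not necessarily the tool used to produce the scores above; the function name r_squared is our own.

```python
import numpy as np


def r_squared(x, y):
    """Fit y = slope * x + intercept and return (slope, intercept, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)        # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)         # total variation in y
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up points standing in for (variable, number of unique genres).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r2 = r_squared(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")

# For a single predictor, R^2 equals the squared Pearson correlation.
pearson_r = np.corrcoef(x, y)[0, 1]
```

R-squared is the share of the variation in the number of unique genres that the fitted line accounts for, so a value near 0 means the line explains almost nothing.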

Conclusion

Looking at all 3 scores as a whole, it seems like the only relevant variable in predicting a country's musical diversity might be its Median Age. However, the results show a positive correlation between Median Age and Number of Unique Genres, the opposite of what we predicted at the beginning of our exploration. What could this mean? It’s possible that our hypothesis was incorrect, and that older people actually have more diverse music taste than younger people. A more likely explanation, though, is that there is a confounding variable, something that both Median Age and Number of Unique Genres positively correlate with. It's possible that countries with higher median ages are more developed and see more interaction with other cultures through more traffic in and out of their country. It’s also possible that a higher median age means more adults who are willing to buy Spotify subscriptions and listen to unique genres. Any number of things could explain the correlation between the two, and the answer will remain a little vague until we can do more analysis. However, we can be reasonably confident in our answer to the original question: Urban Population Percentage and Immigrant Ratio are poor indicators of the number of unique music genres a country listens to, while Median Age seems to be positively correlated with it. Perhaps in the future we can explore the reason behind the correlation we found, or look for other indicators that predict genre diversity better.