CHECK OUT MY SLIDESHOW PRESENTATION ON MY DATA
NOTE: The data folder is not in this repo because of GitHub's size limit. Here is where you can find the tracks.csv file needed for this project to run.
NOTE: The Spotify API auth key is also omitted for security purposes. To get your own Spotify API key, follow these instructions.
- ID ◼
- Name ◼
- Popularity ★
- Explicit ♦
- Artists (◼,◼)
- Release Date ◼
- Danceability ♣
- Energy ♣
- Key ★
- Loudness ♣
- Speechiness ♣
- Acousticness ♣
- Instrumentalness ♣
- Liveness ♣
- Valence ♣
- Mode ★
- Tempo ♣
- Time Signature ★
- Duration ♣
- Release Year ★
- Name ◼
- Popularity ★
- Danceability ♣
- Duration ♣
- Release Year ★
Note: The boxplots exclude outliers
- Explicit
- Duration
- H0: Explicit = Non Explicit
- HA: Explicit > Non Explicit
- Using bootstrapping, we resample our sample with replacement many times and record the mean of each resample, which approximates the sampling distribution of the mean.
- 95% Confidence Intervals
  - Non Explicit: (26.72, 26.74)
  - Explicit: (45.68, 45.69)
- Since the two confidence intervals do not overlap, I will reject the null hypothesis. There is enough evidence to show that the population mean popularity of explicit songs is greater than that of non-explicit songs.
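The bootstrap step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names `bootstrap_mean_ci` and `intervals_overlap` are hypothetical, the percentile method is one common way to build the interval (the project may have used another), and the popularity scores are synthetic stand-ins for tracks.csv.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Keep the central `confidence` share of the bootstrap means.
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: synthetic popularity scores for illustration only,
# clipped to the 0-100 range; these are not the project's data.
data_rng = random.Random(42)
non_explicit = [min(100.0, max(0.0, data_rng.gauss(27, 18))) for _ in range(2000)]
explicit = [min(100.0, max(0.0, data_rng.gauss(46, 18))) for _ in range(200)]

ci_non_explicit = bootstrap_mean_ci(non_explicit)
ci_explicit = bootstrap_mean_ci(explicit)
print("Non explicit:", ci_non_explicit)
print("Explicit:", ci_explicit)
print("Intervals overlap:", intervals_overlap(ci_non_explicit, ci_explicit))
```

Comparing whether two intervals overlap is a conservative check: non-overlapping 95% intervals imply a significant difference, but overlapping ones do not by themselves rule one out.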
- H0: Popularity of songs with length less than or equal to 5 minutes is equal to the popularity of songs with length greater than 5 minutes
- HA: Popularity of songs with length less than or equal to 5 minutes is greater than the popularity of songs with length greater than 5 minutes
- Using the Central Limit Theorem, I will model the sample means as normally distributed and use Welch's T-Test to calculate my P-value.
- At an α level of 0.05 and with a P-value of 1.57e-34, I will reject the null hypothesis. There is enough evidence to show that the population mean popularity of songs less than or equal to 5 minutes in length is greater than that of songs greater than 5 minutes in length.
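For reference, a one-sided Welch test like the one above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: the function name `welch_t_test_greater` is hypothetical, the data are synthetic, and the P-value uses a standard normal approximation to Student's t, which is only reasonable for large samples. A library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would use the t distribution directly.

```python
import math
import random
import statistics


def welch_t_test_greater(sample_a, sample_b):
    """One-sided Welch test of HA: mean(sample_a) > mean(sample_b).

    Returns (t_statistic, degrees_of_freedom, approximate_p_value).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Per-group squared standard errors; Welch does not pool the variances.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    # ASSUMPTION: large samples, so the upper tail of Student's t is
    # approximated by the upper tail of N(0, 1): P(Z > t) = erfc(t / sqrt(2)) / 2.
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))
    return t_stat, dof, p_value


# ASSUMPTION: synthetic popularity scores for illustration only.
data_rng = random.Random(7)
short_songs = [data_rng.gauss(30, 18) for _ in range(3000)]  # <= 5 minutes
long_songs = [data_rng.gauss(25, 18) for _ in range(600)]    # > 5 minutes

t_stat, dof, p_value = welch_t_test_greater(short_songs, long_songs)
alpha = 0.05
print(f"t = {t_stat:.2f}, dof = {dof:.1f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Welch's variant is used instead of the pooled-variance t-test because the two groups differ greatly in size and need not share a variance.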
I created a couple of helper functions that simplify comparing two different artists on a given metric using the Spotify API. With them, we can run hypothesis tests faster.
Using the Central Limit Theorem, we can conduct Welch's T-Test on the chosen metric and obtain a P-value.
We can assume our null and alternative hypotheses are as follows:
- H0: ArtistOneMetric = ArtistTwoMetric
- HA: ArtistOneMetric > ArtistTwoMetric
Below you'll find the docstrings for the functions that make this process simple to use.
def GetTwoArtists(artist_one, artist_two, years=None, nofeatures=True, metric='popularity', save=False, nameAppend=""):
    """Get the test for two different artists according to a metric.

    NOTE: CompareArtistsCLT() is called from within this function

    Parameters
    ----------
    artist_one : list<track>
        a list of analysis track objects
    artist_two : list<track>
        a list of analysis track objects
    years : string
        years to look for in spotify data
    nofeatures : boolean
        do we want data with features
    metric : string
        the metric to look for
    save : boolean
        save figure?
    nameAppend : string
        text to append to filename

    Returns
    -------
    artist_one
        artist_one track analysis
    artist_two
        artist_two track analysis
    """
def CompareArtistsCLT(self, artists, metric='popularity', labels=[], save=False, nameAppend=""):
    """Compare artists metrics using central limit theorem and t testing

    Parameters
    ----------
    artists : list<track data>
        artists data to use
    metric : string
        metric to measure
    labels : list<string>
        list of strings to label our normal models
    save : boolean
        save figure?
    nameAppend : string
        text to append to data

    Returns
    -------
    float
        p value of our t test
    """
Below you'll find some sample tests I've conducted, as well as their results.