I’ve decided to get more involved in the Tidy Tuesday movement. I think it’s a really enjoyable way to improve my skills in working with data, like analyzing, organizing, and visualizing it. The cool datasets they provide make it even more interesting. More information on Tidy Tuesday and their past datasets can be found here.

This week we have a dataset related to the show Hot Ones. Hot Ones is a unique web series that combines spicy food challenges with celebrity interviews. Hosted by Sean Evans, guests tackle increasingly hot chicken wings while answering questions, leading to candid and entertaining conversations. The show’s blend of heat and honesty has turned it into a global sensation, offering a fresh take on interviews and captivating audiences worldwide.

Let’s see what we can learn from the data 🔬🕵️‍♂️.

Import data

Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
from scipy import stats
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()

We have three dataframes: sauces, season and episodes. For more information about the data dictionary see the GitHub repo.

Code

sauces=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv')
season=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv')
episodes=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')

sauces.sample(3)

	season	sauce_number	sauce_name	scoville
96	10	7	Bravado Spice Company – Aka Miso Ghost Reaper	116000
11	2	2	Tapatío	3000
85	9	6	Hell Fire Detroit Habanero	66000

Code

season.sample(3)

	season	episodes	note	original_release	last_release
6	7	12.0	NaN	2018-10-04	2018-12-20
2	3	24.0	NaN	2017-01-19	2017-06-29
14	15	12.0	NaN	2021-05-27	2021-08-12

Code

episodes.sample(3)

	season	episode_overall	episode_season	title	original_release	guest	guest_appearance_number	finished
250	17	251	9	Jacob Elordi Feels Euphoric While Eating Spicy...	2022-03-17	Jacob Elordi	1	True
181	11	182	8	Zac Efron Ups the Ante While Eating Spicy Wings	2020-04-02	Zac Efron	1	True
11	2	12	4	T.J. Miller Talks Deadpool, Hecklers, and Rela...	2016-02-12	T.J. Miller	1	True

Note

After taking a look at all three datasets, I think I will continue working with the sauces and episodes ones.

Questions

My main questions are:

What differences can be observed in the spiciness of sauces as we look across the various seasons?
Has every individual successfully completed all the episodes?
What is the average completion rate per season?
Is there a correlation between Scoville Units and completion rate?
Are there any returning guests?

Data Analysis

What differences can be observed in the spiciness of sauces as we look across the various seasons?

Code

(sauces
 .groupby('season')
 .agg(AVG_scoville=('scoville', 'mean'),
      median_scoville=('scoville', 'median'),
     )
 .reset_index()
 .melt(id_vars='season',
       var_name='statistics',
      )
 .pipe(lambda df: (ggplot(df, aes('season', 'value',fill='statistics'))
                   + geom_bar(stat='identity', show_legend= False)
                   + facet_wrap('statistics',nrow=1,scales='free_y')
                   + labs(x='Seasons',
                          y='Scoville Units'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                  + ggsize(1000,600)
                  )
      )
 
)

Figure 1: Average and median Scoville Units across the Hot Ones’ seasons

There seems to be a shift during the season 3-5 period as can be seen in Figure 1. Both indicators – mean and median Scoville Units – show a consistent upward trend over this time frame and later on they stabilize.

What about the overall spread of the data? 🤔

Code

(sauces
 .loc[:, ['season', 'scoville']]
 .pipe(lambda df: (ggplot(df, aes('season', 'scoville'))
                   + geom_boxplot()
                   + scale_y_log10()
                   + labs(x='Seasons',
                          y='log(Scoville Units)'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(1000,600)
                  )
      )
)

Figure 2: Spread of Scoville Units across the Hot Ones’ seasons

Here are some observations: Season 5 exhibits the widest range, featuring sauces with Scoville Units spanning from 450 to 2,000,000. In addition, starting from season 6 onwards, the averages, medians, and ranges of Scoville Units appear to even out.

Has every individual successfully completed all the episodes?

To answer this question we will use the episodes dataframe. Keep in mind there are 300 episodes in this dataframe. The finished column can be useful here. Just by looking for entries where finished == False we will have our answer.

Code

(episodes
 .query("finished==False")
 [['season', 'guest', 'guest_appearance_number']]
)

	season	guest	guest_appearance_number
0	1	Tony Yayo	1
7	1	DJ Khaled	1
19	2	Mike Epps	1
20	2	Jim Gaffigan	1
24	2	Rob Corddry	1
51	3	Ricky Gervais	1
90	4	Mario Batali	1
96	5	Taraji P. Henson	1
129	7	Lil Yachty	1
130	7	E-40	1
144	8	Shaq	1
171	10	Chance the Rapper	1
185	12	Eric André	2
218	15	Quavo	1
251	17	Pusha T	1

Taking a closer look, it seems that around 15 participants didn’t make it through the entire Hot Ones interview challenge.Not bad out of 300 shows. And guess what? Eric André popped up on the show not just once, but twice! Now, the big question: did he conquer the hot seat in at least one of those interviews? Let’s plot the data to make it more visual…

Code

(episodes
 .query("finished==False")
 .groupby('season')
 .finished
 .count()
 .to_frame()
 .reset_index()
 .pipe(lambda df: (ggplot(df, aes('season', 'finished'))
                   + geom_bar(stat='identity')
                   + labs(x='Seasons',
                          y='Number of incomplete interviews'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )

)

Figure 3: Number of incomplete interviews per season

Interestingly, a majority of these incomplete interviews belong to season 2 (Figure 3). This fact is quite surprising, especially when you consider that the maximum Scoville value for that season was only 550,000 – nearly a quarter of the following year’s value, where only one person faced difficulty finishing the challenge.

What is the completion rate per season?

To get to the bottom of this question, let’s start by figuring out how many episodes were in each season. We can grab this info from the season dataset. Just a heads-up, in season 9 they seem to threw in an extra episode. So, keep this in mind! Otherwise, you might end up with percentages that go beyond 100%.

Code

# First we need to find out the total number of episodes per season

episodes_per_season = (season
                       [['season', 'episodes', 'note']]
                       .set_index('season')
                       # we need to extract the one extra episode in season 9
                       .assign(note=lambda df: df.note
                               .str.extract(r'([0-9.]+)')
                               .astype(float),
                        # add the two column together
                               episodes=lambda df: df.episodes
                               .add(df.note, fill_value=0)
                              )
                       .drop(columns='note')
                       .squeeze()
                      )

Code

completion_rate = (episodes
                   .query("finished==True")
                   .groupby('season')
                   .finished
                   .sum()
                   .div(episodes_per_season)
                   .mul(100)
                   .to_frame().reset_index()
                   .rename(columns={0:'completion_rate'})
                  )
                   
(completion_rate                  
 .pipe(lambda df: (ggplot(df, aes('season', 'completion_rate'))
                   + geom_line(stat='identity')
                   + labs(x='Seasons',
                          y='% successful participants'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )
)

Figure 4: Completion rate per season

Taking a peek at Figure 4, it seems like the normalized completion rate hits its lowest point in season 1, closely followed by season 7. However, even in those seasons, the rate remains surprisingly high.

Is there a correlation between Scoville Units and completion rate?

Here’s a curious thought: could there be a link between Scoville Units and the completion rate? I’m just wondering if the spiciness level affects how well participants handle the challenge. Exploring this connection might add a spicy twist to the Hot Ones experience – let’s see where the data takes us!

Code

# AVG_scoville code comes from a code snippet 
# 'What differences can be observed in the spiciness 
# of sauces as we look across the various seasons?'

AVG_scoville = (sauces
                .groupby('season')
                .agg(AVG_scoville=('scoville', 'mean'))
                .squeeze()
               )

# Let's calculate the Pearson correlation coefficient. We have to discard the last value
# as the completion rate is not defined for that

stats.pearsonr(AVG_scoville.values[:-1],completion_rate.completion_rate.values[:-1])

PearsonRResult(statistic=0.5039021286250108, pvalue=0.02349225734224539)

The Pearson correlation coefficient has its say: there’s actually a moderate positive correlation(0.5, p<0.05) between Scoville Units and completion rate. Quite intriguing, isn’t it? Honestly, I was expecting the opposite outcome myself! It seems like the higher the average spiciness, the more determined the guests become. Take a look at Figure 5.

Code

(pd.concat([AVG_scoville,completion_rate],axis=1)
 .rename(columns={0:'success_rate'})
 .pipe(lambda df: (ggplot(df, aes('AVG_scoville', 'completion_rate'))
                   + geom_point(size=5, alpha=0.5)
                   + geom_smooth()
                   + labs(x='Average Scoville units',
                          y='% successful participants'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )
)

Figure 5: Correlation between Average Scoville units and Completion rates

Are there any returning guests?

Has there been a brave soul who dared to make a return to the show for a second time? The column guest_appearance_number holds the answers you’re looking for.

Code

(episodes
 .query("guest_appearance_number > 1")
 [['guest','season', 'episode_season', 'finished']]
)

	guest	season	episode_season	finished
124	Eddie Huang	6	13	True
161	Jay Pharoah	9	999	True
183	Tom Segura	12	1	True
185	Eric André	12	3	False
188	T-Pain	12	6	True
189	Adam Richman	12	7	True
190	Action Bronson	12	8	True
203	NaN	13	11	True
214	Russell Brand	14	11	True
215	Steve-O	14	12	True
241	Gordon Ramsay	16	14	True
254	Post Malone	18	1	True

It looks like a total of 12 individuals have taken on the challenge not once, but twice. Hats off to their courage! 🎩

Final words

And there you have it, folks – that’s a wrap! I hope you enjoyed exploring the Hot Ones dataset with me. Rest assured, more of these analyses are in the pipeline for the future. Stay tuned for what’s to come!

--- title: 'Tidy Tuesday: Hot Ones Episodes' author: Adam Cseresznye date: '2023-08-09' categories: - Tidy Tuesday jupyter: python3 toc: true format: html: code-fold: true code-tools: true --- #| label: Hot ones logo #| fig-cap: "fig1" ![](Hot_Ones_by_First_We_Feast_logo.svg.png){fig-align="center" width=50%} I've decided to get more involved in the Tidy Tuesday movement. I think it's a really enjoyable way to improve my skills in working with data, like analyzing, organizing, and visualizing it. The cool datasets they provide make it even more interesting. More information on Tidy Tuesday and their past datasets can be found [here](https://github.com/rfordatascience/tidytuesday). This week we have a dataset related to the show Hot Ones. [Hot Ones](https://www.youtube.com/playlist?list=PLAzrgbu8gEMIIK3r4Se1dOZWSZzUSadfZ) is a unique web series that combines spicy food challenges with celebrity interviews. Hosted by Sean Evans, guests tackle increasingly hot chicken wings while answering questions, leading to candid and entertaining conversations. The show's blend of heat and honesty has turned it into a global sensation, offering a fresh take on interviews and captivating audiences worldwide. Let's see what we can learn from the data 🔬🕵️‍♂️. # Import data ```{python} #| tags: [] import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from scipy import stats from lets_plot import * from lets_plot.mapping import as_discrete LetsPlot.setup_html() ``` We have three dataframes: `sauces`, `season` and `episodes`. For more information about the data dictionary see the [GitHub repo](https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08/readme.md). ```{python} #| tags: [] sauces=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv') season=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv') episodes=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv') sauces.sample(3) ``` ```{python} #| tags: [] season.sample(3) ``` ```{python} #| tags: [] episodes.sample(3) ``` ::: {.callout-note} After taking a look at all three datasets, I think I will continue working with the `sauces` and `episodes` ones. ::: # Questions **My main questions are:** * What differences can be observed in the spiciness of sauces as we look across the various seasons? * Has every individual successfully completed all the episodes? * What is the average completion rate per season? * Is there a correlation between Scoville Units and completion rate? * Are there any returning guests? # Data Analysis ## What differences can be observed in the spiciness of sauces as we look across the various seasons? ```{python} #| fig-cap: Average and median Scoville Units across the Hot Ones' seasons #| label: fig-fig1 #| tags: [] (sauces .groupby('season') .agg(AVG_scoville=('scoville', 'mean'), median_scoville=('scoville', 'median'), ) .reset_index() .melt(id_vars='season', var_name='statistics', ) .pipe(lambda df: (ggplot(df, aes('season', 'value',fill='statistics')) + geom_bar(stat='identity', show_legend= False) + facet_wrap('statistics',nrow=1,scales='free_y') + labs(x='Seasons', y='Scoville Units' ) + theme(plot_title=element_text(size=20,face='bold')) + ggsize(1000,600) ) ) ) ``` There seems to be a shift during the season 3-5 period as can be seen in @fig-fig1. Both indicators – *mean* and *median* Scoville Units – show a consistent upward trend over this time frame and later on they stabilize. What about the overall spread of the data? 🤔 ```{python} #| fig-cap: Spread of Scoville Units across the Hot Ones' seasons #| label: fig-fig2 #| tags: [] (sauces .loc[:, ['season', 'scoville']] .pipe(lambda df: (ggplot(df, aes('season', 'scoville')) + geom_boxplot() + scale_y_log10() + labs(x='Seasons', y='log(Scoville Units)' ) + theme(plot_title=element_text(size=20,face='bold')) + ggsize(1000,600) ) ) ) ``` Here are some observations: Season 5 exhibits the widest range, featuring sauces with Scoville Units spanning from 450 to 2,000,000. In addition, starting from season 6 onwards, the *averages*, *medians*, and *ranges* of Scoville Units appear to even out. ## Has every individual successfully completed all the episodes? To answer this question we will use the `episodes` dataframe. Keep in mind there are 300 episodes in this dataframe. The `finished` column can be useful here. Just by looking for entries where `finished == False` we will have our answer. ```{python} #| tags: [] (episodes .query("finished==False") [['season', 'guest', 'guest_appearance_number']] ) ``` Taking a closer look, it seems that around 15 participants didn't make it through the entire Hot Ones interview challenge.Not bad out of 300 shows. And guess what? Eric André popped up on the show not just once, but twice! Now, the big question: did he conquer the hot seat in at least one of those interviews? Let's plot the data to make it more visual... ```{python} #| fig-cap: Number of incomplete interviews per season #| label: fig-fig3 #| tags: [] (episodes .query("finished==False") .groupby('season') .finished .count() .to_frame() .reset_index() .pipe(lambda df: (ggplot(df, aes('season', 'finished')) + geom_bar(stat='identity') + labs(x='Seasons', y='Number of incomplete interviews' ) + theme(plot_title=element_text(size=20,face='bold')) + ggsize(600,400) ) ) ) ``` Interestingly, a majority of these incomplete interviews belong to season 2 (@fig-fig3). This fact is quite surprising, especially when you consider that the maximum Scoville value for that season was only 550,000 – nearly a quarter of the following year's value, where only one person faced difficulty finishing the challenge. ![](https://thumbs.gfycat.com/DishonestPoorBluebreastedkookaburra-max-1mb.gif){fig-align="center"} ## What is the completion rate per season? To get to the bottom of this question, let's start by figuring out how many episodes were in each season. We can grab this info from the season dataset. Just a heads-up, in season 9 they seem to threw in an extra episode. So, keep this in mind! Otherwise, you might end up with percentages that go beyond 100%. ```{python} #| tags: [] # First we need to find out the total number of episodes per season episodes_per_season = (season [['season', 'episodes', 'note']] .set_index('season') # we need to extract the one extra episode in season 9 .assign(note=lambda df: df.note .str.extract(r'([0-9.]+)') .astype(float), # add the two column together episodes=lambda df: df.episodes .add(df.note, fill_value=0) ) .drop(columns='note') .squeeze() ) ``` ```{python} #| fig-cap: Completion rate per season #| label: fig-fig4 #| tags: [] completion_rate = (episodes .query("finished==True") .groupby('season') .finished .sum() .div(episodes_per_season) .mul(100) .to_frame().reset_index() .rename(columns={0:'completion_rate'}) ) (completion_rate .pipe(lambda df: (ggplot(df, aes('season', 'completion_rate')) + geom_line(stat='identity') + labs(x='Seasons', y='% successful participants' ) + theme(plot_title=element_text(size=20,face='bold')) + ggsize(600,400) ) ) ) ``` Taking a peek at @fig-fig4, it seems like the normalized completion rate hits its lowest point in season 1, closely followed by season 7. However, even in those seasons, the rate remains surprisingly high. ## Is there a correlation between Scoville Units and completion rate? Here's a curious thought: could there be a link between Scoville Units and the completion rate? I'm just wondering if the spiciness level affects how well participants handle the challenge. Exploring this connection might add a spicy twist to the Hot Ones experience – let's see where the data takes us! ```{python} #| tags: [] # AVG_scoville code comes from a code snippet # 'What differences can be observed in the spiciness # of sauces as we look across the various seasons?' AVG_scoville = (sauces .groupby('season') .agg(AVG_scoville=('scoville', 'mean')) .squeeze() ) # Let's calculate the Pearson correlation coefficient. We have to discard the last value # as the completion rate is not defined for that stats.pearsonr(AVG_scoville.values[:-1],completion_rate.completion_rate.values[:-1]) ``` The Pearson correlation coefficient has its say: there's actually a moderate positive correlation(0.5, *p*<0.05) between Scoville Units and completion rate. Quite intriguing, isn't it? Honestly, I was expecting the opposite outcome myself! It seems like the higher the average spiciness, the more determined the guests become. Take a look at @fig-fig5. ```{python} #| fig-cap: Correlation between Average Scoville units and Completion rates #| label: fig-fig5 #| tags: [] (pd.concat([AVG_scoville,completion_rate],axis=1) .rename(columns={0:'success_rate'}) .pipe(lambda df: (ggplot(df, aes('AVG_scoville', 'completion_rate')) + geom_point(size=5, alpha=0.5) + geom_smooth() + labs(x='Average Scoville units', y='% successful participants' ) + theme(plot_title=element_text(size=20,face='bold')) + ggsize(600,400) ) ) ) ``` ## Are there any returning guests? Has there been a brave soul who dared to make a return to the show for a second time? The column `guest_appearance_number` holds the answers you're looking for. ```{python} #| tags: [] (episodes .query("guest_appearance_number > 1") [['guest','season', 'episode_season', 'finished']] ) ``` It looks like a total of 12 individuals have taken on the challenge not once, but twice. Hats off to their courage! 🎩 # Final words And there you have it, folks – that's a wrap! I hope you enjoyed exploring the Hot Ones dataset with me. Rest assured, more of these analyses are in the pipeline for the future. Stay tuned for what's to come!