Tidy Tuesday: Hot Ones Episodes

Tidy Tuesday
Author

Adam Cseresznye

Published

August 9, 2023

I’ve decided to get more involved in the Tidy Tuesday movement. I think it’s a really enjoyable way to improve my skills in working with data, like analyzing, organizing, and visualizing it. The cool datasets they provide make it even more interesting. More information on Tidy Tuesday and their past datasets can be found here.

This week we have a dataset related to the show Hot Ones. Hot Ones is a unique web series that combines spicy food challenges with celebrity interviews. Hosted by Sean Evans, guests tackle increasingly hot chicken wings while answering questions, leading to candid and entertaining conversations. The show’s blend of heat and honesty has turned it into a global sensation, offering a fresh take on interviews and captivating audiences worldwide.

Let’s see what we can learn from the data 🔬🕵️‍♂️.

Import data

Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
from scipy import stats
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()

We have three dataframes: sauces, season and episodes. For more information about the data dictionary see the GitHub repo.

Code
sauces=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv')
season=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv')
episodes=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')

sauces.sample(3)
    season  sauce_number  sauce_name                                     scoville
96  10      7             Bravado Spice Company – Aka Miso Ghost Reaper  116000
11  2       2             Tapatío                                        3000
85  9       6             Hell Fire Detroit Habanero                     66000
Code
season.sample(3)
    season  episodes  note  original_release  last_release
6   7       12.0      NaN   2018-10-04        2018-12-20
2   3       24.0      NaN   2017-01-19        2017-06-29
14  15      12.0      NaN   2021-05-27        2021-08-12
Code
episodes.sample(3)
     season  episode_overall  episode_season  title                                               original_release  guest         guest_appearance_number  finished
250  17      251              9               Jacob Elordi Feels Euphoric While Eating Spicy...   2022-03-17        Jacob Elordi  1                        True
181  11      182              8               Zac Efron Ups the Ante While Eating Spicy Wings     2020-04-02        Zac Efron     1                        True
11   2       12               4               T.J. Miller Talks Deadpool, Hecklers, and Rela...   2016-02-12        T.J. Miller   1                        True
Note

After taking a look at all three datasets, I think I will focus on the sauces and episodes ones; the season dataframe only comes back later to supply the number of episodes per season.

Questions

My main questions are:

  • What differences can be observed in the spiciness of sauces as we look across the various seasons?
  • Has every individual successfully completed all the episodes?
  • What is the completion rate per season?
  • Is there a correlation between Scoville Units and completion rate?
  • Are there any returning guests?

Data Analysis

What differences can be observed in the spiciness of sauces as we look across the various seasons?

Code
(sauces
 .groupby('season')
 .agg(AVG_scoville=('scoville', 'mean'),
      median_scoville=('scoville', 'median'),
     )
 .reset_index()
 .melt(id_vars='season',
       var_name='statistics',
      )
 .pipe(lambda df: (ggplot(df, aes('season', 'value', fill='statistics'))
                   + geom_bar(stat='identity', show_legend=False)
                   + facet_wrap('statistics', nrow=1, scales='free_y')
                   + labs(x='Seasons',
                          y='Scoville Units'
                         )
                   + theme(plot_title=element_text(size=20, face='bold'))
                   + ggsize(1000, 600)
                  )
       )
)
Figure 1: Average and median Scoville Units across the Hot Ones’ seasons

Figure 1 shows a shift during seasons 3 to 5: both the mean and the median Scoville Units climb steadily over these seasons and then stabilize.

What about the overall spread of the data? 🤔

Code
(sauces
 .loc[:, ['season', 'scoville']]
 .pipe(lambda df: (ggplot(df, aes('season', 'scoville'))
                   + geom_boxplot()
                   + scale_y_log10()
                   + labs(x='Seasons',
                          y='Scoville Units (log scale)'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(1000,600)
                  )
      )
)
Figure 2: Spread of Scoville Units across the Hot Ones’ seasons

Here are some observations: Season 5 exhibits the widest range, with sauces spanning from 450 to 2,000,000 Scoville Units. In addition, from season 6 onwards, the averages, medians, and ranges of Scoville Units appear to even out.
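If you prefer numbers over boxplots, the spread can also be summarized per season. Below is a minimal sketch on a tiny made-up dataframe (toy_sauces and the max_to_min column are hypothetical names, and the values are not from the real sauces data); the same chain should work on sauces as long as it has season and scoville columns.

```python
import pandas as pd

# Toy stand-in for the sauces dataframe (made-up values, NOT the real data).
toy_sauces = pd.DataFrame({
    "season":   [1, 1, 1, 2, 2, 2],
    "scoville": [500, 5_000, 50_000, 1_000, 100_000, 1_000_000],
})

# Minimum, median and maximum per season, plus the max/min ratio
# as a rough measure of how wide the range is.
spread = (toy_sauces
          .groupby("season")["scoville"]
          .agg(["min", "median", "max"])
          .assign(max_to_min=lambda df: df["max"] / df["min"])
         )
print(spread)
```

A large max_to_min ratio is what shows up as a tall box-and-whisker on the log-scaled axis.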

Has every individual successfully completed all the episodes?

To answer this question we will use the episodes dataframe. Keep in mind there are 300 episodes in this dataframe. The finished column can be useful here. Just by looking for entries where finished == False we will have our answer.

Code
(episodes
 .query("finished==False")
 [['season', 'guest', 'guest_appearance_number']]
)
     season  guest              guest_appearance_number
0    1       Tony Yayo          1
7    1       DJ Khaled          1
19   2       Mike Epps          1
20   2       Jim Gaffigan       1
24   2       Rob Corddry        1
51   3       Ricky Gervais      1
90   4       Mario Batali       1
96   5       Taraji P. Henson   1
129  7       Lil Yachty         1
130  7       E-40               1
144  8       Shaq               1
171  10      Chance the Rapper  1
185  12      Eric André         2
218  15      Quavo              1
251  17      Pusha T            1
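To put the 15 rows above in proportion to the 300 episodes in the dataframe, here is the arithmetic as a tiny sketch (the variable names are mine; the two counts come from this post).

```python
# Counts taken from the post: 300 episodes in total, 15 rows with finished == False.
total_episodes = 300
unfinished = 15

unfinished_share = unfinished / total_episodes
finished_share = 1 - unfinished_share

print(f"Unfinished: {unfinished_share:.1%}")
print(f"Finished:   {finished_share:.1%}")
```

So roughly one interview in twenty ends early.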

Taking a closer look, 15 interviews ended without the guest making it through the entire Hot Ones challenge. Not bad out of 300 shows. And guess what? Eric André popped up on the show not just once, but twice, and the interview listed here is his second appearance. Now, the big question: did he conquer the hot seat in at least one of those interviews? Since his first appearance is not in the list above, it looks like he did. Let’s plot the data to make it more visual…

Code
(episodes
 .query("finished==False")
 .groupby('season')
 .finished
 .count()
 .to_frame()
 .reset_index()
 .pipe(lambda df: (ggplot(df, aes('season', 'finished'))
                   + geom_bar(stat='identity')
                   + labs(x='Seasons',
                          y='Number of incomplete interviews'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )

)
Figure 3: Number of incomplete interviews per season

Interestingly, season 2 accounts for the largest number of these incomplete interviews (Figure 3). This is quite surprising, given that the maximum Scoville value for that season was only 550,000, roughly a quarter of the following season’s value, where only one person had difficulty finishing the challenge.

What is the completion rate per season?

To get to the bottom of this question, let’s start by figuring out how many episodes were in each season. We can grab this info from the season dataset. Just a heads-up: in season 9 they seem to have thrown in an extra episode, which is recorded in the note column rather than in episodes. So, keep this in mind! Otherwise, you might end up with percentages that go beyond 100%.

Code
# First we need to find out the total number of episodes per season

episodes_per_season = (season
                       [['season', 'episodes', 'note']]
                       .set_index('season')
                       # pull the number of extra episodes (season 9) out of the note text;
                       # expand=False returns a Series instead of a one-column DataFrame
                       .assign(note=lambda df: df.note
                               .str.extract(r'([0-9.]+)', expand=False)
                               .astype(float),
                               # add the two columns together, treating a missing note as 0
                               episodes=lambda df: df.episodes
                               .add(df.note, fill_value=0)
                              )
                       .drop(columns='note')
                       .squeeze()
                      )
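To see the extraction step in isolation, here is a small sketch on a toy dataframe. The note wording ("1 bonus episode") and the names toy_season, extra and total are assumptions for illustration; the real text in seasons.csv may look different.

```python
import pandas as pd

toy_season = pd.DataFrame({
    "season": [8, 9, 10],
    "episodes": [12.0, 12.0, 12.0],
    # Hypothetical note wording; the real note text may differ.
    "note": [None, "1 bonus episode", None],
}).set_index("season")

# Grab the first run of digits/dots from each note; rows without a note stay NaN.
extra = (toy_season["note"]
         .str.extract(r"([0-9.]+)", expand=False)
         .astype(float)
        )

# fill_value=0 treats a missing note as "no extra episodes".
total = toy_season["episodes"].add(extra, fill_value=0)
print(total)
```

Only the row with a note gets bumped by one; the other seasons keep their original episode count.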
Code
completion_rate = (episodes
                   .query("finished==True")
                   .groupby('season')
                   .finished
                   .sum()
                   .div(episodes_per_season)
                   .mul(100)
                   # name the Series so reset_index yields a 'completion_rate' column
                   .rename('completion_rate')
                   .reset_index()
                  )
                   
(completion_rate                  
 .pipe(lambda df: (ggplot(df, aes('season', 'completion_rate'))
                   + geom_line(stat='identity')
                   + labs(x='Seasons',
                          y='% successful participants'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )
)
Figure 4: Completion rate per season

Taking a peek at Figure 4, it seems like the normalized completion rate hits its lowest point in season 1, closely followed by season 7. However, even in those seasons, the rate remains surprisingly high.
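As a cross-check, the completion rate can also be computed from the episodes dataframe alone, because the mean of a boolean column is the share of True values. This is an alternative sketch, not the approach used above: it counts only the rows present in the dataframe, so it can differ slightly if the episode counts in the season dataset disagree with them. The toy_episodes dataframe is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the episodes dataframe (made-up rows, NOT the real data).
toy_episodes = pd.DataFrame({
    "season":   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "finished": [True, False, True, True, True, True, True, False, True],
})

# Mean of a boolean column = fraction of True values, so no separate
# episode count per season is needed.
toy_completion_rate = (toy_episodes
                       .groupby("season")["finished"]
                       .mean()
                       .mul(100)
                       .rename("completion_rate")
                       .reset_index()
                      )
print(toy_completion_rate)
```

Because each season is divided by its own number of rows, this version cannot exceed 100%.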

Is there a correlation between Scoville Units and completion rate?

Here’s a curious thought: could there be a link between Scoville Units and the completion rate? I’m just wondering if the spiciness level affects how well participants handle the challenge. Exploring this connection might add a spicy twist to the Hot Ones experience – let’s see where the data takes us!

Code
# AVG_scoville code comes from a code snippet 
# 'What differences can be observed in the spiciness 
# of sauces as we look across the various seasons?'

AVG_scoville = (sauces
                .groupby('season')
                .agg(AVG_scoville=('scoville', 'mean'))
                .squeeze()
               )

# Let's calculate the Pearson correlation coefficient. We have to discard the last value
# as the completion rate is not defined for that

stats.pearsonr(AVG_scoville.values[:-1],completion_rate.completion_rate.values[:-1])
PearsonRResult(statistic=0.5039021286250108, pvalue=0.02349225734224539)
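As a reminder of what that statistic measures: the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. Here is a plain-Python sketch of the formula (pearson_r is a hypothetical helper written for this illustration, and the toy values are made up, not taken from the dataset); for the real analysis stats.pearsonr is the better choice since it also returns the p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences.

    Undefined (division by zero) when either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up values for four hypothetical seasons.
toy_heat = [10.0, 20.0, 30.0, 40.0]
toy_rate = [90.0, 92.0, 91.0, 97.0]
toy_r = pearson_r(toy_heat, toy_rate)
print(round(toy_r, 3))
```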

The Pearson correlation coefficient has its say: there’s actually a moderate positive correlation (r ≈ 0.50, p ≈ 0.02) between average Scoville Units and completion rate. Quite intriguing, isn’t it? Honestly, I was expecting the opposite outcome myself! It is tempting to conclude that the higher the average spiciness, the more determined the guests become, but with only one data point per season this is a correlation, not evidence of cause and effect. Take a look at Figure 5.

Code
# completion_rate has a positional index (0, 1, 2, ...), so index it by season first;
# otherwise concat would pair each season's average with the wrong row
(pd.concat([AVG_scoville, completion_rate.set_index('season')], axis=1)
 .pipe(lambda df: (ggplot(df, aes('AVG_scoville', 'completion_rate'))
                   + geom_point(size=5, alpha=0.5)
                   + geom_smooth()
                   + labs(x='Average Scoville units',
                          y='% successful participants'
                         )
                   + theme(plot_title=element_text(size=20,face='bold'))
                   + ggsize(600,400)
                  )
      )
)
Figure 5: Correlation between Average Scoville units and Completion rates
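One thing worth double-checking when combining per-season results is index alignment: pd.concat with axis=1 matches rows by index label, not by position. The sketch below uses made-up values and hypothetical names (avg_heat, rate) to show how a season-indexed Series and a dataframe with a default 0-based index end up shifted by one row, and how indexing both by season avoids it.

```python
import pandas as pd

# Made-up per-season values, purely for illustration.
avg_heat = pd.Series([100.0, 200.0, 300.0],
                     index=pd.Index([1, 2, 3], name="season"),
                     name="avg_heat")
rate = pd.DataFrame({"season": [1, 2, 3], "rate": [90.0, 95.0, 100.0]})

# Labels 1..3 (seasons) vs. labels 0..2 (positions): rows are shifted by one.
misaligned = pd.concat([avg_heat, rate], axis=1)

# Indexing both sides by season keeps each season's values on the same row.
aligned = pd.concat([avg_heat, rate.set_index("season")], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned frame, season 1's average sits on the same row as season 2's rate, which is easy to miss in a scatter plot.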

Are there any returning guests?

Has there been a brave soul who dared to make a return to the show for a second time? The column guest_appearance_number holds the answers you’re looking for.

Code
(episodes
 .query("guest_appearance_number > 1")
 [['guest','season', 'episode_season', 'finished']]
)
     guest           season  episode_season  finished
124  Eddie Huang     6       13              True
161  Jay Pharoah     9       999             True
183  Tom Segura      12      1               True
185  Eric André      12      3               False
188  T-Pain          12      6               True
189  Adam Richman    12      7               True
190  Action Bronson  12      8               True
203  NaN             13      11              True
214  Russell Brand   14      11              True
215  Steve-O         14      12              True
241  Gordon Ramsay   16      14              True
254  Post Malone     18      1               True

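It looks like 12 episodes feature a guest taking on the challenge again, although one of them has no guest name recorded in the data. Eric André is the only one in this list who did not finish on the return visit. Hats off to their courage! 🎩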
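When counting returning guests, it helps to separate the number of rows from the number of distinct named guests and the rows with a missing name. Here is a minimal sketch on a toy dataframe (guest names "A" and "B" are placeholders, not real guests); the same three lines should carry over to the episodes dataframe.

```python
import pandas as pd

# Toy stand-in for the episodes dataframe; guest names are placeholders.
toy = pd.DataFrame({
    "guest": ["A", "B", None, "A", "C"],
    "guest_appearance_number": [2, 2, 2, 3, 1],
})

returning = toy.query("guest_appearance_number > 1")

n_rows = len(returning)                           # episodes with a returning guest
n_named = returning["guest"].nunique()            # distinct names, missing values ignored
n_missing = int(returning["guest"].isna().sum())  # rows without a guest name

print(n_rows, n_named, n_missing)
```

Note that a guest with a third appearance would show up in two rows, which is why the row count and the distinct-name count can differ.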

Final words

And there you have it, folks – that’s a wrap! I hope you enjoyed exploring the Hot Ones dataset with me. Rest assured, more of these analyses are in the pipeline for the future. Stay tuned for what’s to come!