Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()
Adam Cseresznye
August 9, 2023
I’ve decided to get more involved in the Tidy Tuesday movement. I think it’s a really enjoyable way to improve my skills in working with data, like analyzing, organizing, and visualizing it. The cool datasets they provide make it even more interesting. More information on Tidy Tuesday and their past datasets can be found here.
This week we have a dataset related to the show Hot Ones. Hot Ones is a unique web series that combines spicy food challenges with celebrity interviews. Hosted by Sean Evans, guests tackle increasingly hot chicken wings while answering questions, leading to candid and entertaining conversations. The show’s blend of heat and honesty has turned it into a global sensation, offering a fresh take on interviews and captivating audiences worldwide.
Let’s see what we can learn from the data 🔬🕵️♂️.
We have three dataframes: sauces
, season
and episodes
. For more information about the data dictionary see the GitHub repo.
sauces=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv')
season=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv')
episodes=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')
sauces.sample(3)
season | sauce_number | sauce_name | scoville | |
---|---|---|---|---|
96 | 10 | 7 | Bravado Spice Company – Aka Miso Ghost Reaper | 116000 |
11 | 2 | 2 | Tapatío | 3000 |
85 | 9 | 6 | Hell Fire Detroit Habanero | 66000 |
season | episodes | note | original_release | last_release | |
---|---|---|---|---|---|
6 | 7 | 12.0 | NaN | 2018-10-04 | 2018-12-20 |
2 | 3 | 24.0 | NaN | 2017-01-19 | 2017-06-29 |
14 | 15 | 12.0 | NaN | 2021-05-27 | 2021-08-12 |
season | episode_overall | episode_season | title | original_release | guest | guest_appearance_number | finished | |
---|---|---|---|---|---|---|---|---|
250 | 17 | 251 | 9 | Jacob Elordi Feels Euphoric While Eating Spicy... | 2022-03-17 | Jacob Elordi | 1 | True |
181 | 11 | 182 | 8 | Zac Efron Ups the Ante While Eating Spicy Wings | 2020-04-02 | Zac Efron | 1 | True |
11 | 2 | 12 | 4 | T.J. Miller Talks Deadpool, Hecklers, and Rela... | 2016-02-12 | T.J. Miller | 1 | True |
After taking a look at all three datasets, I think I will continue working with the sauces
and episodes
ones.
My main questions are:
(sauces
.groupby('season')
.agg(AVG_scoville=('scoville', 'mean'),
median_scoville=('scoville', 'median'),
)
.reset_index()
.melt(id_vars='season',
var_name='statistics',
)
.pipe(lambda df: (ggplot(df, aes('season', 'value',fill='statistics'))
+ geom_bar(stat='identity', show_legend= False)
+ facet_wrap('statistics',nrow=1,scales='free_y')
+ labs(x='Seasons',
y='Scoville Units'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(1000,600)
)
)
)
There seems to be a shift during the season 3-5 period as can be seen in Figure 1. Both indicators – mean and median Scoville Units – show a consistent upward trend over this time frame and later on they stabilize.
What about the overall spread of the data? 🤔
Here are some observations: Season 5 exhibits the widest range, featuring sauces with Scoville Units spanning from 450 to 2,000,000. In addition, starting from season 6 onwards, the averages, medians, and ranges of Scoville Units appear to even out.
To answer this question we will use the episodes
dataframe. Keep in mind there are 300 episodes in this dataframe. The finished
column can be useful here. Just by looking for entries where finished == False
we will have our answer.
season | guest | guest_appearance_number | |
---|---|---|---|
0 | 1 | Tony Yayo | 1 |
7 | 1 | DJ Khaled | 1 |
19 | 2 | Mike Epps | 1 |
20 | 2 | Jim Gaffigan | 1 |
24 | 2 | Rob Corddry | 1 |
51 | 3 | Ricky Gervais | 1 |
90 | 4 | Mario Batali | 1 |
96 | 5 | Taraji P. Henson | 1 |
129 | 7 | Lil Yachty | 1 |
130 | 7 | E-40 | 1 |
144 | 8 | Shaq | 1 |
171 | 10 | Chance the Rapper | 1 |
185 | 12 | Eric André | 2 |
218 | 15 | Quavo | 1 |
251 | 17 | Pusha T | 1 |
Taking a closer look, it seems that around 15 participants didn’t make it through the entire Hot Ones interview challenge.Not bad out of 300 shows. And guess what? Eric André popped up on the show not just once, but twice! Now, the big question: did he conquer the hot seat in at least one of those interviews? Let’s plot the data to make it more visual…
(episodes
.query("finished==False")
.groupby('season')
.finished
.count()
.to_frame()
.reset_index()
.pipe(lambda df: (ggplot(df, aes('season', 'finished'))
+ geom_bar(stat='identity')
+ labs(x='Seasons',
y='Number of incomplete interviews'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
Interestingly, a majority of these incomplete interviews belong to season 2 (Figure 3). This fact is quite surprising, especially when you consider that the maximum Scoville value for that season was only 550,000 – nearly a quarter of the following year’s value, where only one person faced difficulty finishing the challenge.
To get to the bottom of this question, let’s start by figuring out how many episodes were in each season. We can grab this info from the season dataset. Just a heads-up, in season 9 they seem to threw in an extra episode. So, keep this in mind! Otherwise, you might end up with percentages that go beyond 100%.
# First we need to find out the total number of episodes per season
episodes_per_season = (season
[['season', 'episodes', 'note']]
.set_index('season')
# we need to extract the one extra episode in season 9
.assign(note=lambda df: df.note
.str.extract(r'([0-9.]+)')
.astype(float),
# add the two column together
episodes=lambda df: df.episodes
.add(df.note, fill_value=0)
)
.drop(columns='note')
.squeeze()
)
completion_rate = (episodes
.query("finished==True")
.groupby('season')
.finished
.sum()
.div(episodes_per_season)
.mul(100)
.to_frame().reset_index()
.rename(columns={0:'completion_rate'})
)
(completion_rate
.pipe(lambda df: (ggplot(df, aes('season', 'completion_rate'))
+ geom_line(stat='identity')
+ labs(x='Seasons',
y='% successful participants'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
Taking a peek at Figure 4, it seems like the normalized completion rate hits its lowest point in season 1, closely followed by season 7. However, even in those seasons, the rate remains surprisingly high.
Here’s a curious thought: could there be a link between Scoville Units and the completion rate? I’m just wondering if the spiciness level affects how well participants handle the challenge. Exploring this connection might add a spicy twist to the Hot Ones experience – let’s see where the data takes us!
# AVG_scoville code comes from a code snippet
# 'What differences can be observed in the spiciness
# of sauces as we look across the various seasons?'
AVG_scoville = (sauces
.groupby('season')
.agg(AVG_scoville=('scoville', 'mean'))
.squeeze()
)
# Let's calculate the Pearson correlation coefficient. We have to discard the last value
# as the completion rate is not defined for that
stats.pearsonr(AVG_scoville.values[:-1],completion_rate.completion_rate.values[:-1])
PearsonRResult(statistic=0.5039021286250108, pvalue=0.02349225734224539)
The Pearson correlation coefficient has its say: there’s actually a moderate positive correlation(0.5, p<0.05) between Scoville Units and completion rate. Quite intriguing, isn’t it? Honestly, I was expecting the opposite outcome myself! It seems like the higher the average spiciness, the more determined the guests become. Take a look at Figure 5.
(pd.concat([AVG_scoville,completion_rate],axis=1)
.rename(columns={0:'success_rate'})
.pipe(lambda df: (ggplot(df, aes('AVG_scoville', 'completion_rate'))
+ geom_point(size=5, alpha=0.5)
+ geom_smooth()
+ labs(x='Average Scoville units',
y='% successful participants'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
Has there been a brave soul who dared to make a return to the show for a second time? The column guest_appearance_number
holds the answers you’re looking for.
guest | season | episode_season | finished | |
---|---|---|---|---|
124 | Eddie Huang | 6 | 13 | True |
161 | Jay Pharoah | 9 | 999 | True |
183 | Tom Segura | 12 | 1 | True |
185 | Eric André | 12 | 3 | False |
188 | T-Pain | 12 | 6 | True |
189 | Adam Richman | 12 | 7 | True |
190 | Action Bronson | 12 | 8 | True |
203 | NaN | 13 | 11 | True |
214 | Russell Brand | 14 | 11 | True |
215 | Steve-O | 14 | 12 | True |
241 | Gordon Ramsay | 16 | 14 | True |
254 | Post Malone | 18 | 1 | True |
It looks like a total of 12 individuals have taken on the challenge not once, but twice. Hats off to their courage! 🎩
And there you have it, folks – that’s a wrap! I hope you enjoyed exploring the Hot Ones dataset with me. Rest assured, more of these analyses are in the pipeline for the future. Stay tuned for what’s to come!
---
title: 'Tidy Tuesday: Hot Ones Episodes'
author: Adam Cseresznye
date: '2023-08-09'
categories:
- Tidy Tuesday
jupyter: python3
toc: true
format:
html:
code-fold: true
code-tools: true
---
#| label: Hot ones logo
#| fig-cap: "fig1"
![](Hot_Ones_by_First_We_Feast_logo.svg.png){fig-align="center" width=50%}
I've decided to get more involved in the Tidy Tuesday movement. I think it's a really enjoyable way to improve my skills in working with data, like analyzing, organizing, and visualizing it. The cool datasets they provide make it even more interesting. More information on Tidy Tuesday and their past datasets can be found [here](https://github.com/rfordatascience/tidytuesday).
This week we have a dataset related to the show Hot Ones. [Hot Ones](https://www.youtube.com/playlist?list=PLAzrgbu8gEMIIK3r4Se1dOZWSZzUSadfZ) is a unique web series that combines spicy food challenges with celebrity interviews. Hosted by Sean Evans, guests tackle increasingly hot chicken wings while answering questions, leading to candid and entertaining conversations. The show's blend of heat and honesty has turned it into a global sensation, offering a fresh take on interviews and captivating audiences worldwide.
Let's see what we can learn from the data 🔬🕵️♂️.
# Import data
```{python}
#| tags: []
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()
```
We have three dataframes: `sauces`, `season` and `episodes`. For more information about the data dictionary see the [GitHub repo](https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08/readme.md).
```{python}
#| tags: []
sauces=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv')
season=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv')
episodes=pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')
sauces.sample(3)
```
```{python}
#| tags: []
season.sample(3)
```
```{python}
#| tags: []
episodes.sample(3)
```
::: {.callout-note}
After taking a look at all three datasets, I think I will continue working with the `sauces` and `episodes` ones.
:::
# Questions
**My main questions are:**
* What differences can be observed in the spiciness of sauces as we look across the various seasons?
* Has every individual successfully completed all the episodes?
* What is the average completion rate per season?
* Is there a correlation between Scoville Units and completion rate?
* Are there any returning guests?
# Data Analysis
## What differences can be observed in the spiciness of sauces as we look across the various seasons?
```{python}
#| fig-cap: Average and median Scoville Units across the Hot Ones' seasons
#| label: fig-fig1
#| tags: []
(sauces
.groupby('season')
.agg(AVG_scoville=('scoville', 'mean'),
median_scoville=('scoville', 'median'),
)
.reset_index()
.melt(id_vars='season',
var_name='statistics',
)
.pipe(lambda df: (ggplot(df, aes('season', 'value',fill='statistics'))
+ geom_bar(stat='identity', show_legend= False)
+ facet_wrap('statistics',nrow=1,scales='free_y')
+ labs(x='Seasons',
y='Scoville Units'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(1000,600)
)
)
)
```
There seems to be a shift during the season 3-5 period as can be seen in @fig-fig1. Both indicators – *mean* and *median* Scoville Units – show a consistent upward trend over this time frame and later on they stabilize.
What about the overall spread of the data? 🤔
```{python}
#| fig-cap: Spread of Scoville Units across the Hot Ones' seasons
#| label: fig-fig2
#| tags: []
(sauces
.loc[:, ['season', 'scoville']]
.pipe(lambda df: (ggplot(df, aes('season', 'scoville'))
+ geom_boxplot()
+ scale_y_log10()
+ labs(x='Seasons',
y='log(Scoville Units)'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(1000,600)
)
)
)
```
Here are some observations: Season 5 exhibits the widest range, featuring sauces with Scoville Units spanning from 450 to 2,000,000. In addition, starting from season 6 onwards, the *averages*, *medians*, and *ranges* of Scoville Units appear to even out.
## Has every individual successfully completed all the episodes?
To answer this question we will use the `episodes` dataframe. Keep in mind there are 300 episodes in this dataframe. The `finished` column can be useful here. Just by looking for entries where `finished == False` we will have our answer.
```{python}
#| tags: []
(episodes
.query("finished==False")
[['season', 'guest', 'guest_appearance_number']]
)
```
Taking a closer look, it seems that around 15 participants didn't make it through the entire Hot Ones interview challenge.Not bad out of 300 shows. And guess what? Eric André popped up on the show not just once, but twice! Now, the big question: did he conquer the hot seat in at least one of those interviews? Let's plot the data to make it more visual...
```{python}
#| fig-cap: Number of incomplete interviews per season
#| label: fig-fig3
#| tags: []
(episodes
.query("finished==False")
.groupby('season')
.finished
.count()
.to_frame()
.reset_index()
.pipe(lambda df: (ggplot(df, aes('season', 'finished'))
+ geom_bar(stat='identity')
+ labs(x='Seasons',
y='Number of incomplete interviews'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
```
Interestingly, a majority of these incomplete interviews belong to season 2 (@fig-fig3). This fact is quite surprising, especially when you consider that the maximum Scoville value for that season was only 550,000 – nearly a quarter of the following year's value, where only one person faced difficulty finishing the challenge.
![](https://thumbs.gfycat.com/DishonestPoorBluebreastedkookaburra-max-1mb.gif){fig-align="center"}
## What is the completion rate per season?
To get to the bottom of this question, let's start by figuring out how many episodes were in each season. We can grab this info from the season dataset. Just a heads-up, in season 9 they seem to threw in an extra episode. So, keep this in mind! Otherwise, you might end up with percentages that go beyond 100%.
```{python}
#| tags: []
# First we need to find out the total number of episodes per season
episodes_per_season = (season
[['season', 'episodes', 'note']]
.set_index('season')
# we need to extract the one extra episode in season 9
.assign(note=lambda df: df.note
.str.extract(r'([0-9.]+)')
.astype(float),
# add the two column together
episodes=lambda df: df.episodes
.add(df.note, fill_value=0)
)
.drop(columns='note')
.squeeze()
)
```
```{python}
#| fig-cap: Completion rate per season
#| label: fig-fig4
#| tags: []
completion_rate = (episodes
.query("finished==True")
.groupby('season')
.finished
.sum()
.div(episodes_per_season)
.mul(100)
.to_frame().reset_index()
.rename(columns={0:'completion_rate'})
)
(completion_rate
.pipe(lambda df: (ggplot(df, aes('season', 'completion_rate'))
+ geom_line(stat='identity')
+ labs(x='Seasons',
y='% successful participants'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
```
Taking a peek at @fig-fig4, it seems like the normalized completion rate hits its lowest point in season 1, closely followed by season 7. However, even in those seasons, the rate remains surprisingly high.
## Is there a correlation between Scoville Units and completion rate?
Here's a curious thought: could there be a link between Scoville Units and the completion rate? I'm just wondering if the spiciness level affects how well participants handle the challenge. Exploring this connection might add a spicy twist to the Hot Ones experience – let's see where the data takes us!
```{python}
#| tags: []
# AVG_scoville code comes from a code snippet
# 'What differences can be observed in the spiciness
# of sauces as we look across the various seasons?'
AVG_scoville = (sauces
.groupby('season')
.agg(AVG_scoville=('scoville', 'mean'))
.squeeze()
)
# Let's calculate the Pearson correlation coefficient. We have to discard the last value
# as the completion rate is not defined for that
stats.pearsonr(AVG_scoville.values[:-1],completion_rate.completion_rate.values[:-1])
```
The Pearson correlation coefficient has its say: there's actually a moderate positive correlation(0.5, *p*<0.05) between Scoville Units and completion rate. Quite intriguing, isn't it? Honestly, I was expecting the opposite outcome myself! It seems like the higher the average spiciness, the more determined the guests become. Take a look at @fig-fig5.
```{python}
#| fig-cap: Correlation between Average Scoville units and Completion rates
#| label: fig-fig5
#| tags: []
(pd.concat([AVG_scoville,completion_rate],axis=1)
.rename(columns={0:'success_rate'})
.pipe(lambda df: (ggplot(df, aes('AVG_scoville', 'completion_rate'))
+ geom_point(size=5, alpha=0.5)
+ geom_smooth()
+ labs(x='Average Scoville units',
y='% successful participants'
)
+ theme(plot_title=element_text(size=20,face='bold'))
+ ggsize(600,400)
)
)
)
```
## Are there any returning guests?
Has there been a brave soul who dared to make a return to the show for a second time? The column `guest_appearance_number` holds the answers you're looking for.
```{python}
#| tags: []
(episodes
.query("guest_appearance_number > 1")
[['guest','season', 'episode_season', 'finished']]
)
```
It looks like a total of 12 individuals have taken on the challenge not once, but twice. Hats off to their courage! 🎩
# Final words
And there you have it, folks – that's a wrap! I hope you enjoyed exploring the Hot Ones dataset with me. Rest assured, more of these analyses are in the pipeline for the future. Stay tuned for what's to come!