Tidy Tuesday: The Global Human Day

Tidy Tuesday
Author

Adam Cseresznye

Published

September 11, 2023

Photo by Filip Mroz

This week’s Tidy Tuesday takes us to The Human Chronome Project, an initiative based at McGill University in Montreal. We’re exploring insights from their paper “The global human day,” published in PNAS, together with the accompanying dataset on Zenodo.

The publication offers a broad view of our species’ activities, shedding light on how economic activity fits into the grand scheme of life and highlighting activities with significant potential for change. For a deeper dive into the methodology and additional visualizations, I encourage you to peruse the supplementary materials provided with the publication.

Import data

Code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm
from lets_plot.bistro.qq import qq_plot

import flag
from PIL import Image
from pathlib import Path

from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

While we have four distinct datasets at our disposal, our primary focus will be on two of them. The ‘all_countries’ dataset holds comprehensive information about the time allocated to each activity per region and country, while the ‘country_regions’ dataset provides the essential correspondence between country names and their ISO-2 and ISO-3 abbreviations. Let’s proceed by reading them all in.

Code
all_countries = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/all_countries.csv"
)
country_regions = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/country_regions.csv"
)
global_human_day = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/global_human_day.csv"
)
global_economic_activity = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/global_economic_activity.csv"
)

Data dictionaries

Here are the corresponding data dictionaries:

Data Dictionary

all_countries.csv

variable class description
Category character M24 categories
Subcategory character M24 subcategories
country_iso3 character Country code in iso3
region_code character Region code
population double Population
hoursPerDayCombined double Hours per day combined for the country
uncertaintyCombined double Uncertainty combined. Uncertainty is in units variance.

country_regions.csv

variable class description
region_code character Region code
region_name character Region name
country_name character Country name
M49_code double M49 code
country_iso2 character Country code in iso2
country_iso3 character Country code in iso3
alt_country_name character Alternative country name
alt_country_name1 character Alternative country name 1
alt_country_name2 character Alternative country name 2
alt_country_name3 character Alternative country name 3
alt_country_name4 character Alternative country name 4
alt_country_name5 character Alternative country name 5
alt_country_name6 character Alternative country name 6
other_code1 character Other country code 1
other_code2 character Other country code 2

global_human_day.csv

variable class description
Subcategory character M24 subcategory
hoursPerDay double Hours per day for all countries
uncertainty double Uncertainty in units variance.

global_economic_activity.csv

variable class description
Subcategory character M24 subcategory
hoursPerDay double Hours per day for all countries.
uncertainty double Uncertainty in units variance.

As in our previous Tidy Tuesday blog, I think we can address the following questions using this dataset:

1. Which regions allocate the most and least time to each subcategory?
2. Which countries exhibit the highest and lowest time allocation for each subcategory?
3. Is there a discernible relationship between time spent on each category and GDP per capita?

Get GDP per capita data from Wikipedia

To prepare for our third question, let’s grab GDP per capita data from Wikipedia. It will come in handy as we dive deeper into our analysis!

Code
# Define a dictionary for country name replacements
names_to_replace = {
    "Russia": "Russian Federation",
    "South Korea": "Korea, Republic of",
    "Vietnam": "Viet Nam",
    "Hong Kong": "Hong Kong, China",
    "Iran": "Iran, Islamic Republic of",
    "Venezuela": "Venezuela, Bolivarian Republic of",
    "Tanzania": "Tanzania, United Republic of",
    "DR Congo": "Congo",
    "Palestine": "Occupied Palestinian Territory",
    "Brunei": "Brunei Darussalam",
    "North Korea": "Korea Democratic People's Republic of",
    "Moldova": "Moldova, Republic of",
    "Syria": "Syrian Arab Republic",
    "North Macedonia": "Macedonia the former Yugoslav Republic of",
    "Laos": "Lao People's Democratic Republic",
    "Eswatini": "Swaziland",
    "Curaçao": "Curacao",
    "East Timor": "Timor-Leste",
    "Saint Vincent and the Grenadines": "Saint Vincent and Grenadines",
    "Sint Maarten": "Sint Maarten (Kingdom of the Netherlands)",
    "São Tomé and Príncipe": "Sao Tome and Principe",
    "Micronesia": "Micronesia Federated States of",
    "Macau": "Macau, China",
}

# Read GDP data from Wikipedia
gdp = (
    pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2]
    .pipe(lambda s: s.set_axis(s.columns.map("_".join), axis=1))
    .loc[
        :,
        [
            "Country/Territory_Country/Territory",
            "UN region_UN region",
            "United Nations[15]_Estimate",
        ],
    ]
    .rename(
        columns={
            "Country/Territory_Country/Territory": "country",
            "UN region_UN region": "UN_region",
            "United Nations[15]_Estimate": "gdp",
        }
    )
    .assign(country=lambda df: df.country.replace(names_to_replace))
    .assign(
        gdp=lambda df: df.gdp.replace(r"\D+", "", regex=True),
        iso3=lambda df: df.country.map(
            country_regions.loc[:, ["alt_country_name", "country_iso3"]]
            .set_index("alt_country_name")
            .squeeze()
            .to_dict()
        ),
    )
    .query("~iso3.isna()")
)
gdp
country UN_region gdp iso3
1 United States Americas 23315081 USA
2 China Asia 17734131 CHN
3 Japan Asia 4940878 JPN
4 Germany Europe 4259935 DEU
5 India Asia 3201471 IND
... ... ... ... ...
209 Anguilla Americas 303 AIA
210 Kiribati Oceania 227 KIR
211 Nauru Oceania 155 NRU
212 Montserrat Americas 72 MSR
213 Tuvalu Oceania 60 TUV

209 rows × 4 columns
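
A quick note on the `set_axis(s.columns.map("_".join), axis=1)` step above: `pd.read_html` parses the Wikipedia table with a two-level column header, and mapping `"_".join` over that MultiIndex flattens each (top, bottom) pair into a single underscore-joined name. A toy sketch (the column names here are illustrative):

```python
import pandas as pd

# Toy frame with a two-level header, like the one pd.read_html returns
# for the Wikipedia GDP table (names are illustrative).
toy = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples(
        [("Country/Territory", "Country/Territory"), ("United Nations", "Estimate")]
    ),
)

# Join each (level0, level1) tuple into one underscore-separated name
flat = toy.set_axis(toy.columns.map("_".join), axis=1)
print(flat.columns.tolist())
# -> ['Country/Territory_Country/Territory', 'United Nations_Estimate']
```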

After obtaining the GDP per capita data, we’ll integrate it into our main dataset.

Code
df = (
    all_countries.assign(
        HPD_with_u=lambda df: df.hoursPerDayCombined + df.uncertaintyCombined
    )
    .assign(
        gdp=lambda df: df.country_iso3.map(gdp.set_index("iso3").gdp.to_dict()).astype(
            float
        )
    )
    .query("~gdp.isna()")
    .assign(
        gdp_per_population=lambda df: df.gdp / df.population.div(1_000_000),
        gdp_per_population_log=lambda df: np.log10(df.gdp_per_population),
        Subcategory=lambda df: df.Subcategory.str.replace(" ", "_"),
    )
)

df
Category Subcategory country_iso3 region_code population hoursPerDayCombined uncertaintyCombined HPD_with_u gdp gdp_per_population gdp_per_population_log
0 Food provision Food_preparation ABW AM_C 101665 1.47 0.237630 1.707630 3126.0 30748.045050 4.487818
1 Food provision Food_growth_&_collection ABW AM_C 101665 0.17 0.026880 0.196880 3126.0 30748.045050 4.487818
2 Food provision Food_processing ABW AM_C 101665 0.00 0.007526 0.007526 3126.0 30748.045050 4.487818
3 Nonfood provision Materials ABW AM_C 101665 0.03 0.002072 0.032072 3126.0 30748.045050 4.487818
4 Nonfood provision Energy ABW AM_C 101665 0.05 0.003597 0.053597 3126.0 30748.045050 4.487818
... ... ... ... ... ... ... ... ... ... ... ...
4771 Experience oriented Social ZWE AF_E 14645473 0.93 0.594238 1.524238 24118.0 1646.788738 3.216638
4772 Experience oriented Active_recreation ZWE AF_E 14645473 0.39 0.100182 0.490182 24118.0 1646.788738 3.216638
4773 Experience oriented Meals ZWE AF_E 14645473 1.88 1.262100 3.142100 24118.0 1646.788738 3.216638
4774 Experience oriented Passive ZWE AF_E 14645473 2.25 1.815564 4.065564 24118.0 1646.788738 3.216638
4775 Experience oriented Interactive ZWE AF_E 14645473 1.10 0.350849 1.450849 24118.0 1646.788738 3.216638

4512 rows × 11 columns
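
The `.map(gdp.set_index("iso3").gdp.to_dict())` step used above behaves like a left join keyed on the ISO-3 code: matched keys pull in the GDP value, unmatched keys become NaN and are then dropped by `.query("~gdp.isna()")`. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical lookup, analogous to gdp.set_index("iso3").gdp
gdp_lookup = pd.Series({"USA": 23315081, "JPN": 4940878})

countries = pd.DataFrame({"country_iso3": ["USA", "JPN", "XYZ"]})

# .map with a dict acts like a left join: unmatched keys become NaN
countries["gdp"] = countries["country_iso3"].map(gdp_lookup.to_dict())
print(countries)
```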

Which regions allocate the most and least time to each subcategory?

First, to provide an overview of the prevailing trends, we will begin by examining disparities in time allocation across various subcategories within different regions. Following this, we will delve into a more granular analysis (as part of our second question), considering the data at the country level. As depicted in Figure 1, a noticeable pattern emerges. Typically, the minimum and maximum values for each activity type appear quite similar. However, it gets intriguing when we focus on certain categories like ‘Social,’ ‘Food Growth & Collection,’ ‘Schooling & Research,’ ‘Sleep & Bedrest,’ and ‘Passive.’ In these categories, we observe more significant differences, with gaps ranging from 1.05 to 1.59 hours between the maximum and minimum hours spent.

Code
maximum_values = (
    df.groupby(["region_code", "Subcategory"])
    .mean(numeric_only=True)
    .reset_index()
    .groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=False, by="hoursPerDayCombined").head(1))
    .reset_index(drop=True)[["region_code", "Subcategory", "hoursPerDayCombined"]]
)

minimum_values = (
    df.groupby(["region_code", "Subcategory"])
    .mean(numeric_only=True)
    .reset_index()
    .groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=True, by="hoursPerDayCombined").head(1))
    .reset_index(drop=True)[["region_code", "Subcategory", "hoursPerDayCombined"]]
)

# Concatenate the maximum and minimum records
concatenated_records = pd.concat([maximum_values, minimum_values], axis=0).reset_index(
    drop=True
)

# Assign flags, country names, and format subcategory names
(
    concatenated_records.assign(
        Subcategory=lambda df: df.Subcategory.str.replace("_", " "),
        # Boolean flag marking European regions, used to highlight them in the plot
        group=lambda df: df.region_code.str.contains("EU"),
    ).pipe(
        lambda df: ggplot(
            df,
            aes(
                "hoursPerDayCombined",
                as_discrete("Subcategory", order=1, order_by="hoursPerDayCombined"),
            ),
        )
        + geom_point(aes(color="group"), size=5, alpha=0.5, show_legend=False)
        + geom_text(
            aes(label="region_code"),
            position=position_nudge(y=0.25, x=-0),
            size=5,
            na_text="",
            angle=15,
            color="#2b8cbe",
            tooltips=layer_tooltips(
                ["Subcategory", "region_code", "hoursPerDayCombined"]
            ),
        )
        + labs(
            title="Regions with the Most and Least Time Spent on Each Subcategory",  # Title
            subtitle="""Abbreviations: EU = Europe, AF = Africa, AS = Asia, AM = Americas, ANZ = Australia and New Zealand
            E = East, W = West, N = North, S = South, C = Central, M = Middle
            Values from Europe are highlighted in green.
            """,  # Subtitle
            x="Mean Hours per day combined per Region",
            y="",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + scale_x_continuous(
            trans="log10", limits=[0.001, 24]
        )  # Set limits for the x-axis
        + ggsize(800, 600)  # Set the plot size
    )
)
Figure 1: Regions with the Most and Least Time Spent on Each Subcategory

Which countries allocate the most and least time to each subcategory?

As depicted in Figure 2 below, on average, the most significant time allocation is dedicated to Sleep and Bedrest, followed by Passive, Interactive, and Social Activities, as well as Activities Centered on Eating and Drinking, including Associated Socializing. While drawing definitive conclusions from this figure may be challenging, Japan stands out as a unique case. They allocate the least amount of time to Sleep and Bedrest while devoting the most time to Passive Activities and Hygiene and Grooming—an interesting observation.

Code
# Calculate the maximum values for each subcategory
maximum_values = (
    df.groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=False, by="hoursPerDayCombined").head(1))[
        ["country_iso3", "Subcategory", "hoursPerDayCombined"]
    ]
    .reset_index(drop=True)
    .assign(
        iso2=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_iso2.squeeze().to_dict()
        )
    )
)

# Calculate the minimum values for each subcategory
minimum_values = (
    df.groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=True, by="hoursPerDayCombined").head(1))[
        ["country_iso3", "Subcategory", "hoursPerDayCombined"]
    ]
    .reset_index(drop=True)
    .assign(
        iso2=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_iso2.squeeze().to_dict()
        )
    )
)

# Concatenate the maximum and minimum records
concatenated_records = pd.concat([maximum_values, minimum_values], axis=0).reset_index(
    drop=True
)

# Assign flags, country names, and format subcategory names
(
    concatenated_records.assign(
        flag=lambda df: df.iso2.apply(flag.flag),
        name=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_name.squeeze().to_dict()
        ),
        Subcategory=lambda df: df.Subcategory.str.replace("_", " "),
    ).pipe(
        lambda df: ggplot(
            df,
            aes(
                "hoursPerDayCombined",
                as_discrete("Subcategory", order=1, order_by="hoursPerDayCombined"),
            ),
        )
        + geom_point(size=5, alpha=0.25)
        + geom_text(
            aes(label="flag"),
            position=position_nudge(y=0.5, x=-0),
            size=8,
            na_text="",
            angle=15,
            color="#2b8cbe",
            tooltips=layer_tooltips(["Subcategory", "name", "hoursPerDayCombined"]),
        )
        + labs(
            title="Countries with the Most and Least Time Spent on Each Subcategory",  # Title
            subtitle="""On average, the highest time allocation is for Sleep and Bedrest, followed by Passive, Interactive, and 
            Social Activities, and Activities Centered on Eating and Drinking, including Associated Socializing""",  # Subtitle
            x="Hours per day combined for the country",
            y="",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + scale_x_continuous(limits=[0, 11])  # Set limits for the x-axis
        + ggsize(800, 600)  # Set the plot size
    )
)
Figure 2: Countries with the Most and Least Time Spent on Each Subcategory

Is there a discernible relationship between time spent on each category and GDP per capita?

Now, let’s make the most of our ‘gdp_per_population_log’ column. Please note that I’m not an economics expert, and our goal here is to identify intriguing trends without delving too deeply into interpretation. For a comprehensive understanding of the dataset, I recommend referring to the original publication.

Let’s begin by examining the broader trends. We aim to identify activities that appear to exhibit a noteworthy correlation with GDP per population, bearing in mind that GDP per population is log-transformed for our analysis. As evident from Figure 3, we’ve come across some intriguing, and perhaps not entirely surprising, observations. For instance, the data hints at a negative relationship between the hours per day spent on schooling and research and GDP per capita. This finding might raise eyebrows, as conventional wisdom often suggests that higher education leads to increased economic output. However, it’s worth considering that spending too much time in the educational system could potentially limit one’s participation in economic production compared to individuals with lower levels of education. It’s a thought-provoking discovery, and while it may challenge some assumptions, it certainly adds depth to our understanding of the data.

Code
(
    df.assign(Subcategory=lambda df: df.Subcategory.str.replace("_", " ")).pipe(
        lambda df: ggplot(df, aes("hoursPerDayCombined", "gdp_per_population_log"))
        + geom_point(alpha=0.5)
        + geom_smooth()
        + facet_wrap("Subcategory", scales="free")
        + labs(
            title="Relationship between Hours per Day Combined and Log(GDP per Population)",  # Title
            subtitle="""Activities vs. GDP per Capita at the country scale. Each circle on the chart 
            represents the population-average time spent per day for an individual country.""",
            x="Hours per day combined",
            y="log(GDP per population)",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 800)  # Set the plot size
    )
)
Figure 3: Relationship between Hours per Day Combined and Log(GDP per Population)

To further explore the connection between ‘gdp_per_population_log’ and ‘hoursPerDayCombined,’ we can utilize the statsmodels library to execute an Ordinary Least Squares (OLS) regression analysis. This method enables us to compute the β coefficients, in addition to establishing upper and lower confidence intervals, as well as deriving the associated p-values. To streamline this process, we’ll simplify our workflow by creating a function called get_linear_regression_table. This function will initially filter the data based on the unique subcategory of interest and subsequently apply the formula: {outcome_variable} ~ {independent_variable}.

Code
def get_linear_regression_table(
    data, outcome_variable, independent_variable, category, covariate
):
    """
    Calculate linear regression coefficients, confidence intervals, and p-values for each unique category or subcategory
    within the specified category column.

    Parameters:
    - data (pandas.DataFrame): The input DataFrame containing the data.
    - outcome_variable (str): The name of the dependent variable.
    - independent_variable (str): The name of the independent variable.
    - category (str): The column name specifying the category or subcategory for grouping.
                      Should be either 'Category' or 'Subcategory'.
    - covariate (str or None): The name of the covariate variable to include in the regression model.
                               Set to None if no covariate is needed.

    Returns:
    - results (pandas.DataFrame): A DataFrame containing the results of linear regression analysis for each unique
                                  category or subcategory. Columns include:
      - 'unique_subcategory': The unique category or subcategory.
      - 'β': The coefficient for the independent_variable in the regression model.
      - 'Lower CI': The lower confidence interval for the coefficient.
      - 'Upper CI': The upper confidence interval for the coefficient.
      - 'p_value': The p-value for the coefficient.

    Example usage:
    >>> results = get_linear_regression_table(data, 'OutcomeVar', 'IndependentVar', 'Category', 'CovariateVar')
    """
    results_list = []

    # 'category' selects the grouping column: either 'Subcategory' or 'Category'
    cat = "Subcategory" if category == "Subcategory" else "Category"
    unique_categories = data[cat].unique().tolist()

    for unique_category in unique_categories:
        temp_df = data.loc[lambda df: df[cat] == unique_category, :]

        if covariate is not None:
            # Fit the linear regression model with the covariate included
            reg = smf.ols(
                f"{outcome_variable} ~ {covariate} + {independent_variable}",
                data=temp_df,
            ).fit()
        else:
            # Fit the linear regression model without a covariate
            reg = smf.ols(
                f"{outcome_variable} ~ {independent_variable}", data=temp_df
            ).fit()

        # Get the coefficient, confidence interval, and p-value for the last
        # term (the independent variable) via positional .iloc indexing
        beta = reg.params.iloc[-1]
        lower_CI = reg.conf_int().iloc[-1, 0]
        upper_CI = reg.conf_int().iloc[-1, 1]
        pvalue = reg.pvalues.iloc[-1]

        # Append the results to the list as a dictionary
        results_list.append(
            {
                "unique_subcategory": unique_category,
                "β": beta,
                "Lower CI": lower_CI,
                "Upper CI": upper_CI,
                "p_value": pvalue,
            }
        )

    # Convert the list of dictionaries into a DataFrame
    results = pd.DataFrame(results_list)
    return results

According to our linear regression analysis, it appears that activities like Materials (involving the extraction of substances for various purposes), Food Processing (which includes food transformation and preservation), and Food Growth & Collection (encompassing activities related to edible organic matter) exhibit the most pronounced negative relationships with GDP per capita. Specifically, the β coefficients for these activities are -6.21, -2.86, and -1.03, respectively, with all associated p-values falling below the 0.05 threshold.
Conversely, we observe the most robust positive relationships with GDP per capita in the case of Active Recreation (involving activities with elevated metabolic activity), Infrastructure (associated with the construction and maintenance of transportation and communication structures), and Waste Management (pertaining to the handling of waste and by-products outside of inhabited areas). These activities exhibit β coefficients of 2.1, 12.17, and 25.17, respectively, with all corresponding p-values falling below the 0.05 significance level.

Code
(
    get_linear_regression_table(
        df,
        outcome_variable="gdp_per_population_log",
        independent_variable="hoursPerDayCombined",
        category="Subcategory",
        covariate=None,
    )
    .assign(unique_subcategory=lambda df: df.unique_subcategory.str.replace("_", " "))
    .sort_values(["β", "p_value"])
    .style.format(precision=3, thousands=",", decimal=".")
    .applymap(
        lambda value: "color: red; font-weight:bold" if (value <= 0.05) else None, subset=["p_value"]
    )
    .background_gradient(cmap="viridis", axis=0, subset="β")
)
  unique_subcategory β Lower CI Upper CI p_value
3 Materials -6.215 -7.807 -4.623 0.000
2 Food processing -2.860 -4.685 -1.034 0.002
1 Food growth & collection -1.034 -1.194 -0.874 0.000
13 Hygiene & grooming -0.995 -1.570 -0.420 0.001
14 Schooling & research -0.880 -1.184 -0.576 0.000
15 Religious practice -0.659 -1.200 -0.117 0.017
19 Social -0.617 -0.878 -0.357 0.000
16 Human transportation -0.204 -0.777 0.369 0.484
9 Sleep & bedrest -0.196 -0.389 -0.004 0.046
23 Interactive -0.004 -0.346 0.337 0.980
21 Meals 0.020 -0.313 0.353 0.905
12 Health care 0.187 -0.473 0.846 0.577
11 Physical child care 0.299 -0.308 0.905 0.333
22 Passive 0.472 0.313 0.630 0.000
0 Food preparation 0.549 0.137 0.962 0.009
17 Material transportation 0.618 -0.202 1.438 0.139
8 Inhabited environment 0.830 0.469 1.190 0.000
4 Energy 0.874 -1.552 3.300 0.478
18 Allocation 1.088 0.814 1.362 0.000
7 Artifacts 1.254 0.700 1.809 0.000
6 Buildings 2.057 1.240 2.874 0.000
20 Active recreation 2.105 1.547 2.662 0.000
5 Infrastructure 12.166 9.691 14.641 0.000
10 Waste management 25.172 10.269 40.074 0.001
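
One way to read these numbers: since the outcome is log10(GDP per population), a coefficient β implies a 10^(β·Δh) multiplicative difference in GDP per capita for a Δh-hour difference in the activity. A small illustration using the Active recreation estimate from the table (the half-hour difference is arbitrary, and this reads as an association, not a causal effect):

```python
# beta: the Active recreation slope on a log10(GDP per population) outcome
beta = 2.105
delta_h = 0.5  # an illustrative half-hour difference in daily time spent

# Fold change in GDP per capita implied by the fitted slope
fold_change = 10 ** (beta * delta_h)
print(f"{fold_change:.1f}x")  # ≈ 11.3x
```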

Although the existing analysis provides valuable insights, we can enhance our understanding further by building a model that considers both regions and different subcategories simultaneously. To achieve this, we’ll re-run our regression model with ‘gdp_per_population_log’ as the outcome variable and ‘hoursPerDayCombined’ as the independent variable. The ‘region_code’ will serve as a categorical variable, with Western Europe set as the reference level. Additionally, ‘Subcategory’ will also be treated as a categorical variable, with ‘Passive Activities’ chosen as the reference level. This approach will help us determine whether the hours spent on various activities exhibit any discernible relationship with GDP per capita.

As evident from the table below, none of the coefficients associated with the various subcategories showed significant differences compared to zero. This suggests that these subcategories do not exert a substantial effect on ‘gdp_per_population_log.’ Additionally, the β coefficient for ‘hoursPerDayCombined’ was also not statistically distinct from zero.

Conversely, region exerts a substantial influence: most regions carry significantly negative coefficients relative to Western Europe. The exceptions are North America (AM_N) and Australia and New Zealand (ANZ), whose p-values exceed 0.05. These findings suggest that region plays a significant role in explaining ‘gdp_per_population_log.’

Code
res = smf.ols(
    """gdp_per_population_log ~ hoursPerDayCombined
+ C(region_code, Treatment(reference='EU_W'))
+ C(Subcategory, Treatment(reference='Passive'))
""",
    data=df,
).fit()

(
    res.summary2()
    .tables[1]
    .style.format(precision=3, thousands=",", decimal=".")
    .applymap(
        lambda value: "color: red; font-weight:bold" if (value <= 0.05) else None, subset=["P>|t|"]
    )
    .background_gradient(cmap="viridis", axis=0, subset="Coef.")
)
  Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept 4.839 0.077 62.658 0.000 4.688 4.990
C(region_code, Treatment(reference='EU_W'))[T.AF_E] -1.744 0.035 -49.872 0.000 -1.812 -1.675
C(region_code, Treatment(reference='EU_W'))[T.AF_M] -1.448 0.041 -35.646 0.000 -1.528 -1.369
C(region_code, Treatment(reference='EU_W'))[T.AF_N] -1.195 0.046 -25.993 0.000 -1.285 -1.105
C(region_code, Treatment(reference='EU_W'))[T.AF_S] -1.204 0.046 -26.200 0.000 -1.294 -1.114
C(region_code, Treatment(reference='EU_W'))[T.AF_W] -1.731 0.036 -48.166 0.000 -1.801 -1.660
C(region_code, Treatment(reference='EU_W'))[T.AM_C] -0.859 0.034 -25.209 0.000 -0.926 -0.792
C(region_code, Treatment(reference='EU_W'))[T.AM_N] -0.042 0.063 -0.673 0.501 -0.166 0.081
C(region_code, Treatment(reference='EU_W'))[T.AM_S] -0.961 0.037 -25.737 0.000 -1.034 -0.888
C(region_code, Treatment(reference='EU_W'))[T.ANZ] -0.002 0.063 -0.031 0.975 -0.125 0.121
C(region_code, Treatment(reference='EU_W'))[T.AS_C] -1.341 0.046 -29.170 0.000 -1.431 -1.251
C(region_code, Treatment(reference='EU_W'))[T.AS_E] -0.675 0.042 -16.077 0.000 -0.757 -0.592
C(region_code, Treatment(reference='EU_W'))[T.AS_S] -1.424 0.040 -35.999 0.000 -1.502 -1.347
C(region_code, Treatment(reference='EU_W'))[T.AS_SE] -1.119 0.034 -32.659 0.000 -1.186 -1.052
C(region_code, Treatment(reference='EU_W'))[T.AS_W] -0.864 0.035 -24.721 0.000 -0.933 -0.796
C(region_code, Treatment(reference='EU_W'))[T.EU_E] -0.780 0.039 -20.176 0.000 -0.856 -0.705
C(region_code, Treatment(reference='EU_W'))[T.EU_N] -0.131 0.039 -3.390 0.001 -0.207 -0.055
C(region_code, Treatment(reference='EU_W'))[T.EU_S] -0.645 0.037 -17.276 0.000 -0.718 -0.572
C(Subcategory, Treatment(reference='Passive'))[T.Active_recreation] -0.000 0.067 -0.006 0.995 -0.133 0.132
C(Subcategory, Treatment(reference='Passive'))[T.Allocation] -0.000 0.057 -0.005 0.996 -0.113 0.112
C(Subcategory, Treatment(reference='Passive'))[T.Artifacts] -0.000 0.070 -0.006 0.995 -0.137 0.136
C(Subcategory, Treatment(reference='Passive'))[T.Buildings] -0.000 0.072 -0.006 0.995 -0.142 0.141
C(Subcategory, Treatment(reference='Passive'))[T.Energy] -0.000 0.076 -0.006 0.995 -0.149 0.148
C(Subcategory, Treatment(reference='Passive'))[T.Food_growth_&_collection] -0.000 0.065 -0.006 0.995 -0.127 0.127
C(Subcategory, Treatment(reference='Passive'))[T.Food_preparation] -0.000 0.060 -0.006 0.996 -0.118 0.117
C(Subcategory, Treatment(reference='Passive'))[T.Food_processing] -0.000 0.075 -0.006 0.995 -0.148 0.147
C(Subcategory, Treatment(reference='Passive'))[T.Health_care] -0.000 0.072 -0.006 0.995 -0.141 0.140
C(Subcategory, Treatment(reference='Passive'))[T.Human_transportation] -0.000 0.059 -0.005 0.996 -0.116 0.115
C(Subcategory, Treatment(reference='Passive'))[T.Hygiene_&_grooming] -0.000 0.057 -0.005 0.996 -0.112 0.112
C(Subcategory, Treatment(reference='Passive'))[T.Infrastructure] -0.000 0.076 -0.006 0.995 -0.149 0.148
C(Subcategory, Treatment(reference='Passive'))[T.Inhabited_environment] -0.000 0.060 -0.006 0.996 -0.117 0.116
C(Subcategory, Treatment(reference='Passive'))[T.Interactive] -0.000 0.058 -0.005 0.996 -0.113 0.113
C(Subcategory, Treatment(reference='Passive'))[T.Material_transportation] -0.000 0.071 -0.006 0.995 -0.139 0.138
C(Subcategory, Treatment(reference='Passive'))[T.Materials] -0.000 0.075 -0.006 0.995 -0.148 0.147
C(Subcategory, Treatment(reference='Passive'))[T.Meals] -0.000 0.048 -0.004 0.997 -0.094 0.094
C(Subcategory, Treatment(reference='Passive'))[T.Physical_child_care] -0.000 0.070 -0.006 0.995 -0.138 0.137
C(Subcategory, Treatment(reference='Passive'))[T.Religious_practice] -0.000 0.072 -0.006 0.995 -0.142 0.141
C(Subcategory, Treatment(reference='Passive'))[T.Schooling_&_research] -0.000 0.056 -0.005 0.996 -0.110 0.109
C(Subcategory, Treatment(reference='Passive'))[T.Sleep_&_bedrest] 0.001 0.156 0.007 0.994 -0.304 0.307
C(Subcategory, Treatment(reference='Passive'))[T.Social] -0.000 0.057 -0.005 0.996 -0.112 0.111
C(Subcategory, Treatment(reference='Passive'))[T.Waste_management] -0.000 0.077 -0.006 0.995 -0.151 0.150
hoursPerDayCombined -0.000 0.024 -0.007 0.994 -0.048 0.047

Regression diagnostic

Next, we’ll run some diagnostics on our regression model to evaluate how well it satisfies its underlying assumptions — a crucial step for ensuring the reliability and validity of the analysis.
As depicted in Figure 4, the residuals are approximately normally distributed. A few data points may have contributed to some poor predictions, but we have opted to keep the analysis as is, since the results appear reasonable.

Code
# Create a histogram plot of residuals
hist_plot = res.resid.to_frame(name="residuals").pipe(
    lambda df: ggplot(df, aes("residuals"))
    + geom_histogram(bins=15)
    + labs(
        title="Histogram of Residuals",
    )
    + theme(plot_title=element_text(size=15, face="bold"))
)

# Create a residuals plot
residuals_plot = (
    res.resid.to_frame(name="residuals")
    .reset_index()  # Reset the index for plotting
    .pipe(
        lambda df: ggplot(df, aes("index", "residuals"))
        + geom_point(size=5, alpha=0.05)  # Plot residuals as points
        + labs(
            title="Residuals",  # Title for the residuals plot
        )
        + theme(
            plot_title=element_text(size=15, face="bold")  # Customize title appearance
        )
    )
)


# Create a Q-Q plot of residuals
qqplot = res.resid.to_frame(name="residuals").pipe(
    lambda df: qq_plot(data=df, sample="residuals", size=5, alpha=0.05)
    + labs(
        title="Q-Q Plot",
    )
    + theme(plot_title=element_text(size=15, face="bold"))
)

# Create a fit plot comparing predicted values and log(GDP per population)
fit_plot = (
    pd.concat([res.fittedvalues, df.gdp_per_population_log], axis=1)
    .rename(
        columns={
            0: "Predicted value",
            "gdp_per_population_log": "log(GDP per population)",
        }
    )
    .pipe(
        lambda df: ggplot(df, aes("Predicted value", "log(GDP per population)"))
        + geom_point(size=5, alpha=0.05)
        + labs(
            title="Fit Plot",
        )
        + theme(plot_title=element_text(size=15, face="bold"))
        + geom_abline(slope=1, size=1, linetype="dashed", color="red")
    )
)

# Combine the three plots into a grid with two columns
(gggrid([hist_plot, residuals_plot, qqplot, fit_plot], ncol=2) + ggsize(800, 600))
Figure 4: Regression diagnostic

And there you have it, folks! We hope you’ve found this week’s exploration insightful. Happy coding to all, and I look forward to seeing you next week for more exciting discoveries and learning opportunities. Until then, stay curious and keep exploring! 👩‍💻👨‍💻