This week’s Tidy Tuesday takes us to The Human Chronome Project, an initiative based at McGill University in Montreal. We’re exploring insights from their paper titled “The global human day in PNAS” and the accompanying dataset on Zenodo.

The publication offers a broad view of our species’ activities, shedding light on how economic activities fit into the grand scheme of life. It also highlights activities with significant potential for change. For a deeper dive into our research methodology and to explore supplementary visualizations, I encourage you to peruse the supplementary materials provided with the publication.

Import data

Code

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm
from lets_plot.bistro.qq import qq_plot

import flag
from PIL import Image
from pathlib import Path

from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

While we have four distinct datasets at our disposal, our primary focus will be on two of them. The ‘all_countries’ dataset holds comprehensive information about the time allocated to each activity per region and country, while the ‘country_regions’ dataset provides the essential correspondence between country names and their ISO-2 and ISO-3 abbreviations. Let’s proceed by reading them all in.

Code

all_countries = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/all_countries.csv"
)
country_regions = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/country_regions.csv"
)
global_human_day = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/global_human_day.csv"
)
global_economic_activity = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-12/global_economic_activity.csv"
)

Data dictionaries

Here are the corresponding data dictionaries:

Data Dictionary

all_countries.csv

variable	class	description
Category	character	M24 categories
Subcategory	character	M24 subcategories
country_iso3	character	Country code in iso3
region_code	character	Region code
population	double	Population
hoursPerDayCombined	double	Hours per day combined for the country
uncertaintyCombined	double	Uncertainty combined. Uncertainty is in units variance.

country_regions.csv

variable	class	description
region_code	character	Region code
region_name	character	Region name
country_name	character	Country name
M49_code	double	M49 code
country_iso2	character	Country code in iso2
country_iso3	character	Country code in iso3
alt_country_name	character	Alternative country name
alt_country_name1	character	Alternative country name 1
alt_country_name2	character	Alternative country name 2
alt_country_name3	character	Alternative country name 3
alt_country_name4	character	Alternative country name 4
alt_country_name5	character	Alternative country name 5
alt_country_name6	character	Alternative country name 6
other_code1	character	Other country code 1
other_code2	character	Other country code 2

global_human_day.csv

variable	class	description
Subcategory	character	M24 subcategory
hoursPerDay	double	Hours per day for all countries
uncertainty	double	Uncertainty in units variance.

global_economic_activity.csv

variable	class	description
Subcategory	character	M24 subcategory
hoursPerDay	double	Hours per day for all countries.
uncertainty	double	Uncertainty in units variance.

As in our previous Tidy Tuesday blog, I think we can address the following questions using this dataset: 1. Which regions allocate the most and least time to each subcategory? 2. Which countries exhibit the highest and lowest time allocation for each subcategory? 3. Is there a discernible relationship between time spent on each category and GDP per capita?

Get GDP per capita data from Wikipedia

To prepare for our third question, let’s grab GDP per capita data from Wikipedia. It will come in handy as we dive deeper into our analysis!

Code

# Define a dictionary for country name replacements
names_to_replace = {
    "Russia": "Russian Federation",
    "South Korea": "Korea, Republic of",
    "Vietnam": "Viet Nam",
    "Hong Kong": "Hong Kong, China",
    "Iran": "Iran, Islamic Republic of",
    "Venezuela": "Venezuela, Bolivarian Republic of",
    "Tanzania": "Tanzania, United Republic of",
    "DR Congo": "Congo",
    "Palestine": "Occupied Palestinian Territory",
    "Brunei": "Brunei Darussalam",
    "North Korea": "Korea Democratic People's Republic of",
    "Moldova": "Moldova, Republic of",
    "Syria": "Syrian Arab Republic",
    "North Macedonia": "Macedonia the former Yugoslav Republic of",
    "Laos": "Lao People's Democratic Republic",
    "Eswatini": "Swaziland",
    "Curaçao": "Curacao",
    "East Timor": "Timor-Leste",
    "Saint Vincent and the Grenadines": "Saint Vincent and Grenadines",
    "Sint Maarten": "Sint Maarten (Kingdom of the Netherlands)",
    "São Tomé and Príncipe": "Sao Tome and Principe",
    "Micronesia": "Micronesia Federated States of",
    "Macau": "Macau, China",
}

# Read GDP data from Wikipedia
gdp = (
    pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2]
    .pipe(lambda s: s.set_axis(s.columns.map("_".join), axis=1))
    .loc[
        :,
        [
            "Country/Territory_Country/Territory",
            "UN region_UN region",
            "United Nations[15]_Estimate",
        ],
    ]
    .rename(
        columns={
            "Country/Territory_Country/Territory": "country",
            "UN region_UN region": "UN_region",
            "United Nations[15]_Estimate": "gdp",
        }
    )
    .assign(country=lambda df: df.country.replace(names_to_replace))
    .assign(
        gdp=lambda df: df.gdp.replace(r"\D+", "", regex=True),
        iso3=lambda df: df.country.map(
            country_regions.loc[:, ["alt_country_name", "country_iso3"]]
            .set_index("alt_country_name")
            .squeeze()
            .to_dict()
        ),
    )
    .query("~iso3.isna()")
)
gdp

	country	UN_region	gdp	iso3
1	United States	Americas	23315081	USA
2	China	Asia	17734131	CHN
3	Japan	Asia	4940878	JPN
4	Germany	Europe	4259935	DEU
5	India	Asia	3201471	IND
...	...	...	...	...
209	Anguilla	Americas	303	AIA
210	Kiribati	Oceania	227	KIR
211	Nauru	Oceania	155	NRU
212	Montserrat	Americas	72	MSR
213	Tuvalu	Oceania	60	TUV

209 rows × 4 columns

After obtaining the GDP per capita data, we’ll integrate it into our main dataset.

Code

df = (
    all_countries.assign(
        HPD_with_u=lambda df: df.hoursPerDayCombined + df.uncertaintyCombined
    )
    .assign(
        gdp=lambda df: df.country_iso3.map(gdp.set_index("iso3").gdp.to_dict()).astype(
            float
        )
    )
    .query("~gdp.isna()")
    .assign(
        gdp_per_population=lambda df: df.gdp / df.population.div(1_000_000),
        gdp_per_population_log=lambda df: np.log10(df.gdp_per_population),
        Subcategory=lambda df: df.Subcategory.str.replace(" ", "_"),
    )
)

df

	Category	Subcategory	country_iso3	region_code	population	hoursPerDayCombined	uncertaintyCombined	HPD_with_u	gdp	gdp_per_population	gdp_per_population_log
0	Food provision	Food_preparation	ABW	AM_C	101665	1.47	0.237630	1.707630	3126.0	30748.045050	4.487818
1	Food provision	Food_growth_&_collection	ABW	AM_C	101665	0.17	0.026880	0.196880	3126.0	30748.045050	4.487818
2	Food provision	Food_processing	ABW	AM_C	101665	0.00	0.007526	0.007526	3126.0	30748.045050	4.487818
3	Nonfood provision	Materials	ABW	AM_C	101665	0.03	0.002072	0.032072	3126.0	30748.045050	4.487818
4	Nonfood provision	Energy	ABW	AM_C	101665	0.05	0.003597	0.053597	3126.0	30748.045050	4.487818
...	...	...	...	...	...	...	...	...	...	...	...
4771	Experience oriented	Social	ZWE	AF_E	14645473	0.93	0.594238	1.524238	24118.0	1646.788738	3.216638
4772	Experience oriented	Active_recreation	ZWE	AF_E	14645473	0.39	0.100182	0.490182	24118.0	1646.788738	3.216638
4773	Experience oriented	Meals	ZWE	AF_E	14645473	1.88	1.262100	3.142100	24118.0	1646.788738	3.216638
4774	Experience oriented	Passive	ZWE	AF_E	14645473	2.25	1.815564	4.065564	24118.0	1646.788738	3.216638
4775	Experience oriented	Interactive	ZWE	AF_E	14645473	1.10	0.350849	1.450849	24118.0	1646.788738	3.216638

4512 rows × 11 columns

Which regions allocate the most and least time to each subcategory?

First, to provide an overview of the prevailing trends, we will begin by examining disparities in time allocation across various subcategories within different regions. Following this, we will delve into a more granular analysis (as part of our second question), considering the data at the country level. As depicted in Figure 1, a noticeable pattern emerges. Typically, the minimum and maximum values for each activity type appear quite similar. However, it gets intriguing when we focus on certain categories like ‘Social,’ ‘Food Growth & Collection,’ ‘Schooling & Research,’ ‘Sleep & Bedrest,’ and ‘Passive.’ In these categories, we observe more significant differences, with gaps ranging from 1.05 to 1.59 hours between the maximum and minimum hours spent.

Code

maximum_values = (
    df.groupby(["region_code", "Subcategory"])
    .mean(numeric_only=True)
    .reset_index()
    .groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=False, by="hoursPerDayCombined").head(1))
    .reset_index(drop=True)[["region_code", "Subcategory", "hoursPerDayCombined"]]
)

minimum_values = (
    df.groupby(["region_code", "Subcategory"])
    .mean(numeric_only=True)
    .reset_index()
    .groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=True, by="hoursPerDayCombined").head(1))
    .reset_index(drop=True)[["region_code", "Subcategory", "hoursPerDayCombined"]]
)

# Concatenate the maximum and minimum records
concatenated_records = pd.concat([maximum_values, minimum_values], axis=0).reset_index(
    drop=True
)

# Assign flags, country names, and format subcategory names
(
    concatenated_records.assign(
        Subcategory=lambda df: df.Subcategory.str.replace("_", " "),
        group=lambda df: df.region_code.mask(
            df.region_code.str.contains("EU"), True
        ).mask(~df.region_code.str.contains("EU"), False),
    ).pipe(
        lambda df: ggplot(
            df,
            aes(
                "hoursPerDayCombined",
                as_discrete("Subcategory", order=1, order_by="hoursPerDayCombined"),
            ),
        )
        + geom_point(aes(color="group"), size=5, alpha=0.5, show_legend=False)
        + geom_text(
            aes(label="region_code"),
            position=position_nudge(y=0.25, x=-0),
            size=5,
            na_text="",
            angle=15,
            color="#2b8cbe",
            tooltips=layer_tooltips(
                ["Subcategory", "region_code", "hoursPerDayCombined"]
            ),
        )
        + labs(
            title="Regions with the Most and Least Time Spent on Each Subcategory",  # Title
            subtitle="""Abbreviations: EU = Europe, AF = Africa, AS = Asia, AM = Americas, ANZ = Australia and New Zealand
            E = East, W = West, N = North, S = South, C = Central, M = Middle
            Values from Europe are highlighted in green.
            """,  # Subtitle
            x="Mean Hours per day combined per Region",
            y="",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + scale_x_continuous(
            trans="log10", limits=[0.001, 24]
        )  # Set limits for the x-axis
        + ggsize(800, 600)  # Set the plot size
    )
)

Figure 1: Regions with the Most and Least Time Spent on Each Subcategory

Which countries allocate the most and least time to each subcategory?

As depicted in Figure 2 below, on average, the most significant time allocation is dedicated to Sleep and Bedrest, followed by Passive, Interactive, and Social Activities, as well as Activities Centered on Eating and Drinking, including Associated Socializing. While drawing definitive conclusions from this figure may be challenging, Japan stands out as a unique case. They allocate the least amount of time to Sleep and Bedrest while devoting the most time to Passive Activities and Hygiene and Grooming—an interesting observation.

Code

# Calculate the maximum values for each subcategory
maximum_values = (
    df.groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=False, by="hoursPerDayCombined").head(1))[
        ["country_iso3", "Subcategory", "hoursPerDayCombined"]
    ]
    .reset_index(drop=True)
    .assign(
        iso2=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_iso2.squeeze().to_dict()
        )
    )
)

# Calculate the minimum values for each subcategory
minimum_values = (
    df.groupby(["Subcategory"])
    .apply(lambda x: x.sort_values(ascending=True, by="hoursPerDayCombined").head(1))[
        ["country_iso3", "Subcategory", "hoursPerDayCombined"]
    ]
    .reset_index(drop=True)
    .assign(
        iso2=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_iso2.squeeze().to_dict()
        )
    )
)

# Concatenate the maximum and minimum records
concatenated_records = pd.concat([maximum_values, minimum_values], axis=0).reset_index(
    drop=True
)

# Assign flags, country names, and format subcategory names
(
    concatenated_records.assign(
        flag=lambda df: df.iso2.apply(flag.flag),
        name=lambda df: df.country_iso3.map(
            country_regions.set_index("country_iso3").country_name.squeeze().to_dict()
        ),
        Subcategory=lambda df: df.Subcategory.str.replace("_", " "),
    ).pipe(
        lambda df: ggplot(
            df,
            aes(
                "hoursPerDayCombined",
                as_discrete("Subcategory", order=1, order_by="hoursPerDayCombined"),
            ),
        )
        + geom_point(size=5, alpha=0.25)
        + geom_text(
            aes(label="flag"),
            position=position_nudge(y=0.5, x=-0),
            size=8,
            na_text="",
            angle=15,
            color="#2b8cbe",
            tooltips=layer_tooltips(["Subcategory", "name", "hoursPerDayCombined"]),
        )
        + labs(
            title="Countries with the Most and Least Time Spent on Each Subcategory",  # Title
            subtitle="""On average, the highest time allocation is for Sleep and Bedrest, followed by Passive, Interactive, and 
            Social Activities, and Activities Centered on Eating and Drinking, including Associated Socializing""",  # Subtitle
            x="Hours per day combined for the country",
            y="",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + scale_x_continuous(limits=[0, 11])  # Set limits for the x-axis
        + ggsize(800, 600)  # Set the plot size
    )
)

Figure 2: Countries with the Most and Least Time Spent on Each Subcategory

Is there a discernible relationship between time spent on each category and GDP per capita?

Now, let’s make the most of our ‘gdp_per_population_log’ column. Please note that I’m not an economics expert, and our goal here is to identify intriguing trends without delving too deeply into interpretation. For a comprehensive understanding of the dataset, I recommend referring to the original publication.

Let’s begin by examining the broader trends. We aim to identify activities that appear to exhibit a noteworthy correlation with GDP per population, bearing in mind that GDP per population is log-transformed for our analysis. As evident from Figure 3, we’ve come across some intriguing, and perhaps not entirely surprising, observations. For instance, the data hints at a negative relationship between the hours per day spent on schooling and research and GDP per capita. This finding might raise eyebrows, as conventional wisdom often suggests that higher education leads to increased economic output. However, it’s worth considering that spending too much time in the educational system could potentially limit one’s participation in economic production compared to individuals with lower levels of education. It’s a thought-provoking discovery, and while it may challenge some assumptions, it certainly adds depth to our understanding of the data.

Code

(
    df.assign(Subcategory=lambda df: df.Subcategory.str.replace("_", " ")).pipe(
        lambda df: ggplot(df, aes("hoursPerDayCombined", "gdp_per_population_log"))
        + geom_point(alpha=0.5)
        + geom_smooth()
        + facet_wrap("Subcategory", scales="free")
        + labs(
            title="Relationship between Hours per Day Combined and Log(GDP per Population)",  # Title
            subtitle="""Activities vs. GDP per Capita at the country scale. Each circle on the chart 
            represents the population-average time spent per day for an individual country.""",
            x="Hours per day combined",
            y="log(GDP per population)",  # Y-axis label
            caption="doi.org/10.1073/pnas.2219564120",  # Caption with the source DOI
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 800)  # Set the plot size
    )
)

Figure 3: Relationship between Hours per Day Combined and Log(GDP per Population)

To further explore the connection between ‘gdp_per_population_log’ and ‘hoursPerDayCombined,’ we can utilize the statsmodels library to execute an Ordinary Least Squares (OLS) regression analysis. This method enables us to compute the β coefficients, in addition to establishing upper and lower confidence intervals, as well as deriving the associated p-values. To streamline this process, we’ll simplify our workflow by creating a function called get_linear_regression_table. This function will initially filter the data based on the unique subcategory of interest and subsequently apply the formula: {outcome_variable} ~ {independent_variable}.

Code

def get_linear_regression_table(
    data, outcome_variable, independent_variable, category, covariate
):
    """
    Calculate linear regression coefficients, confidence intervals, and p-values for each unique category or subcategory
    within the specified category column.

    Parameters:
    - data (pandas.DataFrame): The input DataFrame containing the data.
    - outcome_variable (str): The name of the dependent variable.
    - independent_variable (str): The name of the independent variable.
    - category (str): The column name specifying the category or subcategory for grouping.
                      Should be either 'Category' or 'Subcategory'.
    - covariate (str or None): The name of the covariate variable to include in the regression model.
                               Set to None if no covariate is needed.

    Returns:
    - results (pandas.DataFrame): A DataFrame containing the results of linear regression analysis for each unique
                                  category or subcategory. Columns include:
      - 'unique_subcategory': The unique category or subcategory.
      - 'β': The coefficient for the independent_variable in the regression model.
      - 'Lower CI': The lower confidence interval for the coefficient.
      - 'Upper CI': The upper confidence interval for the coefficient.
      - 'p_value': The p-value for the coefficient.

    Example usage:
    >>> results = get_linear_regression_table(data, 'OutcomeVar', 'IndependentVar', 'Category', 'CovariateVar')
    """
    results_list = []

    if category == "Subcategory":
        cat = "Subcategory"
        unique_categories = data[cat].unique().tolist()
    else:
        cat = "Category"
        unique_categories = data[cat].unique().tolist()

    for unique_category in unique_categories:
        temp_df = data.loc[lambda df: df[cat] == unique_category, :]

        if covariate is not None:
            # Fit the linear regression model with covariate
            reg = smf.ols(
                f"{outcome_variable} ~ {covariate} + {independent_variable}",
                data=temp_df,
            ).fit(disp=0)
        else:
            # Fit the linear regression model without covariate
            reg = smf.ols(
                f"{outcome_variable} ~ {independent_variable}", data=temp_df
            ).fit(disp=0)

        # Get the regression coefficients, confidence intervals, and p-values
        beta = reg.params[-1]
        lower_CI = reg.conf_int()[0][-1]
        upper_CI = reg.conf_int()[1][-1]
        pvalue = reg.pvalues[-1]

        # Append the results to the list as a dictionary
        results_list.append(
            {
                "unique_subcategory": unique_category,
                "β": beta,
                "Lower CI": lower_CI,
                "Upper CI": upper_CI,
                "p_value": pvalue,
            }
        )

    # Convert the list of dictionaries into a DataFrame
    results = pd.DataFrame(results_list)
    return results

According to our linear regression analysis, it appears that activities like Materials (involving the extraction of substances for various purposes), Food Processing (which includes food transformation and preservation), and Food Growth & Collection (encompassing activities related to edible organic matter) exhibit the most pronounced negative relationships with GDP per capita. Specifically, the β coefficients for these activities are -6.21, -2.86, and -1.03, respectively, with all associated p-values falling below the 0.05 threshold.
Conversely, we observe the most robust positive relationships with GDP per capita in the case of Active Recreation (involving activities with elevated metabolic activity), Infrastructure (associated with the construction and maintenance of transportation and communication structures), and Waste Management (pertaining to the handling of waste and by-products outside of inhabited areas). These activities exhibit β coefficients of 2.1, 12.17, and 25.17, respectively, with all corresponding p-values falling below the 0.05 significance level.

Code

(
    get_linear_regression_table(
        df,
        outcome_variable="gdp_per_population_log",
        independent_variable="hoursPerDayCombined",
        category="Subcategory",
        covariate=None,
    )
    .assign(unique_subcategory=lambda df: df.unique_subcategory.str.replace("_", " "))
    .sort_values(["β", "p_value"])
    .style.format(precision=3, thousands=",", decimal=".")
    .applymap(
        lambda value: "color: red; font-weight:bold" if (value <= 0.05) else None, subset=["p_value"]
    )
    .background_gradient(cmap="viridis", axis=0, subset="β")
)

	unique_subcategory	β	Lower CI	Upper CI	p_value
3	Materials	-6.215	-7.807	-4.623	0.000
2	Food processing	-2.860	-4.685	-1.034	0.002
1	Food growth & collection	-1.034	-1.194	-0.874	0.000
13	Hygiene & grooming	-0.995	-1.570	-0.420	0.001
14	Schooling & research	-0.880	-1.184	-0.576	0.000
15	Religious practice	-0.659	-1.200	-0.117	0.017
19	Social	-0.617	-0.878	-0.357	0.000
16	Human transportation	-0.204	-0.777	0.369	0.484
9	Sleep & bedrest	-0.196	-0.389	-0.004	0.046
23	Interactive	-0.004	-0.346	0.337	0.980
21	Meals	0.020	-0.313	0.353	0.905
12	Health care	0.187	-0.473	0.846	0.577
11	Physical child care	0.299	-0.308	0.905	0.333
22	Passive	0.472	0.313	0.630	0.000
0	Food preparation	0.549	0.137	0.962	0.009
17	Material transportation	0.618	-0.202	1.438	0.139
8	Inhabited environment	0.830	0.469	1.190	0.000
4	Energy	0.874	-1.552	3.300	0.478
18	Allocation	1.088	0.814	1.362	0.000
7	Artifacts	1.254	0.700	1.809	0.000
6	Buildings	2.057	1.240	2.874	0.000
20	Active recreation	2.105	1.547	2.662	0.000
5	Infrastructure	12.166	9.691	14.641	0.000
10	Waste management	25.172	10.269	40.074	0.001

Although the existing analysis provides valuable insights, we can enhance our understanding further by building a model that considers both regions and different subcategories simultaneously. To achieve this, we’ll re-run our regression model with ‘gdp_per_population_log’ as the outcome variable and ‘hoursPerDayCombined’ as the independent variable. The ‘region_code’ will serve as a categorical variable, with Western Europe set as the reference level. Additionally, ‘Subcategory’ will also be treated as a categorical variable, with ‘Passive Activities’ chosen as the reference level. This approach will help us determine whether the hours spent on various activities exhibit any discernible relationship with GDP per capita.

As evident from the table below, none of the coefficients associated with the various subcategories showed significant differences compared to zero. This suggests that these subcategories do not exert a substantial effect on ‘gdp_per_population_log.’ Additionally, the β coefficient for ‘hoursPerDayCombined’ was also not statistically distinct from zero.

Conversely, when considering the impact of regions, we observe a substantial influence on the correlation. Most regions exhibit negative coefficients in comparison to Western Europe. The exceptions to this trend are North America, New Zealand, and Australia, where the corresponding p-values exceeded 0.05. These findings suggest that the region indeed plays a significant role in the correlation with ‘gdp_per_population_log.’

Code

res = smf.ols(
    """gdp_per_population_log ~  hoursPerDayCombined
+ C(region_code, Treatment(reference='EU_W')) 
+ C(Subcategory, Treatment(reference='Passive')) 
""",
    data=df,
).fit(disp=0)

res.summary2().tables[1]
(
    res.summary2()
    .tables[1]
    .style.format(precision=3, thousands=",", decimal=".")
    .applymap(
        lambda value: "color: red; font-weight:bold" if (value <= 0.05) else None, subset=["P>|t|"]
    )
    .background_gradient(cmap="viridis", axis=0, subset="Coef.")
)

	Coef.	Std.Err.	t	P>\|t\|	[0.025	0.975]
Intercept	4.839	0.077	62.658	0.000	4.688	4.990
C(region_code, Treatment(reference='EU_W'))[T.AF_E]	-1.744	0.035	-49.872	0.000	-1.812	-1.675
C(region_code, Treatment(reference='EU_W'))[T.AF_M]	-1.448	0.041	-35.646	0.000	-1.528	-1.369
C(region_code, Treatment(reference='EU_W'))[T.AF_N]	-1.195	0.046	-25.993	0.000	-1.285	-1.105
C(region_code, Treatment(reference='EU_W'))[T.AF_S]	-1.204	0.046	-26.200	0.000	-1.294	-1.114
C(region_code, Treatment(reference='EU_W'))[T.AF_W]	-1.731	0.036	-48.166	0.000	-1.801	-1.660
C(region_code, Treatment(reference='EU_W'))[T.AM_C]	-0.859	0.034	-25.209	0.000	-0.926	-0.792
C(region_code, Treatment(reference='EU_W'))[T.AM_N]	-0.042	0.063	-0.673	0.501	-0.166	0.081
C(region_code, Treatment(reference='EU_W'))[T.AM_S]	-0.961	0.037	-25.737	0.000	-1.034	-0.888
C(region_code, Treatment(reference='EU_W'))[T.ANZ]	-0.002	0.063	-0.031	0.975	-0.125	0.121
C(region_code, Treatment(reference='EU_W'))[T.AS_C]	-1.341	0.046	-29.170	0.000	-1.431	-1.251
C(region_code, Treatment(reference='EU_W'))[T.AS_E]	-0.675	0.042	-16.077	0.000	-0.757	-0.592
C(region_code, Treatment(reference='EU_W'))[T.AS_S]	-1.424	0.040	-35.999	0.000	-1.502	-1.347
C(region_code, Treatment(reference='EU_W'))[T.AS_SE]	-1.119	0.034	-32.659	0.000	-1.186	-1.052
C(region_code, Treatment(reference='EU_W'))[T.AS_W]	-0.864	0.035	-24.721	0.000	-0.933	-0.796
C(region_code, Treatment(reference='EU_W'))[T.EU_E]	-0.780	0.039	-20.176	0.000	-0.856	-0.705
C(region_code, Treatment(reference='EU_W'))[T.EU_N]	-0.131	0.039	-3.390	0.001	-0.207	-0.055
C(region_code, Treatment(reference='EU_W'))[T.EU_S]	-0.645	0.037	-17.276	0.000	-0.718	-0.572
C(Subcategory, Treatment(reference='Passive'))[T.Active_recreation]	-0.000	0.067	-0.006	0.995	-0.133	0.132
C(Subcategory, Treatment(reference='Passive'))[T.Allocation]	-0.000	0.057	-0.005	0.996	-0.113	0.112
C(Subcategory, Treatment(reference='Passive'))[T.Artifacts]	-0.000	0.070	-0.006	0.995	-0.137	0.136
C(Subcategory, Treatment(reference='Passive'))[T.Buildings]	-0.000	0.072	-0.006	0.995	-0.142	0.141
C(Subcategory, Treatment(reference='Passive'))[T.Energy]	-0.000	0.076	-0.006	0.995	-0.149	0.148
C(Subcategory, Treatment(reference='Passive'))[T.Food_growth_&_collection]	-0.000	0.065	-0.006	0.995	-0.127	0.127
C(Subcategory, Treatment(reference='Passive'))[T.Food_preparation]	-0.000	0.060	-0.006	0.996	-0.118	0.117
C(Subcategory, Treatment(reference='Passive'))[T.Food_processing]	-0.000	0.075	-0.006	0.995	-0.148	0.147
C(Subcategory, Treatment(reference='Passive'))[T.Health_care]	-0.000	0.072	-0.006	0.995	-0.141	0.140
C(Subcategory, Treatment(reference='Passive'))[T.Human_transportation]	-0.000	0.059	-0.005	0.996	-0.116	0.115
C(Subcategory, Treatment(reference='Passive'))[T.Hygiene_&_grooming]	-0.000	0.057	-0.005	0.996	-0.112	0.112
C(Subcategory, Treatment(reference='Passive'))[T.Infrastructure]	-0.000	0.076	-0.006	0.995	-0.149	0.148
C(Subcategory, Treatment(reference='Passive'))[T.Inhabited_environment]	-0.000	0.060	-0.006	0.996	-0.117	0.116
C(Subcategory, Treatment(reference='Passive'))[T.Interactive]	-0.000	0.058	-0.005	0.996	-0.113	0.113
C(Subcategory, Treatment(reference='Passive'))[T.Material_transportation]	-0.000	0.071	-0.006	0.995	-0.139	0.138
C(Subcategory, Treatment(reference='Passive'))[T.Materials]	-0.000	0.075	-0.006	0.995	-0.148	0.147
C(Subcategory, Treatment(reference='Passive'))[T.Meals]	-0.000	0.048	-0.004	0.997	-0.094	0.094
C(Subcategory, Treatment(reference='Passive'))[T.Physical_child_care]	-0.000	0.070	-0.006	0.995	-0.138	0.137
C(Subcategory, Treatment(reference='Passive'))[T.Religious_practice]	-0.000	0.072	-0.006	0.995	-0.142	0.141
C(Subcategory, Treatment(reference='Passive'))[T.Schooling_&_research]	-0.000	0.056	-0.005	0.996	-0.110	0.109
C(Subcategory, Treatment(reference='Passive'))[T.Sleep_&_bedrest]	0.001	0.156	0.007	0.994	-0.304	0.307
C(Subcategory, Treatment(reference='Passive'))[T.Social]	-0.000	0.057	-0.005	0.996	-0.112	0.111
C(Subcategory, Treatment(reference='Passive'))[T.Waste_management]	-0.000	0.077	-0.006	0.995	-0.151	0.150
hoursPerDayCombined	-0.000	0.024	-0.007	0.994	-0.048	0.047

Regression diagnostic

Next, we’ll perform some diagnostic procedures for our regression model. These diagnostics will serve the purpose of evaluating how well our model aligns with its underlying assumptions. This step is crucial for ensuring the reliability and validity of our regression analysis.
As depicted in Figure 4, our residuals exhibit a relatively normal distribution. While there are a few data points that may have contributed to some erroneous predictions, we have opted to maintain the current state of the analysis as the results appear reasonable.

Code

# Create a histogram plot of residuals
hist_plot = res.resid.to_frame(name="residuals").pipe(
    lambda df: ggplot(df, aes("residuals"))
    + geom_histogram(bins=15)
    + labs(
        title="Histogram of Residuals",
    )
    + theme(plot_title=element_text(size=15, face="bold"))
)

# Create a residuals plot
residuals_plot = (
    res.resid.to_frame(name="residuals")
    .reset_index()  # Reset the index for plotting
    .pipe(
        lambda df: ggplot(df, aes("index", "residuals"))
        + geom_point(size=5, alpha=0.05)  # Plot residuals as points
        + labs(
            title="Residuals",  # Title for the residuals plot
        )
        + theme(
            plot_title=element_text(size=15, face="bold")  # Customize title appearance
        )
    )
)


# Create a Q-Q plot of residuals
qqplot = res.resid.to_frame(name="residuals").pipe(
    lambda df: qq_plot(data=df, sample="residuals", size=5, alpha=0.05)
    + labs(
        title="Q-Q Plot",
    )
    + theme(plot_title=element_text(size=15, face="bold"))
)

# Create a fit plot comparing predicted values and log(GDP per population)
fit_plot = (
    pd.concat([res.fittedvalues, df.gdp_per_population_log], axis=1)
    .rename(
        columns={
            0: "Predicted value",
            "gdp_per_population_log": "log(GDP per population)",
        }
    )
    .pipe(
        lambda df: ggplot(df, aes("Predicted value", "log(GDP per population)"))
        + geom_point(size=5, alpha=0.05)
        + labs(
            title="Fit Plot",
        )
        + theme(plot_title=element_text(size=15, face="bold"))
        + geom_abline(slope=1, size=1, linetype="dashed", color="red")
    )
)

# Combine the three plots into a grid with two columns
(gggrid([hist_plot, residuals_plot, qqplot, fit_plot], ncol=2) + ggsize(800, 600))

Figure 4: Regression diagnostic

And there you have it, folks! We hope you’ve found this week’s exploration insightful. Happy coding to all, and I look forward to seeing you next week for more exciting discoveries and learning opportunities. Until then, stay curious and keep exploring! 👩‍💻👨‍💻