Tidy Tuesday: Spam E-mail

Tidy Tuesday
Author

Adam Cseresznye

Published

August 15, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Welcome to this week’s Tidy Tuesday! Today, we’re delving into the intriguing yet bothersome realm of email spam. You know, those unsolicited messages that flood our inboxes? They go by many names, like junk email or spam mail, and they’re sent in bulk. The term “spam” comes from a Monty Python sketch in which the word “Spam” was everywhere, much like these emails. Spam has risen steadily since the early 1990s, making up around 90% of all email traffic by 2014.

We’ve got our data from the Tidy Tuesday treasure trove over at GitHub! This dataset, from Vincent Arel-Bundock’s Rdatasets package, was initially gathered at Hewlett-Packard Labs. They later kindly shared it with the UCI Machine Learning Repository.

This treasure trove of information consists of 4,601 emails sorted into spam and non-spam categories.

Setup

Code
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import polars.selectors as cs
import numpy as np
from sklearn import preprocessing
from sklearn import decomposition


from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

import plotly.io as pio
import plotly.express as px

pio.templates.default = "presentation"
Code
df = pl.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-15/spam.csv"
).with_columns(
    [
        pl.col("crl.tot").cast(pl.Int16),
        pl.col("dollar").cast(pl.Float32),
        pl.col("bang").cast(pl.Float32),
        pl.col("money").cast(pl.Float32),
        pl.col("n000").cast(pl.Float32),
        pl.col("make").cast(pl.Float32),
        pl.col("yesno").map_dict({"y": 1, "n": 0}).cast(pl.Int8),
    ]
)

df.sample(5)
shape: (5, 7)
crl.tot dollar bang money n000 make yesno
i16 f32 f32 f32 f32 f32 i8
5 0.0 0.0 0.0 0.0 0.0 0
57 0.0 0.374 0.0 0.0 0.0 0
75 0.0 0.0 0.0 0.0 0.0 0
23 0.0 0.0 0.0 0.0 0.0 0
65 0.0 0.0 0.0 0.0 0.0 0

Here is the data dictionary:

Variable Class Description
crl.tot double Total length of uninterrupted sequences of capitals
dollar double Occurrences of the dollar sign, as percent of total number of characters
bang double Occurrences of ‘!’, as percent of total number of characters
money double Occurrences of ‘money’, as percent of total number of characters
n000 double Occurrences of the string ‘000’, as percent of total number of words
make double Occurrences of ‘make’, as a percent of total number of words
yesno character Outcome variable, a factor with levels ‘n’ not spam, ‘y’ spam

Looking at the data dictionary, a few intriguing questions arise:

  • What is the distribution of spam and non-spam emails? Are non-spam emails significantly more prevalent?
  • What are the top words that frequently appear in spam emails?
  • How do the relative frequency distributions of the words associated with spam differ between the two classes? How much more likely are these words to appear in a spam email than in a non-spam one?
  • Is there an overarching correlation within the dataset? Are certain words frequently seen together?
  • Could we pinpoint the ultimate worst spam email in the dataset?
  • Would it be feasible to employ principal component analysis to visualize this dataset effectively?

What’s the distribution of spam and non-spam emails?

Code
(
    df.select(pl.col("yesno"))
    .to_series()
    .value_counts()
    .with_columns((pl.col("counts") / pl.sum("counts")).alias("percent_count"))
)
shape: (2, 3)
yesno counts percent_count
i8 u32 f64
0 2788 0.605955
1 1813 0.394045
Note

As you’ve observed, the dataset displays a slight imbalance, with spam emails constituting approximately 40% of the dataset.

What are the most frequently used words in spam emails?

Code
fig = (
    df.filter(pl.col("yesno") == 1)
    .select(cs.float())
    .melt()
    .groupby("variable")
    .agg(
        pl.col("value").mean().alias("mean"),
    )
    .sort(by="mean")
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="mean",
            color="variable",
            width=600,
            height=500,
            labels={
                "mean": "Average % occurrence per total word count",
                "variable": "",
            },
        )
    )
)
fig.update_traces(hovertemplate="<br>Occurrence: %{y:.2f}")
Figure 1: Most Commonly Used Terms in Spam Messages
Note

Among the most frequent features, “bang” (the ‘!’ character) takes a noticeable spot, making up on average 0.51% of total characters.

What are the relative frequency distribution differences of the words associated with spam?

Code
fig = (
    df.select(pl.all().exclude("crl.tot"))
    .groupby("yesno")
    .agg(
        pl.all().mean(),
    )
    .to_pandas()
    .apply(lambda x: x / x.sum(), axis=0)
    .melt(id_vars="yesno")
    .rename(columns={"yesno": "spam"})
    .assign(spam=lambda df: df.spam.map({0: "Not a spam", 1: "Spam"}))
    .assign(percentage=lambda x: ["{:.2%}".format(val) for val in x["value"]])
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="value",
            color="spam",
            labels={
                "value": "Relative frequency of words compared to class ",
                "variable": "",
            },
            text="percentage",
            width=700,
            height=600,
        )
    )
)

fig.update_layout(
    showlegend=True, uniformtext_minsize=8, uniformtext_mode="hide", hovermode=False
)
fig.update_traces(
    textposition="inside",
)
fig.update_yaxes(tickformat=".0%")
Figure 2: Differences in Relative Frequency Distribution of Words Associated with Spam
Note

The word “n000” is 36x more likely to appear in a spam email.
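
The 36x figure can be read off Figure 2, but it’s easy to verify directly. Here is a minimal sketch (reusing the df loaded above) that divides each feature’s mean occurrence in spam emails by its mean occurrence in non-spam emails:

Code
(
    df.groupby("yesno")
    .agg(pl.all().exclude("crl.tot").mean())
    .sort("yesno")
    .to_pandas()
    .set_index("yesno")
    # ratio of spam means to non-spam means, per feature
    .pipe(lambda means: means.loc[1] / means.loc[0])
)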

What’s the overall correlation?

Code
fig = (
    df.to_pandas()
    .corr()
    .round(2)
    .pipe(
        lambda df: px.imshow(
            df,
            text_auto=True,
            aspect="equal",
            zmin=-1,
            zmax=1,
            color_continuous_scale="RdBu_r",
            # origin='lower',
            width=600,
            height=600,
        )
    )
)
fig.update_xaxes(side="top")
Figure 3: Correlation Across All Terms
Note

Our target variable, “yesno,” exhibits the strongest moderate positive correlation with the occurrence of “n000.”
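
If you prefer numbers over colors, a quick sketch (reusing the same pandas conversion as above) pulls out every feature’s correlation with the target and sorts it:

Code
(
    df.to_pandas()
    .corr()["yesno"]  # correlation of each column with the target
    .drop("yesno")  # drop the trivial self-correlation
    .sort_values(ascending=False)
    .round(2)
)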

What is the worst spam email?

To identify the most significant culprit within the dataset, we’ll sum the percentage-based columns (leaving out “crl.tot” and the target) and sort by that total.

Code
(
    df.filter(pl.col("yesno") == 1)
    .with_columns(pl.sum_horizontal(pl.all().exclude("crl.tot", "yesno")))
    .sort(by="sum", descending=True)
    .head()
)
shape: (5, 8)
crl.tot dollar bang money n000 make yesno sum
i16 f32 f32 f32 f32 f32 i8 f32
10 0.0 3.076 9.09 0.0 4.54 1 16.706001
2 0.0 0.0 12.5 0.0 0.0 1 12.5
344 5.3 0.904 0.82 0.41 1.24 1 8.674
344 5.3 0.904 0.82 0.41 1.24 1 8.674
15 0.0 7.843 0.0 0.0 0.0 1 7.843
Note

The most concerning spam email in our dataset shows an occurrence of roughly 3% for ‘!’, 9% for “money,” and 4.5% for “make.” Yikes.

Visualize data with PCA

To enhance our grasp of the data, we can employ Principal Component Analysis (PCA). This technique comes in handy when we’re dealing with multiple independent variables, allowing us to reduce dimensionality and streamline the dataset.

Visualize all the principal components

Code
X = df.select(pl.all().exclude("yesno")).to_pandas()
y = df.select(pl.col("yesno")).to_pandas()
Code
# Before PCA we need to scale and transform our dataset

scaler = preprocessing.PowerTransformer().set_output(transform="pandas")

X = scaler.fit_transform(X)
Code
pca = decomposition.PCA()
components = pca.fit_transform(X)
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    opacity=0.2,
    labels=labels,
    dimensions=range(6),
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    width=1000,
    height=1000,
)
fig.update_traces(diagonal_visible=False)
fig.show()
Figure 4: Visualization of All Principal Components
Note

As is evident, the first two principal components together account for about 62% of the variance in the data. A distinct separation of classes is noticeable, with the spam group in particular exhibiting a broader distribution, indicating greater diversity.
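
That 62% figure comes straight from the fitted PCA object; a quick sketch (reusing the pca fitted above) prints the cumulative explained variance:

Code
# Cumulative share of variance explained by the leading principal components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(cumulative_variance[:2])  # PC1 alone, and PC1 + PC2 together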

Visualize Loadings

For a deeper comprehension of how each characteristic influences our principal components, we can delve into examining the loadings.

Code
pca = decomposition.PCA(n_components=2)
components = pca.fit_transform(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)


fig = px.scatter(
    components,
    x=0,
    y=1,
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    labels={
        "0": f"PC1 ({pca.explained_variance_ratio_[0]:.0%})",
        "1": f"PC2 ({pca.explained_variance_ratio_[1]:.0%})",
    },
    template="plotly_dark",
    color_discrete_sequence=[
        "red",
        "green",
    ],
    opacity=0.4,
    width=700,
    height=700,
)
for i, feature in enumerate(X.columns):
    fig.add_annotation(
        ax=0,
        ay=0,
        axref="x",
        ayref="y",
        x=loadings[i, 0],
        y=loadings[i, 1],
        showarrow=True,
        arrowsize=1,
        arrowhead=2,
        xanchor="right",
        yanchor="top",
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0,
        ay=0,
        font=dict(
            size=20,
            # color='yellow'
        ),
        xanchor="left",
        yanchor="bottom",
        text=feature,
        yshift=5,
    )
fig
Figure 5: Loadings Plot
Note

The loadings plot reveals a strong correlation among the variables “crl.tot,” “money,” and “n000.” If we were to build a machine learning model for spam recognition, we could keep just one of these variables to streamline the dataset. The dissociation of “bang” from this trio is also evident: its vector sits at roughly a 90° angle to theirs, indicating little correlation.
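
To put a number on that reading of the biplot, we can compute the cosine of the angle between each pair of loading vectors. With only two components retained this is merely an approximation of the correlation between the transformed features, so treat it as a rough sketch (it reuses the loadings array and the scaled X from above, and imports pandas purely for display):

Code
import pandas as pd

# Cosine similarity between loading vectors: values near 1 mean the arrows
# point the same way, values near 0 mean they are roughly perpendicular
# (as with "bang" versus the crl.tot / money / n000 trio).
norms = np.linalg.norm(loadings, axis=1)
cosine = pd.DataFrame(
    loadings @ loadings.T / np.outer(norms, norms),
    index=X.columns,
    columns=X.columns,
)
print(cosine.round(2))
print(X.corr().round(2))  # compare with the actual correlations of the scaled features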

Alright folks, our journey into the intriguing realm of spam emails has left us with some valuable insights. It’s clear that features like “bang” and “n000” are key indicators to watch out for. Luckily, modern machine learning models are here to do the hard work on our behalf.

Stay vigilant out there and take care! See you in next week’s adventure! 👋📧🛡️