Tidy Tuesday: Spam E-mail

Tidy Tuesday
Author

Adam Cseresznye

Published

August 15, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Welcome to this week’s Tidy Tuesday! Today, we’re delving into the intriguing yet bothersome realm of email spam. You know, those unsolicited messages that flood our inboxes? They go by many names, like junk email or spam mail, and they’re sent in bulk. The term “spam” comes from a Monty Python sketch in which the word “Spam” was everywhere, much like these emails. Spam has risen steadily since the early 1990s, making up around 90% of all email traffic by 2014.

We’ve got our data from the Tidy Tuesday treasure trove over at GitHub! This dataset, from Vincent Arel-Bundock’s Rdatasets package, was initially gathered at Hewlett-Packard Labs. They later kindly shared it with the UCI Machine Learning Repository.

This treasure trove of information consists of 4,601 emails sorted into spam and non-spam categories.

Setup

Code
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import polars.selectors as cs
import numpy as np
from sklearn import preprocessing
from sklearn import decomposition


from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

import plotly.io as pio
import plotly.express as px

pio.templates.default = "presentation"
Code
df = pl.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-15/spam.csv"
).with_columns(
    [
        pl.col("crl.tot").cast(pl.Int16),
        pl.col("dollar").cast(pl.Float32),
        pl.col("bang").cast(pl.Float32),
        pl.col("money").cast(pl.Float32),
        pl.col("n000").cast(pl.Float32),
        pl.col("make").cast(pl.Float32),
        pl.col("yesno").map_dict({"y": 1, "n": 0}).cast(pl.Int8),
    ]
)

df.sample(5)
shape: (5, 7)
crl.tot dollar bang money n000 make yesno
i16 f32 f32 f32 f32 f32 i8
5 0.0 0.0 0.0 0.0 0.0 0
57 0.0 0.374 0.0 0.0 0.0 0
75 0.0 0.0 0.0 0.0 0.0 0
23 0.0 0.0 0.0 0.0 0.0 0
65 0.0 0.0 0.0 0.0 0.0 0

Here is the data dictionary:

Variable Class Description
crl.tot double Total length of uninterrupted sequences of capitals
dollar double Occurrences of the dollar sign, as percent of total number of characters
bang double Occurrences of ‘!’, as percent of total number of characters
money double Occurrences of ‘money’, as percent of total number of characters
n000 double Occurrences of the string ‘000’, as percent of total number of words
make double Occurrences of ‘make’, as a percent of total number of words
yesno character Outcome variable, a factor with levels ‘n’ not spam, ‘y’ spam

Looking at the data dictionary, a few intriguing questions arise:

  • What is the distribution of spam and non-spam emails? Are non-spam emails significantly more prevalent?
  • What are the top words that frequently appear in spam emails?
  • How do the relative frequency distributions of the words associated with spam differ between the two classes? How much more likely are these words to appear in a spam email than in a non-spam one?
  • Is there an overarching correlation within the dataset? Are certain words frequently seen together?
  • Could we pinpoint the ultimate worst spam email in the dataset?
  • Would it be feasible to employ principal component analysis to visualize this dataset effectively?

What’s the distribution of spam and non-spam emails?

Code
(
    df.select(pl.col("yesno"))
    .to_series()
    .value_counts()
    .with_columns((pl.col("counts") / pl.sum("counts")).alias("percent_count"))
)
shape: (2, 3)
yesno counts percent_count
i8 u32 f64
0 2788 0.605955
1 1813 0.394045
Note

As you’ve observed, the dataset displays a slight imbalance, with spam emails constituting approximately 40% of the dataset.

What are the most frequently used words in spam emails?

Code
fig = (
    df.filter(pl.col("yesno") == 1)
    .select(cs.float())
    .melt()
    .groupby("variable")
    .agg(
        pl.col("value").mean().alias("mean"),
    )
    .sort(by="mean")
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="mean",
            color="variable",
            width=600,
            height=500,
            labels={
                "mean": "Average % occurrence per total word count",
                "variable": "",
            },
        )
    )
)
fig.update_traces(hovertemplate="<br>Occurrence: %{y:.2f}")
Figure 1: Most Commonly Used Terms in Spam Messages
Note

Among the most frequent features, “bang” (the ‘!’ character) takes a noticeable spot, making up on average 0.51% of total characters.

What are the relative frequency distribution differences of the words associated with spam?

Code
fig = (
    df.select(pl.all().exclude("crl.tot"))
    .groupby("yesno")
    .agg(
        pl.all().mean(),
    )
    .to_pandas()
    .apply(lambda x: x / x.sum(), axis=0)
    .melt(id_vars="yesno")
    .rename(columns={"yesno": "spam"})
    .assign(spam=lambda df: df.spam.map({0: "Not a spam", 1: "Spam"}))
    .assign(percentage=lambda x: ["{:.2%}".format(val) for val in x["value"]])
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="value",
            color="spam",
            labels={
                "value": "Relative frequency of words compared to class ",
                "variable": "",
            },
            text="percentage",
            width=700,
            height=600,
        )
    )
)

fig.update_layout(
    showlegend=True, uniformtext_minsize=8, uniformtext_mode="hide", hovermode=False
)
fig.update_traces(
    textposition="inside",
)
fig.update_yaxes(tickformat=".0%")
Figure 2: Differences in Relative Frequency Distribution of Words Associated with Spam
Note

The word “n000” is 36x more likely to appear in a spam email.
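
The 36x figure can be read off Figure 2, but it’s easy to verify directly. Here is a minimal sketch (reusing the df loaded above) that divides each feature’s mean occurrence in spam emails by its mean occurrence in non-spam emails:

Code
(
    df.groupby("yesno")
    .agg(pl.all().exclude("crl.tot").mean())
    .sort("yesno")
    .to_pandas()
    .set_index("yesno")
    # ratio of spam means to non-spam means, per feature
    .pipe(lambda means: means.loc[1] / means.loc[0])
)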

What’s the overall correlation?

Code
fig = (
    df.to_pandas()
    .corr()
    .round(2)
    .pipe(
        lambda df: px.imshow(
            df,
            text_auto=True,
            aspect="equal",
            zmin=-1,
            zmax=1,
            color_continuous_scale="RdBu_r",
            # origin='lower',
            width=600,
            height=600,
        )
    )
)
fig.update_xaxes(side="top")
Figure 3: Correlation Across All Terms
Note

Our target variable, “yesno,” exhibits the strongest moderate positive correlation with the occurrence of “n000.”
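
If you prefer numbers over colors, a quick sketch (reusing the same pandas conversion as above) pulls out every feature’s correlation with the target and sorts it:

Code
(
    df.to_pandas()
    .corr()["yesno"]  # correlation of each column with the target
    .drop("yesno")  # drop the trivial self-correlation
    .sort_values(ascending=False)
    .round(2)
)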

What is the worst spam email?

To identify the most significant culprit within the dataset, we’ll sum the percentage-based columns (leaving out “crl.tot” and the target) and sort by that total.

Code
(
    df.filter(pl.col("yesno") == 1)
    .with_columns(pl.sum_horizontal(pl.all().exclude("crl.tot", "yesno")))
    .sort(by="sum", descending=True)
    .head()
)
shape: (5, 8)
crl.tot dollar bang money n000 make yesno sum
i16 f32 f32 f32 f32 f32 i8 f32
10 0.0 3.076 9.09 0.0 4.54 1 16.706001
2 0.0 0.0 12.5 0.0 0.0 1 12.5
344 5.3 0.904 0.82 0.41 1.24 1 8.674
344 5.3 0.904 0.82 0.41 1.24 1 8.674
15 0.0 7.843 0.0 0.0 0.0 1 7.843
Note

The most concerning spam email in our dataset shows an occurrence of roughly 3% for ‘!’, 9% for “money,” and 4.5% for “make.” Yikes.

Visualize data with PCA

To enhance our grasp of the data, we can employ Principal Component Analysis (PCA). This technique comes in handy when we’re dealing with multiple independent variables, allowing us to reduce dimensionality and streamline the dataset.

Visualize all the principal components

Code
X = df.select(pl.all().exclude("yesno")).to_pandas()
y = df.select(pl.col("yesno")).to_pandas()
Code
# Before PCA we need to scale and transform our dataset

scaler = preprocessing.PowerTransformer().set_output(transform="pandas")

X = scaler.fit_transform(X)
Code
pca = decomposition.PCA()
components = pca.fit_transform(X)
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    opacity=0.2,
    labels=labels,
    dimensions=range(6),
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    width=1000,
    height=1000,
)
fig.update_traces(diagonal_visible=False)
fig.show()
Figure 4: Visualization of All Principal Components
Note

As is evident, the first two principal components together account for about 62% of the variance in the data. A distinct separation of classes is noticeable, with the spam group in particular exhibiting a broader distribution, indicating greater diversity.
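
That 62% figure comes straight from the fitted PCA object; a quick sketch (reusing the pca fitted above) prints the cumulative explained variance:

Code
# Cumulative share of variance explained by the leading principal components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(cumulative_variance[:2])  # PC1 alone, and PC1 + PC2 together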

Visualize Loadings

For a deeper comprehension of how each characteristic influences our principal components, we can delve into examining the loadings.

Code
pca = decomposition.PCA(n_components=2)
components = pca.fit_transform(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)


fig = px.scatter(
    components,
    x=0,
    y=1,
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    labels={
        "0": f"PC1 ({pca.explained_variance_ratio_[0]:.0%})",
        "1": f"PC2 ({pca.explained_variance_ratio_[1]:.0%})",
    },
    template="plotly_dark",
    color_discrete_sequence=[
        "red",
        "green",
    ],
    opacity=0.4,
    width=700,
    height=700,
)
for i, feature in enumerate(X.columns):
    fig.add_annotation(
        ax=0,
        ay=0,
        axref="x",
        ayref="y",
        x=loadings[i, 0],
        y=loadings[i, 1],
        showarrow=True,
        arrowsize=1,
        arrowhead=2,
        xanchor="right",
        yanchor="top",
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0,
        ay=0,
        font=dict(
            size=20,
            # color='yellow'
        ),
        xanchor="left",
        yanchor="bottom",
        text=feature,
        yshift=5,
    )
fig
Figure 5: Loadings Plot
Note

The loadings plot reveals a strong correlation among the variables “crl.tot,” “money,” and “n000.” If we were to build a machine learning model for spam recognition, we could keep just one of these variables to streamline the dataset. The dissociation of “bang” from this trio is also evident: its vector sits at roughly a 90° angle to theirs, indicating little correlation.
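
To put a number on that reading of the biplot, we can compute the cosine of the angle between each pair of loading vectors. With only two components retained this is merely an approximation of the correlation between the transformed features, so treat it as a rough sketch (it reuses the loadings array and the scaled X from above, and imports pandas purely for display):

Code
import pandas as pd

# Cosine similarity between loading vectors: values near 1 mean the arrows
# point the same way, values near 0 mean they are roughly perpendicular
# (as with "bang" versus the crl.tot / money / n000 trio).
norms = np.linalg.norm(loadings, axis=1)
cosine = pd.DataFrame(
    loadings @ loadings.T / np.outer(norms, norms),
    index=X.columns,
    columns=X.columns,
)
print(cosine.round(2))
print(X.corr().round(2))  # compare with the actual correlations of the scaled features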

Alright folks, our journey into the intriguing realm of spam emails has left us with some valuable insights. It’s clear that features like “bang” and “n000” are key indicators to watch out for. Luckily, modern machine learning models are here to do the hard work on our behalf.

Stay vigilant out there and take care! See you in next week’s adventure! 👋📧🛡️