Welcome to this week’s Tidy Tuesday! Today, we’re delving into the intriguing yet bothersome realm of email spam. You know, those unsolicited messages that flood our inboxes? They go by many names, like junk email or spam mail, and they’re sent in bulk. The term “spam” comes from a Monty Python sketch in which the word “Spam” was everywhere, just like these emails. Since the early 1990s, spam volumes have risen steadily, making up around 90% of all email traffic by 2014.
We’ve got our data from the Tidy Tuesday treasure trove over at [GitHub](https://github.com/rfordatascience/tidytuesday/tree/master/data/2023/2023-08-15)! This dataset, from Vincent Arel-Bundock’s Rdatasets package, was initially gathered at Hewlett-Packard Labs and later shared with the UCI Machine Learning Repository.
The dataset consists of 4,601 emails sorted into spam and non-spam categories.
# Setup

```python
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import polars.selectors as cs
import numpy as np
from sklearn import preprocessing
from sklearn import decomposition
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

import plotly.io as pio
import plotly.express as px

pio.templates.default = "presentation"
```

```python
df = pl.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-15/spam.csv"
).with_columns(
    [
        pl.col("crl.tot").cast(pl.Int16),
        pl.col("dollar").cast(pl.Float32),
        pl.col("bang").cast(pl.Float32),
        pl.col("money").cast(pl.Float32),
        pl.col("n000").cast(pl.Float32),
        pl.col("make").cast(pl.Float32),
        pl.col("yesno").map_dict({"y": 1, "n": 0}).cast(pl.Int8),
    ]
)
df.sample(5)
```
Here is the data dictionary:

| Variable | Class     | Description                                                               |
|----------|-----------|---------------------------------------------------------------------------|
| crl.tot  | double    | Total length of uninterrupted sequences of capitals                       |
| dollar   | double    | Occurrences of the dollar sign, as percent of total number of characters  |
| bang     | double    | Occurrences of ‘!’, as percent of total number of characters              |
| money    | double    | Occurrences of ‘money’, as percent of total number of characters          |
| n000     | double    | Occurrences of the string ‘000’, as percent of total number of words      |
| make     | double    | Occurrences of ‘make’, as percent of total number of words                |
| yesno    | character | Outcome variable, a factor with levels ‘n’ (not spam) and ‘y’ (spam)      |
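To make these definitions concrete, here’s a tiny illustrative sketch (the helper below is hypothetical, not part of the original dataset’s pipeline) of how a character-based feature like “bang” could be computed from raw text:

```python
# Hypothetical helper, for illustration only: the `bang` feature is
# defined as occurrences of '!' as a percent of total characters.
def bang_pct(text: str) -> float:
    return 100 * text.count("!") / len(text)


print(bang_pct("MAKE MONEY FAST!!!"))  # 3 of 18 characters -> ~16.7%
```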
Looking at the data dictionary, a few intriguing questions arise:

* What is the distribution of spam versus non-spam emails? Are non-spam emails significantly more prevalent?
* What are the top words that frequently appear in spam emails?
* How do the relative frequency distributions of spam-associated words differ between classes? How much more likely are these words to appear in a spam email than in a non-spam one?
* Is there an overarching correlation structure within the dataset? Are certain words frequently seen together?
* Could we pinpoint the single worst spam email in the dataset?
* Would it be feasible to employ principal component analysis to visualize this dataset effectively?
# What’s the distribution of True and False answers?

```python
(
    df.select(pl.col("yesno"))
    .to_series()
    .value_counts()
    .with_columns((pl.col("counts") / pl.sum("counts")).alias("percent_count"))
)
```

Note
As you’ve observed, the dataset displays a slight imbalance, with spam emails constituting approximately 40% of the dataset.

# What are the most frequently used words in spam emails?

```python
fig = (
    df.filter(pl.col("yesno") == 1)
    .select(cs.float())
    .melt()
    .groupby("variable")
    .agg(
        pl.col("value").mean().alias("mean"),
    )
    .sort(by="mean")
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="mean",
            color="variable",
            width=600,
            height=500,
            labels={
                "mean": "Average % occurrence per total word count",
                "variable": "",
            },
        )
    )
)
fig.update_traces(hovertemplate="<br>Occurrence: %{y:.2f}")
```

Note
Among the frequently used words, “bang” takes a noticeable spot, appearing on average in 0.51% of the total word count.

# What are the relative frequency distribution differences of the words associated with spam?

```python
fig = (
    df.select(pl.all().exclude("crl.tot"))
    .groupby("yesno")
    .agg(
        pl.all().mean(),
    )
    .to_pandas()
    .apply(lambda x: x / x.sum(), axis=0)
    .melt(id_vars="yesno")
    .rename(columns={"yesno": "spam"})
    .assign(spam=lambda df: df.spam.map({0: "Not a spam", 1: "Spam"}))
    .assign(percentage=lambda x: ["{:.2%}".format(val) for val in x["value"]])
    .pipe(
        lambda df: px.bar(
            df,
            x="variable",
            y="value",
            color="spam",
            labels={
                "value": "Relative frequency of words compared to class",
                "variable": "",
            },
            text="percentage",
            width=700,
            height=600,
        )
    )
)
fig.update_layout(
    showlegend=True, uniformtext_minsize=8, uniformtext_mode="hide", hovermode=False
)
fig.update_traces(
    textposition="inside",
)
fig.update_yaxes(tickformat=".0%")
```

Note
The word “n000” is 36x more likely to appear in a spam email than in a non-spam one.

# What’s the overall correlation?

```python
fig = (
    df.to_pandas()
    .corr()
    .round(2)
    .pipe(
        lambda df: px.imshow(
            df,
            text_auto=True,
            aspect="equal",
            zmin=-1,
            zmax=1,
            color_continuous_scale="RdBu_r",
            width=600,
            height=600,
        )
    )
)
fig.update_xaxes(side="top")
```

Note
Our target variable, “yesno,” exhibits the strongest moderate positive correlation with the occurrence of “n000.”

# What is the worst spam email?

To identify the most significant culprit within the dataset, we’ll sum up the numerical percentage values per email and then sort accordingly.

```python
(
    df.filter(pl.col("yesno") == 1)
    .with_columns(pl.sum_horizontal(pl.all().exclude("crl.tot", "yesno")))
    .sort(by="sum", descending=True)
    .head()
)
```

Note
The most concerning spam email in our dataset shows a total occurrence of 3% for the word “bang,” 9% for “money,” and 4.5% for “make.” Yikes.
# Visualize data with PCA
To enhance our grasp of the data, we can employ Principal Component Analysis (PCA). This technique comes in handy when we’re dealing with multiple independent variables, allowing us to reduce dimensionality and streamline the dataset.
## Visualize all the principal components
```python
X = df.select(pl.all().exclude("yesno")).to_pandas()
y = df.select(pl.col("yesno")).to_pandas()
```
```python
# Before PCA we need to scale and transform our dataset
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)
```
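As a quick sanity check (a minimal sketch, assuming `X` is the pandas DataFrame returned by the transformer above), we can verify that the power-transformed features are roughly standardized:

```python
# Assumes X is the PowerTransformer output from the block above.
# PowerTransformer standardizes by default, so every column should
# end up with mean ~0 and standard deviation ~1.
print(X.mean().round(2))
print(X.std().round(2))
```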
```python
pca = decomposition.PCA()
components = pca.fit_transform(X)

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    opacity=0.2,
    labels=labels,
    dimensions=range(6),
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    width=1000,
    height=1000,
)
fig.update_traces(diagonal_visible=False)
fig.show()
```
Note
The first two principal components together account for about 62% of the variance in the data. The classes separate visibly, and the spam group shows a broader spread, indicating greater diversity.
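If you want to double-check that figure (a small sketch, reusing the fitted `pca` object from the block above), the cumulative explained variance ratio confirms it:

```python
# Reuses the fitted PCA object from the scatter-matrix block above.
# The entry for PC2 should come out at roughly 62%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, c in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {c:.1%}")
```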
## Visualize Loadings
To understand how each feature contributes to the principal components, we can examine the loadings.

```python
pca = decomposition.PCA(n_components=2)
components = pca.fit_transform(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(
    components,
    x=0,
    y=1,
    color=y.squeeze().map({0: "Not a Spam", 1: "Spam"}),
    labels={
        "0": f"PC1 ({pca.explained_variance_ratio_[0]:.0%})",
        "1": f"PC2 ({pca.explained_variance_ratio_[1]:.0%})",
    },
    template="plotly_dark",
    color_discrete_sequence=["red", "green"],
    opacity=0.4,
    width=700,
    height=700,
)

# Draw each feature's loading vector as an annotated arrow
for i, feature in enumerate(X.columns):
    fig.add_annotation(
        ax=0,
        ay=0,
        axref="x",
        ayref="y",
        x=loadings[i, 0],
        y=loadings[i, 1],
        showarrow=True,
        arrowsize=1,
        arrowhead=2,
        xanchor="right",
        yanchor="top",
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0,
        ay=0,
        font=dict(size=20),
        xanchor="left",
        yanchor="bottom",
        text=feature,
        yshift=5,
    )
fig
```
The loadings plot reveals a strong correlation among the variables “crl.tot,” “money,” and “n000.” If we were building a machine learning model for spam recognition, we could keep just one of these variables to streamline the dataset. “bang,” by contrast, is clearly disassociated from this trio: its vector sits at roughly a 90° angle to theirs.
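To put numbers on those angles (a quick sketch, reusing `loadings` and `X` from the plot above), we can compute cosine similarities between the 2-D loading vectors; values near 1 mean the vectors are aligned, values near 0 mean they’re roughly orthogonal:

```python
import pandas as pd

# Reuses loadings (n_features x 2) and X from the loadings plot above.
# Cosine similarity quantifies the angles seen in the plot: ~1 for
# aligned vectors ("crl.tot", "money", "n000"), ~0 for near-90° pairs.
unit = loadings / np.linalg.norm(loadings, axis=1, keepdims=True)
cos_sim = pd.DataFrame(unit @ unit.T, index=X.columns, columns=X.columns)
print(cos_sim.round(2))
```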
Alright folks, our journey into the intriguing realm of spam emails has left us with some valuable insights. It’s clear that words like “bang” and “n000” are key indicators to watch out for. Luckily, modern machine learning models are here to do the hard work on our behalf.
Stay vigilant out there and take care! See you in next week’s adventure! 👋📧🛡️