import sys
from pathlib import Path
sys.path.append(str(Path.cwd()))
Adam Cseresznye
November 23, 2024
In Part 3, we looked at the significance of features in the initial scraped dataset using both the feature_importances_
method of CatBoostRegressor and SHAP values. We conducted feature elimination based on their importance and predictive capability.
In this upcoming section, we’ll implement a robust cross-validation strategy to accurately and consistently evaluate our model’s performance across multiple folds of the data. We will also identify and address potential outliers within our dataset, which is crucial to prevent their influence on the model’s predictions.
Additionally, we’ll further refine and expand our feature engineering efforts by exploring new methodologies to create informative features that increase our model’s predictive capabilities. Looking forward to these steps!
You can explore the project’s app on its website. For more details, visit the GitHub repository.
Check out the series for a deeper dive:
- Part 1: Characterizing the Data
- Part 2: Building a Baseline Model
- Part 3: Feature Selection
- Part 4: Feature Engineering
- Part 5: Fine-Tuning
import gc
import itertools
from pathlib import Path
from typing import List, Optional, Tuple
import catboost
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import clear_output
from lets_plot import *
from lets_plot.mapping import as_discrete
from sklearn import (
cluster,
compose,
ensemble,
impute,
metrics,
model_selection,
neighbors,
pipeline,
preprocessing,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm.notebook import tqdm
from helper import feature_engineering, pre_process, train_model, utils
LetsPlot.setup_html()
Drawing from our findings in Part 3, particularly our initial feature reduction efforts, we've developed a preliminary preparation function, prepare_data, which resides in the pre_process.py
file so that it can be reused (we will extend it into prepare_data_for_modelling
at the end of this article). The function performs the essential preprocessing steps: it shuffles the rows, converts the price column to its base-10 logarithm, fills missing categorical values with "missing value", and separates the features (X) from the target (y).
Let's dive into the details of this function and prepare our X and y for the subsequent processing pipeline.
def prepare_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
"""
Prepare data for machine learning modeling.
This function takes a DataFrame and prepares it for machine learning by performing the following steps:
1. Randomly shuffles the rows of the DataFrame.
2. Converts the 'price' column to the base 10 logarithm.
3. Fills missing values in categorical variables with 'missing value'.
4. Separates the features (X) and the target (y).
Parameters:
- df (pd.DataFrame): The input DataFrame containing the dataset.
Returns:
- Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).
Example use case:
# Load your dataset into a DataFrame (e.g., df)
df = load_data()
# Prepare the data for modeling
X, y = prepare_data(df)
# Now you can use X and y for machine learning tasks.
"""
processed_df = (
df.sample(frac=1, random_state=utils.Configuration.seed)
.reset_index(drop=True)
.assign(price=lambda df: np.log10(df.price))
)
# Fill missing categorical variables with "missing value"
for col in processed_df.columns:
if processed_df[col].dtype.name in ("bool", "object", "category"):
processed_df[col] = processed_df[col].fillna("missing value")
# Separate features (X) and target (y)
X = processed_df.loc[:, utils.Configuration.features_to_keep_v1]
y = processed_df[utils.Configuration.target_col]
print(f"Shape of X and y: {X.shape}, {y.shape}")
return X, y
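For reference, the cell that actually builds X and y is not shown in the rendered output. A minimal sketch, assuming the processed listings are stored as a parquet file under the project's interim data path (the file name below is hypothetical):
# Hypothetical loading step; the exact file name is an assumption
df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath("processed_listings.parquet.gzip")
)
X, y = prepare_data(df)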
Our next critical step is to establish a well-structured cross-validation strategy. This step is imperative as it enables us to assess the effectiveness of various feature engineering approaches without risking overfitting our model. A robust cross-validation strategy ensures that our model’s performance evaluations are reliable and that the insights gained are generalizable to new data. To accomplish this, we will employ RepeatedKFold
validation, setting n_splits to 10 and n_repeats to 1.
In essence, this configuration means we perform 10-fold cross-validation a single time. Importantly, thanks to the modular design of this function, we retain the flexibility to easily adapt the cross-validation strategy as needed.
def run_catboost_CV(
X: pd.DataFrame,
y: pd.Series,
n_splits: int = 10,
n_repeats: int = 1,
pipeline: Optional[object] = None,
) -> Tuple[float, float]:
"""
Perform Cross-Validation with CatBoost for regression.
This function conducts Cross-Validation using CatBoost for regression tasks. It iterates
through folds, trains CatBoost models, and computes the mean and standard deviation of the
Root Mean Squared Error (RMSE) scores across folds.
Parameters:
- X (pd.DataFrame): The feature matrix.
- y (pd.Series): The target variable.
- n_splits (int, optional): The number of splits in K-Fold cross-validation.
Defaults to 10.
- n_repeats (int, optional): The number of times the K-Fold cross-validation is repeated.
Defaults to 1.
- pipeline (object, optional): Optional data preprocessing pipeline. If provided,
it's applied to the data before training the model. Defaults to None.
Returns:
- Tuple[float, float]: A tuple containing the mean RMSE and standard deviation of RMSE
scores across cross-validation folds.
Example:
# Load your feature matrix (X) and target variable (y)
X, y = load_data()
# Perform Cross-Validation with CatBoost
mean_rmse, std_rmse = run_catboost_CV(X, y, n_splits=5, n_repeats=2, pipeline=data_pipeline)
print(f"Mean RMSE: {mean_rmse:.4f}")
print(f"Standard Deviation of RMSE: {std_rmse:.4f}")
Notes:
- Ensure that the input data `X` and `y` are properly preprocessed and do not contain any
missing values.
- The function uses CatBoost for regression with optional data preprocessing via the `pipeline`.
- RMSE is a common metric for regression tasks, and lower values indicate better model
performance.
"""
results = []
# Extract feature names and data types
features = X.columns[~X.columns.str.contains("price")]
numerical_features = X.select_dtypes("number").columns.to_list()
categorical_features = X.select_dtypes("object").columns.to_list()
# Create a K-Fold cross-validator
CV = model_selection.RepeatedKFold(
n_splits=n_splits, n_repeats=n_repeats, random_state=utils.Configuration.seed
)
for train_fold_index, val_fold_index in tqdm(CV.split(X)):
X_train_fold, X_val_fold = X.loc[train_fold_index], X.loc[val_fold_index]
y_train_fold, y_val_fold = y.loc[train_fold_index], y.loc[val_fold_index]
# Apply optional data preprocessing pipeline
if pipeline is not None:
X_train_fold = pipeline.fit_transform(X_train_fold)
X_val_fold = pipeline.transform(X_val_fold)
# Create CatBoost datasets
catboost_train = catboost.Pool(
X_train_fold,
y_train_fold,
cat_features=categorical_features,
)
catboost_valid = catboost.Pool(
X_val_fold,
y_val_fold,
cat_features=categorical_features,
)
# Initialize and train the CatBoost model
model = catboost.CatBoostRegressor(**utils.Configuration.catboost_params)
model.fit(
catboost_train,
eval_set=[catboost_valid],
early_stopping_rounds=utils.Configuration.early_stopping_round,
verbose=utils.Configuration.verbose,
use_best_model=True,
)
# Calculate OOF validation predictions
valid_pred = model.predict(X_val_fold)
RMSE_score = metrics.root_mean_squared_error(y_val_fold, valid_pred)
results.append(RMSE_score)
return np.mean(results), np.std(results)
Now, let’s proceed to train our model with the updated settings:
Note that we've reduced the number of iterations in Notebook 6 compared to Notebook 5 to keep the training duration manageable: Notebook 5 used iterations = 1000 with the default learning rate of 0.03, whereas Notebook 6 uses iterations = 100 with a learning rate of 0.2.
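The training call itself is omitted from the rendered output; it was presumably along these lines, reusing the run_catboost_CV helper defined above (a sketch, not the author's exact cell):
# Sketch of the baseline cross-validation run
mean_rmse, std_rmse = train_model.run_catboost_CV(X, y, n_splits=10, n_repeats=1)
print(f"Mean RMSE across folds: {mean_rmse:.4f} (+/- {std_rmse:.4f})")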
The performance of the baseline model: 0.1125
An outlier is a data point that significantly differs from the rest of the data. One common way to define an outlier is a data point that falls more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. Detecting and removing outliers from the dataset is crucial for building a stable model that can effectively generalize to new data.
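As a quick illustration of the 1.5 x IQR rule applied to a single column (note that below we will rely on LocalOutlierFactor rather than this rule; the choice of living_area here is just an example):
# Illustrative IQR-based outlier flag for one numerical column (not the method used later)
q1, q3 = X["living_area"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outlier_flag = (X["living_area"] < q1 - 1.5 * iqr) | (X["living_area"] > q3 + 1.5 * iqr)
print(f"Rows flagged by the IQR rule: {iqr_outlier_flag.sum()}")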
When we create a scatter plot of our features (as shown in Figure 1), such as cadastral income against living area, and scale the points' color and size by price, we can identify at least two data points that deviate markedly from the expected range of values. One suggests that a 300 m2 property has a cadastral income exceeding 320,000 EUR, while the other indicates a cadastral income of 2,500 EUR for an 11,000 m2 property. Both observations seem implausible when compared to the majority of data points on the graph.
pd.concat([X, y], axis=1).pipe(
lambda df: ggplot(
df, aes("cadastral_income", "living_area", fill="price", size="price")
)
+ geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
+ scale_fill_continuous(low="#1a9641", high="#d7191c")
+ labs(
title="Assessing Potential Outliers",
subtitle=""" Outliers pose a challenge for gradient boosting methods since boosting constructs each tree based on the errors of the previous trees.
Outliers, having significantly larger errors than non-outliers, can excessively divert the model's attention toward these data points.
""",
x="Cadastral income (EUR)",
y="Living area (m2)",
)
+ theme(
plot_subtitle=element_text(
size=12, face="italic"
), # Customize subtitle appearance
plot_title=element_text(size=15, face="bold"), # Customize title appearance
)
+ ggsize(800, 600)
)
For identifying potential outliers within our data, we can employ Scikit-learn
’s LocalOutlierFactor
. This algorithm, known as the Local Outlier Factor (LOF), is an unsupervised technique for anomaly detection. It assesses the local density deviation of a data point relative to its neighboring points. LOF identifies outliers as those data points demonstrating notably lower density in comparison to their neighbors.
In the provided code, we’ve created a function called identify_outliers
. This function generates a mask that we can use to filter out data points potentially flagged as outliers.
def identify_outliers(df: pd.DataFrame) -> pd.Series:
"""
Identify outliers in a DataFrame.
This function uses a Local Outlier Factor (LOF) algorithm to identify outliers in a given
DataFrame. It operates on both numerical and categorical features, and it returns a Boolean
Series where `True` marks a non-outlier (a row to keep) and `False` marks an outlier.
Parameters:
- df (pd.DataFrame): The input DataFrame containing features for outlier identification.
Returns:
- pd.Series: A Boolean Series that is `True` for non-outliers (rows to keep) and `False` for outliers.
Example:
# Load your DataFrame with features (df)
df = load_data()
# Identify outliers using the function
outlier_mask = identify_outliers(df)
# Use the outlier mask to filter your DataFrame
filtered_df = df[outlier_mask] # Keep non-outliers
Notes:
- The function uses Local Outlier Factor (LOF) with default parameters for identifying outliers.
- Numerical features are imputed with their median values; categorical features are one-hot encoded
(and the encoded columns are likewise imputed with median values).
- The resulting Boolean Series is `True` for non-outliers and `False` for outliers.
"""
# Extract numerical and categorical feature names
NUMERICAL_FEATURES = df.select_dtypes("number").columns.tolist()
CATEGORICAL_FEATURES = df.select_dtypes("object").columns.tolist()
# Define transformers for preprocessing
numeric_transformer = pipeline.Pipeline(
steps=[("imputer", impute.SimpleImputer(strategy="median"))]
)
categorical_transformer = pipeline.Pipeline(
steps=[
("encoder", preprocessing.OneHotEncoder(handle_unknown="ignore")),
("imputer", impute.SimpleImputer(strategy="median")),
]
)
# Create a ColumnTransformer to handle both numerical and categorical features
preprocessor = compose.ColumnTransformer(
transformers=[
("num", numeric_transformer, NUMERICAL_FEATURES),
("cat", categorical_transformer, CATEGORICAL_FEATURES),
]
)
# Initialize the LocalOutlierFactor model
clf = neighbors.LocalOutlierFactor()
# Fit LOF to preprocessed data and make predictions
y_pred = clf.fit_predict(preprocessor.fit_transform(df))
# LOF labels outliers as -1; build a mask that is True for rows to keep (non-outliers)
y_pred_adjusted = [1 if x == -1 else 0 for x in y_pred]
outlier_mask = pd.Series(y_pred_adjusted) == 0
return outlier_mask
As a comparison, here’s the scatter plot after removing outliers. It appears that the LocalOutlierFactor method was effective in addressing the outlier data points.
outlier_mask = pre_process.identify_outliers(X)
X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)
(
pd.concat([X_wo_outliers, y_wo_outliers], axis=1)
# .loc[lambda df: pre_process.identify_outliers(df.loc[:, :"living_area"])]
.pipe(
lambda df: ggplot(
df, aes("cadastral_income", "living_area", fill="price", size="price")
)
+ geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
+ scale_fill_continuous(low="#1a9641", high="#d7191c")
+ labs(
title="Assessing Potential Outliers",
subtitle=""" By employing the default parameters of LocalOutlierFactor, we've reduced our training set from 3660 instances to 3427.
This is expected to enhance our model's performance and its ability to generalize well to new data.
""",
x="Cadastral income (EUR)",
y="Living area (m2)",
)
+ theme(
plot_subtitle=element_text(
size=12, face="italic"
), # Customize subtitle appearance
plot_title=element_text(size=15, face="bold"), # Customize title appearance
)
+ ggsize(800, 600)
)
)
Now, let’s assess whether our efforts to improve the model by addressing outliers have enhanced its predictive capabilities:
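The evaluation call is again not shown in the rendered output; it was presumably something like the following sketch:
# Sketch: re-run the same cross-validation on the outlier-filtered data
train_model.run_catboost_CV(X_wo_outliers, y_wo_outliers)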
(0.11052917780860605, 0.004569457889717371)
By removing the outliers, our cross-validation RMSE score decreased from 0.1125 to 0.1105.
Feature engineering is vital in machine learning as it directly influences a model’s performance and predictive capabilities. By crafting and selecting relevant features, it allows the model to capture meaningful patterns and relationships within the data. Effective feature engineering helps improve model accuracy, enhances its ability to generalize to new data, and enables the extraction of valuable insights, ultimately driving the success and efficacy of machine learning algorithms.
Feature engineering ideas we will test in this section:
- Utilize categorical columns for grouping and transform each numerical variable based on the median.
- Generate bins from the continuous variables and apply the same process as described above.
- Introduce polynomial features, either individually with a single feature or in combinations of two features.
- Form clusters of instances using k-means clustering to capture data similarities and use these clusters as additional features.
- Implement other ideas derived from empirical observations or assumptions.
The idea behind this feature engineering step is to leverage categorical columns as grouping criteria and then calculate the median value for each numerical variable within each group. By doing so, it aims to create new features that capture the central tendency of the numerical data for different categories, allowing the model to better understand and utilize the inherent patterns and variations within the data.
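Before grouping, it helps to look at the cardinality of the categorical columns. The counts below were presumably produced by something along these lines (whether the outlier-filtered frame was used is an assumption):
# Sketch: number of unique values per categorical column
X_wo_outliers.select_dtypes("object").nunique()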
state 9
kitchen_type 9
street 456
building_condition 7
city 230
dtype: int64
def FE_categorical_transform(
X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
"""
Feature Engineering: Transform categorical features using CatBoost Cross-Validation.
This function performs feature engineering by transforming categorical features using CatBoost
Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
Squared Error (RMSE) scores for various combinations of categorical and numerical features.
Parameters:
- X (pd.DataFrame): The input DataFrame containing both categorical and numerical features.
- y (pd.Series): The target variable.
- transform_type (str, optional): The pandas groupby aggregation to apply, such as "mean"
or "median". Defaults to "mean".
Returns:
- pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "categorical," and "numerical,"
sorted by "mean_OOFs" in ascending order.
Example:
# Load your DataFrame with features (X)
X = load_data()
# Perform feature engineering
result_df = FE_categorical_transform(X, y, transform_type="mean")
# View the DataFrame with sorted results
print(result_df.head())
Notes:
- This function uses CatBoost Cross-Validation to assess the quality of transformations for
various combinations of categorical and numerical features.
- The resulting DataFrame provides insights into the effectiveness of different transformations.
- Feature engineering can help improve the performance of machine learning models.
"""
# Initialize a list to store results
results = []
# Get a list of categorical and numerical columns
categorical_columns = X.select_dtypes("object").columns
numerical_columns = X.select_dtypes("number").columns
# Combine the loops to have a single progress bar
for categorical in tqdm(categorical_columns, desc="Progress"):
for numerical in tqdm(numerical_columns):
# Create a deep copy of the input data
temp = X.copy(deep=True)
# Calculate the transformation for each group within the categorical column
temp["new_column"] = temp.groupby(categorical)[numerical].transform(
transform_type
)
# Run CatBoost Cross-Validation with the transformed data
mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)
# Store the results as a tuple
result = (mean_OOF, std_OOF, categorical, numerical)
results.append(result)
del temp, mean_OOF, std_OOF
# Create a DataFrame from the results and sort it by mean OOF scores
result_df = pd.DataFrame(
results, columns=["mean_OOFs", "std_OOFs", "categorical", "numerical"]
)
result_df = result_df.sort_values(by="mean_OOFs")
return result_df
Please bear in mind that these feature engineering steps were precomputed because of the considerable computational time they require. The outcomes were saved beforehand and are loaded here rather than recomputed while rendering the notebook; the results themselves remain unchanged.
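The skipped cell presumably mirrored the pattern used for the other experiments below, roughly as follows (the exact file name is an assumption):
%%script echo skipping
FE_categorical_transform_mean = feature_engineering.FE_categorical_transform(
    X_wo_outliers, y_wo_outliers
)
FE_categorical_transform_mean.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_categorical_transform_mean")}.parquet.gzip',
    compression="gzip",
)
FE_categorical_transform_mean.head(15)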
As evident, the best result was obtained by treating the city feature as a categorical variable and calculating the median of cadastral_income based on this categorization. This result aligns logically with the feature importances seen in Part 3.
 | mean_OOFs | std_OOFs | categorical | numerical
---|---|---|---|---
53 | 0.108973 | 0.006262 | city | cadastral_income |
39 | 0.108985 | 0.004980 | building_condition | yearly_theoretical_total_energy_consumption |
33 | 0.109381 | 0.005434 | building_condition | bedrooms |
31 | 0.109478 | 0.004887 | street | cadastral_income |
43 | 0.109540 | 0.004944 | building_condition | living_area |
The idea behind this feature engineering step is to discretize continuous variables by creating bins or categories from their values. These bins then serve as categorical columns. By using these new categorical columns for grouping, we can transform each numerical variable by replacing its values with the median of the respective category it belongs to, just like the feature engineering method we demonstrated above.
def FE_continuous_transform(
X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
"""
Feature Engineering: Transform continuous features using CatBoost Cross-Validation.
This function performs feature engineering by transforming continuous features using CatBoost
Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
Squared Error (RMSE) scores for various combinations of discretized and transformed continuous
features.
Parameters:
- X (pd.DataFrame): The input DataFrame containing both continuous and categorical features.
- y (pd.Series): The target variable for prediction.
- transform_type (str, optional): The pandas groupby aggregation to apply, such as "mean"
or "median". Defaults to "mean".
Returns:
- pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "discretized_continuous,"
and "transformed_continuous," sorted by "mean_OOFs" in ascending order.
Example:
# Load your DataFrame with features (X) and target variable (y)
X, y = load_data()
# Perform feature engineering
result_df = FE_continuous_transform(X, y, transform_type="mean")
# View the DataFrame with sorted results
print(result_df.head())
Notes:
- This function uses CatBoost Cross-Validation to assess the quality of transformations for
various combinations of discretized and transformed continuous features.
- The number of bins for discretization is determined using Sturges' rule.
- The resulting DataFrame provides insights into the effectiveness of different transformations.
- Feature engineering can help improve the performance of machine learning models.
"""
# Initialize a list to store results
results = []
# Get a list of continuous (numerical) columns
continuous_columns = X.select_dtypes("number").columns
# Number of bins according to Sturges' rule
optimal_bins = int(np.floor(np.log2(X.shape[0])) + 1)
# Combine the loops to have a single progress bar
for discretized_continuous in tqdm(continuous_columns, desc="Progress:"):
for transformed_continuous in tqdm(continuous_columns):
if discretized_continuous != transformed_continuous:
# Create a deep copy of the input data
temp = X.copy(deep=True)
discretizer = pipeline.Pipeline(
steps=[
("imputer", impute.SimpleImputer(strategy="median")),
(
"add_bins",
preprocessing.KBinsDiscretizer(
encode="ordinal", n_bins=optimal_bins
),
),
]
)
temp[discretized_continuous] = discretizer.fit_transform(
X[[discretized_continuous]]
)
# Calculate the transformation for each group within the categorical column
temp["new_column"] = temp.groupby(discretized_continuous)[
transformed_continuous
].transform(transform_type)
# Run CatBoost Cross-Validation with the transformed data
mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)
# Store the results as a tuple
result = (
mean_OOF,
std_OOF,
discretized_continuous,
transformed_continuous,
)
results.append(result)
del temp, mean_OOF, std_OOF
# Create a DataFrame from the results and sort it by mean OOF scores
result_df = pd.DataFrame(
results,
columns=[
"mean_OOFs",
"std_OOFs",
"discretized_continuous",
"transformed_continuous",
],
)
result_df = result_df.sort_values(by="mean_OOFs")
return result_df
%%script echo skipping
FE_continuous_transform_mean = feature_engineering.FE_continuous_transform(
X_wo_outliers, y_wo_outliers
)
FE_continuous_transform_mean.to_parquet(
f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_continuous_transform_mean")}.parquet.gzip',
compression="gzip",
)
FE_continuous_transform_mean.head(15)
This approach was not as effective as our prior method. However, combining bathrooms with yearly_theoretical_total_energy_consumption yielded the best outcome.
 | mean_OOFs | std_OOFs | discretized_continuous | transformed_continuous
---|---|---|---|---
55 | 0.109328 | 0.005094 | bathrooms | yearly_theoretical_total_energy_consumption |
59 | 0.109328 | 0.005094 | bathrooms | living_area |
58 | 0.109328 | 0.005094 | bathrooms | cadastral_income |
57 | 0.109328 | 0.005094 | bathrooms | lat |
56 | 0.109328 | 0.005094 | bathrooms | surface_of_the_plot |
50 | 0.109328 | 0.005094 | bathrooms | bedrooms |
52 | 0.109328 | 0.005094 | bathrooms | toilets |
39 | 0.109417 | 0.004831 | lng | living_area |
1 | 0.109426 | 0.004587 | bedrooms | toilets |
4 | 0.109426 | 0.004587 | bedrooms | bathrooms |
The idea behind introducing polynomial features is to capture non-linear relationships within the data. By raising individual features to higher powers or considering interactions between pairs of features, this step allows the model to better represent complex patterns that cannot be adequately expressed with linear relationships alone. It enhances the model’s ability to learn and predict outcomes that exhibit curvilinear or interactive behavior.
def FE_polynomial_features(
X: pd.DataFrame, y: pd.Series, combinations: int = 1
) -> pd.DataFrame:
"""
Generate polynomial features for combinations of numerical columns and train a CatBoost model.
Parameters:
X (pd.DataFrame): The input DataFrame with features.
y (pd.Series): The target variable.
combinations (int, optional): The number of combinations of numerical columns. Default is 1.
Returns:
pd.DataFrame: A DataFrame containing results sorted by mean OOF scores.
Example:
X_wo_outliers = pd.DataFrame(...) # Your input data
y_wo_outliers = pd.Series(...) # Your target variable
result = FE_polynomial_features(X_wo_outliers, y_wo_outliers)
Transformations:
- Imputes missing values in numerical columns using the median.
- Generates polynomial features, including interaction terms, for selected numerical columns.
- Trains a CatBoost model and calculates mean and standard deviation of out-of-fold (OOF) scores.
"""
# Initialize a list to store results
results = []
# Get a list of continuous and numerical columns
numerical_columns = X.select_dtypes("number").columns
# Combine the loops to have a single progress bar
for numerical_col in tqdm(
list(itertools.combinations(numerical_columns, r=combinations))
):
polyfeatures = compose.make_column_transformer(
(
pipeline.make_pipeline(
impute.SimpleImputer(strategy="median"),
preprocessing.PolynomialFeatures(interaction_only=False),
),
list(numerical_col),
),
remainder="passthrough",
).set_output(transform="pandas")
temp = polyfeatures.fit_transform(X)
mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)
# Store the results as a tuple
result = (
mean_OOF,
std_OOF,
numerical_col,
)
results.append(result)
del temp, mean_OOF, std_OOF
# Create a DataFrame from the results and sort it by mean OOF scores
result_df = pd.DataFrame(
results,
columns=[
"mean_OOFs",
"std_OOFs",
"numerical_col",
],
)
result_df = result_df.sort_values(by="mean_OOFs")
return result_df
Let’s see the impact of applying polynomial feature engineering to a single feature.
%%script echo skipping
FE_polynomial_features_combinations_1 = FE_polynomial_features(
X_wo_outliers, y_wo_outliers
)
FE_polynomial_features_combinations_1.to_parquet(
f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_1")}.parquet.gzip',
compression="gzip",
)
FE_polynomial_features_combinations_1.head(15)
 | mean_OOFs | std_OOFs | numerical_col
---|---|---|---
10 | 0.109938 | 0.005047 | [living_area] |
9 | 0.110339 | 0.004012 | [cadastral_income] |
3 | 0.110628 | 0.004018 | [lng] |
0 | 0.111066 | 0.004765 | [bedrooms] |
8 | 0.111099 | 0.005039 | [lat] |
4 | 0.111166 | 0.004879 | [primary_energy_consumption] |
6 | 0.111271 | 0.004908 | [yearly_theoretical_total_energy_consumption] |
1 | 0.111276 | 0.005359 | [number_of_frontages] |
7 | 0.111332 | 0.004815 | [surface_of_the_plot] |
2 | 0.111782 | 0.004741 | [toilets] |
How about two features combined…
%%script echo skipping
FE_polynomial_features_combinations_2 = FE_polynomial_features(
X_wo_outliers, y_wo_outliers, combinations=2
)
FE_polynomial_features_combinations_2.to_parquet(
f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_2")}.parquet.gzip',
compression="gzip",
)
FE_polynomial_features_combinations_2.head(15)
 | mean_OOFs | std_OOFs | numerical_col
---|---|---|---
31 | 0.109413 | 0.004690 | [lng, lat] |
9 | 0.109625 | 0.005558 | [bedrooms, living_area] |
53 | 0.109809 | 0.005321 | [lat, living_area] |
52 | 0.109814 | 0.003485 | [lat, cadastral_income] |
7 | 0.109847 | 0.005399 | [bedrooms, lat] |
19 | 0.109962 | 0.004999 | [toilets, lng] |
17 | 0.110057 | 0.004554 | [number_of_frontages, cadastral_income] |
42 | 0.110124 | 0.004511 | [bathrooms, lat] |
46 | 0.110128 | 0.004944 | [yearly_theoretical_total_energy_consumption, ... |
28 | 0.110154 | 0.004644 | [lng, bathrooms] |
The idea behind using k-means clustering in feature engineering is to group data points into clusters based on their similarity. By doing so, we create a new set of features that represent these clusters, which can capture patterns or relationships within the data that might be less apparent in the original features. These cluster features can be valuable for machine learning models, as they provide a more compact and informative representation of the data, potentially improving predictive performance.
class FeatureSelector(BaseEstimator, TransformerMixin):
"""
A transformer for selecting specific columns from a DataFrame.
This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
It overrides the fit and transform methods from the parent classes.
Attributes:
feature_names_in_ (list): The names of the features to select.
n_features_in_ (int): The number of features to select.
Methods:
fit(X, y=None): Fit the transformer. Returns self.
transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
"""
def __init__(self, feature_names_in_):
"""
Constructs all the necessary attributes for the FeatureSelector object.
Args:
feature_names_in_ (list): The names of the features to select.
"""
self.feature_names_in_ = feature_names_in_
self.n_features_in_ = len(feature_names_in_)
def fit(self, X, y=None):
"""
Fit the transformer. This method doesn't do anything as no fitting is necessary.
Args:
X (DataFrame): The input data.
y (array-like, optional): The target variable. Defaults to None.
Returns:
self: The instance itself.
"""
return self
def transform(self, X, y=None):
"""
Apply the transformation. Selects the features from the input data.
Args:
X (DataFrame): The input data.
y (array-like, optional): The target variable. Defaults to None.
Returns:
DataFrame: A DataFrame with only the selected features.
"""
return X.loc[:, self.feature_names_in_].copy(deep=True)
def FE_KMeans(
X: pd.DataFrame,
y: pd.Series,
n_clusters_min: int = 1,
n_clusters_max: int = 8,
) -> pd.DataFrame:
"""Performs K-Means clustering-based feature engineering followed by model training.
Args:
X (pd.DataFrame): The input feature matrix.
y (pd.Series): The target variable.
n_clusters_min (int, optional): The minimum number of clusters to consider. Defaults to 1.
n_clusters_max (int, optional): The maximum number of clusters to consider. Defaults to 8.
Returns:
pd.DataFrame: A DataFrame containing the results of feature engineering with K-Means clustering.
Example:
>>> results_df = FE_KMeans(X_wo_outliers, y_wo_outliers)
"""
# Initialize a list to store results
results = []
# Get a list of continuous and numerical columns
numerical_columns = X.head().select_dtypes("number").columns.to_list()
categorical_columns = X.head().select_dtypes("object").columns.to_list()
for n_cluster in tqdm(range(n_clusters_min, n_clusters_max)):
# Prepare pipelines for corresponding columns:
numerical_pipeline = pipeline.Pipeline(
steps=[
("num_selector", FeatureSelector(numerical_columns)),
("imputer", impute.SimpleImputer(strategy="median")),
]
)
categorical_pipeline = pipeline.Pipeline(
steps=[
("cat_selector", FeatureSelector(categorical_columns)),
("imputer", impute.SimpleImputer(strategy="most_frequent")),
(
"onehot",
preprocessing.OneHotEncoder(
handle_unknown="ignore", sparse_output=False
),
),
]
)
# Put all the pipelines inside a FeatureUnion:
data_preprocessing_pipeline = pipeline.FeatureUnion(
n_jobs=-1,
transformer_list=[
("numerical_pipeline", numerical_pipeline),
("categorical_pipeline", categorical_pipeline),
],
)
temp = pd.DataFrame(data_preprocessing_pipeline.fit_transform(X))
KMeans = cluster.KMeans(n_init=10, n_clusters=n_cluster)
KMeans.fit_transform(temp)
groups = pd.Series(KMeans.labels_, name="groups")
concatanated_df = pd.concat([temp, groups], axis="columns")
mean_OOF, std_OOF = train_model.run_catboost_CV(X=concatanated_df, y=y)
# Store the results as a tuple
result = (
mean_OOF,
std_OOF,
n_cluster,
)
results.append(result)
del temp, mean_OOF, std_OOF, KMeans, groups, concatanated_df, result
# Create a DataFrame from the results and sort it by mean OOF scores
result_df = pd.DataFrame(
results,
columns=[
"mean_OOFs",
"std_OOFs",
"n_cluster",
],
)
result_df = result_df.sort_values(by="mean_OOFs")
return result_df
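As with the other experiments, the k-means run was precomputed and skipped during rendering; it presumably looked roughly like this (the cluster range is inferred from the results table below, and the file name is an assumption):
%%script echo skipping
FE_KMeans_results = FE_KMeans(X_wo_outliers, y_wo_outliers, n_clusters_min=1, n_clusters_max=100)
FE_KMeans_results.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KMeans_results")}.parquet.gzip',
    compression="gzip",
)
FE_KMeans_results.head(10)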
As k-means clustering is an unsupervised algorithm, determining an appropriate number of clusters requires testing various values and assessing their impact on the validation score. As observed, this approach didn't yield meaningful gains in our case: the best validation score was obtained with a single cluster (n_cluster = 1), which amounts to no clustering at all.
 | mean_OOFs | std_OOFs | n_cluster
---|---|---|---
0 | 0.111884 | 0.005143 | 1 |
1 | 0.112153 | 0.006174 | 2 |
5 | 0.112264 | 0.005858 | 6 |
24 | 0.112313 | 0.005602 | 25 |
7 | 0.112326 | 0.005124 | 8 |
43 | 0.112333 | 0.005819 | 44 |
31 | 0.112352 | 0.005588 | 32 |
11 | 0.112366 | 0.005206 | 12 |
95 | 0.112454 | 0.006621 | 96 |
26 | 0.112472 | 0.006096 | 27 |
Though new features can be generated through systematic methods, domain knowledge can also inspire their creation. The idea behind this is to allow for the incorporation of unconventional or domain-specific insights that may not fit standard feature engineering techniques. It encourages the exploration of novel features or transformations based on practical experiences or theoretical assumptions to potentially uncover hidden patterns or relationships within the data. This open-ended approach can lead to creative and tailored feature engineering solutions.
Here are some ideas to consider:
def FE_ideas(X):
"""Performs additional feature engineering on the input DataFrame.
Args:
X (pd.DataFrame): The input DataFrame containing the original features.
Returns:
pd.DataFrame: A DataFrame with additional engineered features.
Example:
>>> engineered_data = FE_ideas(original_data)
"""
temp = X.assign(
energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
/ df.primary_energy_consumption,
energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
total_bathrooms=lambda df: df.toilets + df.bathrooms,
total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
bargain_1=lambda df: df.cadastral_income / df.bedrooms,
bargain_2=lambda df: df.cadastral_income / df.living_area,
)
return temp.loc[:, "energy_efficiency_1":]
def FE_try_ideas(
X: pd.DataFrame,
y: pd.Series,
) -> pd.DataFrame:
"""Performs feature engineering experiments by adding new features and evaluating their impact on model performance.
Args:
X (pd.DataFrame): The input feature matrix.
y (pd.Series): The target variable.
Returns:
pd.DataFrame: A DataFrame containing the results of feature engineering experiments.
Example:
>>> results_df = FE_try_ideas(X, y)
"""
# Initialize a list to store results
results = []
# Get a list of continuous and numerical columns
numerical_columns = X.select_dtypes("number").columns
# Apply additional feature engineering ideas
feature_df = FE_ideas(X)
for feature in tqdm(feature_df.columns):
# Concatenate the original features with the newly engineered feature
temp = pd.concat([X, feature_df[feature]], axis="columns")
# Train the model with the augmented features and get the mean and standard deviation of OOF scores
mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)
# Store the results as a tuple
result = (
mean_OOF,
std_OOF,
feature,
)
results.append(result)
del temp, mean_OOF, std_OOF
# Create a DataFrame from the results and sort it by mean OOF scores
result_df = pd.DataFrame(
results,
columns=[
"mean_OOFs",
"std_OOFs",
"feature",
],
)
result_df = result_df.sort_values(by="mean_OOFs")
return result_df
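This experiment, too, was precomputed; the skipped cell was presumably along these lines (the file name is an assumption):
%%script echo skipping
FE_try_ideas_results = FE_try_ideas(X_wo_outliers, y_wo_outliers)
FE_try_ideas_results.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas_results")}.parquet.gzip',
    compression="gzip",
)
FE_try_ideas_results.head(10)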
As can be seen below, the best feature this time was spaciousness_1
, representing df.living_area
divided by df.surface_of_the_plot
.
 | mean_OOFs | std_OOFs | feature
---|---|---|---
4 | 0.109305 | 0.004652 | spaciousness_1 |
7 | 0.109558 | 0.004372 | bargain_2 |
3 | 0.109969 | 0.005117 | total_number_rooms |
6 | 0.109976 | 0.004303 | bargain_1 |
2 | 0.110545 | 0.004884 | total_bathrooms |
5 | 0.110603 | 0.004715 | spaciousness_2 |
1 | 0.110666 | 0.005749 | energy_efficiency_2 |
0 | 0.110722 | 0.005120 | energy_efficiency_1 |
The initial model achieved the best mean out-of-folds score of 0.1107. However, we made modifications to expedite training by reducing iterations to 100 and increasing the learning rate to 0.2, resulting in a new baseline model with a score of 0.1105 after outlier removal. This serves as our reference point to assess the impact of various feature engineering techniques.
Subsequent feature engineering approaches, including using categorical columns for groupby-and-transform operations, creating bins from continuous data, and implementing the other ideas, led to marginal score improvements, with the lowest at 0.1089. Polynomial features with n=2 and n=1 produced slightly higher (i.e., slightly worse) scores of 0.1094 and 0.1099, respectively.
Now, we will proceed to assess the efficacy of two of the best approaches, namely: utilizing categorical columns for groupby and transformation and implementing additional ideas. We will conduct this evaluation using CatBoost’s built-in select_features
as outlined in part 3. Let’s dive in…
Condition | Best mean OOFs | std OOFs |
---|---|---|
Original | 0.1107 | NA |
Use categorical columns for groupby/transform | 0.1089 | 0.0062 |
Create bins from continuous data and use groupby/transform | 0.1093 | 0.0050 |
Implementing the rest of the ideas | 0.1093 | 0.0046 |
Polynomial features (n=2) | 0.1094 | 0.0046 |
Polynomial features (n=1) | 0.1099 | 0.0050 |
After Outlier filter | 0.1105 | 0.0045 |
k-means clustering | 0.1118 | 0.0051 |
Sped up version | 0.1125 | 0.0044 |
def prepare_df_for_final_feature_selection(X):
return X.assign(
city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
"median"
),
building_condition_group=lambda df: df.groupby("building_condition")[
"yearly_theoretical_total_energy_consumption"
].transform("median"),
energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
/ df.primary_energy_consumption,
energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
total_bathrooms=lambda df: df.toilets + df.bathrooms,
total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
bargain_1=lambda df: df.cadastral_income / df.bedrooms,
bargain_2=lambda df: df.cadastral_income / df.living_area,
)
X_final_feature_selection = prepare_df_for_final_feature_selection(X_wo_outliers)
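The feature-selection call below relies on X_train, X_val, y_train, and y_val, whose creation is not shown in the rendered output; a plausible split (the test_size value is an assumption) would be:
# Hypothetical train/validation split feeding CatBoost's select_features
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_final_feature_selection,
    y_wo_outliers,
    test_size=0.2,
    random_state=utils.Configuration.seed,
)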
regressor = catboost.CatBoostRegressor(
iterations=1000,
cat_features=X_final_feature_selection.select_dtypes("object").columns.to_list(),
random_seed=utils.Configuration.seed,
loss_function="RMSE",
)
rfe_dict = regressor.select_features(
algorithm="RecursiveByShapValues",
shap_calc_type="Exact",
X=X_train,
y=y_train,
eval_set=(X_val, y_val),
features_for_select="0-25",
num_features_to_select=1,
steps=20,
verbose=2000,
train_final_model=False,
plot=False,
)
Learning rate set to 0.059655
Step #1 out of 20
0: learn: 0.3071998 test: 0.3029816 best: 0.3029816 (0) total: 13.6ms remaining: 13.6s
999: learn: 0.0482570 test: 0.1076466 best: 0.1075005 (964) total: 14.9s remaining: 0us
bestTest = 0.1075004858
bestIteration = 964
Shrink model to first 965 iterations.
Feature #21 eliminated
Feature #20 eliminated
Feature #23 eliminated
Feature #2 eliminated
Step #2 out of 20
0: learn: 0.3070684 test: 0.3029727 best: 0.3029727 (0) total: 14.1ms remaining: 14.1s
999: learn: 0.0497732 test: 0.1094414 best: 0.1092710 (833) total: 14.1s remaining: 0us
bestTest = 0.1092709663
bestIteration = 833
Shrink model to first 834 iterations.
Feature #4 eliminated
Feature #22 eliminated
Feature #24 eliminated
Step #3 out of 20
0: learn: 0.3058962 test: 0.3017255 best: 0.3017255 (0) total: 14.2ms remaining: 14.1s
999: learn: 0.0515838 test: 0.1074155 best: 0.1074083 (998) total: 13.9s remaining: 0us
bestTest = 0.1074082848
bestIteration = 998
Shrink model to first 999 iterations.
Feature #12 eliminated
Feature #0 eliminated
Feature #7 eliminated
Step #4 out of 20
0: learn: 0.3062525 test: 0.3017927 best: 0.3017927 (0) total: 14.5ms remaining: 14.4s
999: learn: 0.0549456 test: 0.1092524 best: 0.1092519 (996) total: 12.8s remaining: 0us
bestTest = 0.1092518693
bestIteration = 996
Shrink model to first 997 iterations.
Feature #9 eliminated
Feature #3 eliminated
Step #5 out of 20
0: learn: 0.3069279 test: 0.3031549 best: 0.3031549 (0) total: 20.9ms remaining: 20.9s
999: learn: 0.0566761 test: 0.1092288 best: 0.1091824 (846) total: 12.5s remaining: 0us
bestTest = 0.1091824436
bestIteration = 846
Shrink model to first 847 iterations.
Feature #6 eliminated
Feature #11 eliminated
Step #6 out of 20
0: learn: 0.3065169 test: 0.3023901 best: 0.3023901 (0) total: 16.8ms remaining: 16.8s
999: learn: 0.0567811 test: 0.1114134 best: 0.1113041 (981) total: 14.9s remaining: 0us
bestTest = 0.1113040714
bestIteration = 981
Shrink model to first 982 iterations.
Feature #1 eliminated
Feature #25 eliminated
Step #7 out of 20
0: learn: 0.3063087 test: 0.3019867 best: 0.3019867 (0) total: 15.1ms remaining: 15.1s
999: learn: 0.0554371 test: 0.1123259 best: 0.1122143 (920) total: 31s remaining: 0us
bestTest = 0.1122143004
bestIteration = 920
Shrink model to first 921 iterations.
Feature #5 eliminated
Feature #8 eliminated
Step #8 out of 20
0: learn: 0.3068859 test: 0.3026476 best: 0.3026476 (0) total: 3.82ms remaining: 3.81s
999: learn: 0.0574991 test: 0.1152245 best: 0.1152077 (992) total: 2.1s remaining: 0us
bestTest = 0.1152077068
bestIteration = 992
Shrink model to first 993 iterations.
Feature #13 eliminated
Step #9 out of 20
0: learn: 0.3068033 test: 0.3025327 best: 0.3025327 (0) total: 2.82ms remaining: 2.81s
999: learn: 0.0640641 test: 0.1201366 best: 0.1201130 (689) total: 2.13s remaining: 0us
bestTest = 0.1201130073
bestIteration = 689
Shrink model to first 690 iterations.
Feature #19 eliminated
Step #10 out of 20
0: learn: 0.3069152 test: 0.3030754 best: 0.3030754 (0) total: 1.96ms remaining: 1.96s
999: learn: 0.0703171 test: 0.1242382 best: 0.1236966 (778) total: 2.13s remaining: 0us
bestTest = 0.1236965761
bestIteration = 778
Shrink model to first 779 iterations.
Feature #18 eliminated
Step #11 out of 20
0: learn: 0.3073732 test: 0.3033657 best: 0.3033657 (0) total: 2.05ms remaining: 2.05s
999: learn: 0.0750814 test: 0.1247368 best: 0.1246789 (969) total: 2.14s remaining: 0us
bestTest = 0.1246789365
bestIteration = 969
Shrink model to first 970 iterations.
Feature #10 eliminated
Step #12 out of 20
0: learn: 0.3073693 test: 0.3032811 best: 0.3032811 (0) total: 1.83ms remaining: 1.83s
999: learn: 0.0914355 test: 0.1353158 best: 0.1351822 (880) total: 2.25s remaining: 0us
bestTest = 0.1351822259
bestIteration = 880
Shrink model to first 881 iterations.
Step #13 out of 20
0: learn: 0.3073693 test: 0.3032811 best: 0.3032811 (0) total: 1.74ms remaining: 1.74s
999: learn: 0.0914355 test: 0.1353158 best: 0.1351822 (880) total: 2.4s remaining: 0us
bestTest = 0.1351822259
bestIteration = 880
Shrink model to first 881 iterations.
Feature #17 eliminated
Step #14 out of 20
0: learn: 0.3071755 test: 0.3033561 best: 0.3033561 (0) total: 1.86ms remaining: 1.86s
999: learn: 0.1071168 test: 0.1503970 best: 0.1496295 (720) total: 2.19s remaining: 0us
bestTest = 0.1496294939
bestIteration = 720
Shrink model to first 721 iterations.
Step #15 out of 20
0: learn: 0.3071755 test: 0.3033561 best: 0.3033561 (0) total: 2.74ms remaining: 2.74s
999: learn: 0.1071168 test: 0.1503970 best: 0.1496295 (720) total: 2.27s remaining: 0us
bestTest = 0.1496294939
bestIteration = 720
Shrink model to first 721 iterations.
Feature #14 eliminated
Step #16 out of 20
0: learn: 0.3069654 test: 0.3029158 best: 0.3029158 (0) total: 6.46ms remaining: 6.45s
999: learn: 0.1272584 test: 0.1579738 best: 0.1574747 (784) total: 2.15s remaining: 0us
bestTest = 0.1574747019
bestIteration = 784
Shrink model to first 785 iterations.
Step #17 out of 20
0: learn: 0.3069654 test: 0.3029158 best: 0.3029158 (0) total: 2.92ms remaining: 2.92s
999: learn: 0.1272584 test: 0.1579738 best: 0.1574747 (784) total: 2.2s remaining: 0us
bestTest = 0.1574747019
bestIteration = 784
Shrink model to first 785 iterations.
Step #18 out of 20
0: learn: 0.3069654 test: 0.3029158 best: 0.3029158 (0) total: 1.77ms remaining: 1.77s
999: learn: 0.1272584 test: 0.1579738 best: 0.1574747 (784) total: 2.15s remaining: 0us
bestTest = 0.1574747019
bestIteration = 784
Shrink model to first 785 iterations.
Feature #16 eliminated
Step #19 out of 20
0: learn: 0.3099541 test: 0.3058748 best: 0.3058748 (0) total: 4.3ms remaining: 4.3s
999: learn: 0.2075860 test: 0.2206064 best: 0.2148488 (114) total: 2.06s remaining: 0us
bestTest = 0.2148488447
bestIteration = 114
Shrink model to first 115 iterations.
Step #20 out of 20
0: learn: 0.3099541 test: 0.3058748 best: 0.3058748 (0) total: 4.43ms remaining: 4.43s
999: learn: 0.2075860 test: 0.2206064 best: 0.2148488 (114) total: 2.13s remaining: 0us
bestTest = 0.2148488447
bestIteration = 114
Shrink model to first 115 iterations.
Based on our evaluation, most of the features, including the newly engineered ones, are worth retaining.
However, we should remove two features, kitchen_type and toilets, since our analysis indicates that better features have been incorporated in their place.
These feature selections should help optimize our model even further.
Based on these insights, we crafted the prepare_data_for_modelling
function, which is stored in the pre_process.py
file. This function includes the feature engineering steps we discussed, setting the stage for effective modeling performance.
def prepare_data_for_modelling(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
"""
Prepare data for machine learning modeling.
This function takes a DataFrame and prepares it for machine learning by performing the following steps:
1. Randomly shuffles the rows of the DataFrame.
2. Converts the 'price' column to the base 10 logarithm.
3. Fills missing values in categorical variables with 'missing value'.
4. Separates the features (X) and the target (y).
5. Identifies and filters out outlier values based on LocalOutlierFactor.
Parameters:
- df (pd.DataFrame): The input DataFrame containing the dataset.
Returns:
- Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).
Example use case:
# Load your dataset into a DataFrame (e.g., df)
df = load_data()
# Prepare the data for modeling
X, y = prepare_data_for_modelling(df)
# Now you can use X and y for machine learning tasks.
"""
processed_df = (
df.sample(frac=1, random_state=utils.Configuration.seed)
.reset_index(drop=True)
.assign(
price=lambda df: np.log10(df.price),
city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
"median"
),
building_condition_group=lambda df: df.groupby("building_condition")[
"yearly_theoretical_total_energy_consumption"
].transform("median"),
energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
/ df.primary_energy_consumption,
energy_efficiency_2=lambda df: df.primary_energy_consumption
/ df.living_area,
bargain_1=lambda df: df.cadastral_income / df.bedrooms,
bargain_2=lambda df: df.cadastral_income / df.living_area,
)
)
# Fill missing categorical variables with "missing value"
for col in processed_df.columns:
if processed_df[col].dtype.name in ("bool", "object", "category"):
processed_df[col] = processed_df[col].fillna("missing value")
# Separate features (X) and target (y)
X = processed_df.loc[:, utils.Configuration.features_to_keep_v2]
y = processed_df[utils.Configuration.target_col]
outlier_mask = pre_process.identify_outliers(X)
X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)
print(f"Shape of X and y with outliers: {X.shape}, {y.shape}")
print(
f"Shape of X and y without outliers: {X_wo_outliers.shape}, {y_wo_outliers.shape}"
)
return X_wo_outliers, y_wo_outliers
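The shapes printed below presumably come from calling this function on the full scraped dataset, roughly like this (df being the dataset loaded earlier):
# Sketch: build the final modelling dataset
X_wo_outliers, y_wo_outliers = prepare_data_for_modelling(df)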
Shape of X and y with outliers: (3660, 14), (3660,)
Shape of X and y without outliers: (3427, 14), (3427,)
In Part 4, we began with an initial selection of 16 features based on the work in Part 3. However, as we conclude this article, we've found that we can streamline our feature set even further by removing kitchen_type
and toilets
, resulting in improved performance thanks to the addition of new features. While there's potential for further optimization, such as dimensionality reduction, we are happy with our progress for now. In the next, and final, part we will focus on fine-tuning our model for optimal predictive performance. See you there!