Peek My Home Price Part-4: Feature engineering

Belgian Housing Market Insights
Author

Adam Cseresznye

Published

November 23, 2024

Peek My Home Price Home Page

In Part 3, we looked at the significance of features in the initial scraped dataset using both the feature_importances_ method of CatBoostRegressor and SHAP values. We conducted feature elimination based on their importance and predictive capability.

In this part, we’ll implement a robust cross-validation strategy so we can evaluate our model’s performance accurately and consistently across multiple folds of the data. We will also identify and address potential outliers in the dataset, which is crucial to keep them from unduly influencing the model’s predictions.

Additionally, we’ll further refine and expand our feature engineering efforts by exploring new methodologies to create informative features that increase our model’s predictive capabilities. Looking forward to these steps!

Note

You can explore the project’s app on its website. For more details, visit the GitHub repository.

Check out the series for a deeper dive:

  • Part 1: Characterizing the Data
  • Part 2: Building a Baseline Model
  • Part 3: Feature Selection
  • Part 4: Feature Engineering
  • Part 5: Fine-Tuning

Code
import sys
from pathlib import Path

sys.path.append(str(Path.cwd()))
Code
import gc
import itertools
from pathlib import Path
from typing import List, Optional, Tuple

import catboost
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import clear_output
from lets_plot import *
from lets_plot.mapping import as_discrete
from sklearn import (
    cluster,
    compose,
    ensemble,
    impute,
    metrics,
    model_selection,
    neighbors,
    pipeline,
    preprocessing,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm.notebook import tqdm

from helper import feature_engineering, pre_process, train_model, utils

LetsPlot.setup_html()

Prepare dataframe before modelling

Read in dataframe

Drawing on our findings in Part 3, particularly our initial feature reduction efforts, we’ve developed a data-preparation function. The version shown below is called prepare_data; its final form, prepare_data_for_modelling, resides in the pre_process.py file so it can be reused. The function performs essential data preprocessing steps, which include:

  1. Randomly shuffling the rows in the DataFrame.
  2. Transforming the ‘price’ column by taking the base 10 logarithm.
  3. Handling missing values in categorical variables by replacing them with ‘missing value.’
  4. Separating the dataset into features (X) and the target variable (y).

Let’s dive into the details of this function and prepare our X and y for the subsequent processing pipeline.

Code
df = pd.read_parquet(
    Path.cwd().joinpath("data").joinpath("2023-10-01_Processed_dataset_for_NB_use.gzip")
)
Code
def prepare_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Fills missing values in categorical variables with 'missing value'.
    4. Separates the features (X) and the target (y).

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
    - Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).

    Example use case:
    # Load your dataset into a DataFrame (e.g., df)
    df = load_data()

    # Prepare the data for modeling
    X, y = prepare_data(df)

    # Now you can use X and y for machine learning tasks.
    """

    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(price=lambda df: np.log10(df.price))
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v1]
    y = processed_df[utils.Configuration.target_col]

    print(f"Shape of X and y: {X.shape}, {y.shape}")

    return X, y
Code
X, y = prepare_data(df)
Shape of X and y: (3660, 16), (3660,)

Cross-validation strategy

Our next critical step is to establish a well-structured cross-validation strategy. This step is imperative as it enables us to assess the effectiveness of various feature engineering approaches without risking overfitting our model. A robust cross-validation strategy ensures that our model’s performance evaluations are reliable and that the insights gained are generalizable to new data. To accomplish this, we will employ RepeatedKFold validation, setting the parameters with n_splits as 10 and n_repeats as 1.

In essence, this configuration means we run a single pass of 10-fold cross-validation (n_repeats=1, so the procedure is not repeated with different shuffles). Importantly, because the function is modular, we retain the flexibility to adapt and alter the design of our cross-validation strategy as needed.

Code
def run_catboost_CV(
    X: pd.DataFrame,
    y: pd.Series,
    n_splits: int = 10,
    n_repeats: int = 1,
    pipeline: Optional[object] = None,
) -> Tuple[float, float]:
    """
    Perform Cross-Validation with CatBoost for regression.

    This function conducts Cross-Validation using CatBoost for regression tasks. It iterates
    through folds, trains CatBoost models, and computes the mean and standard deviation of the
    Root Mean Squared Error (RMSE) scores across folds.

    Parameters:
    - X (pd.DataFrame): The feature matrix.
    - y (pd.Series): The target variable.
    - n_splits (int, optional): The number of splits in K-Fold cross-validation.
      Defaults to 10.
    - n_repeats (int, optional): The number of times the K-Fold cross-validation is repeated.
      Defaults to 1.
    - pipeline (object, optional): Optional data preprocessing pipeline. If provided,
      it's applied to the data before training the model. Defaults to None.

    Returns:
    - Tuple[float, float]: A tuple containing the mean RMSE and standard deviation of RMSE
      scores across cross-validation folds.

    Example:
    # Load your feature matrix (X) and target variable (y)
    X, y = load_data()

    # Perform Cross-Validation with CatBoost
    mean_rmse, std_rmse = run_catboost_CV(X, y, n_splits=5, n_repeats=2, pipeline=data_pipeline)

    print(f"Mean RMSE: {mean_rmse:.4f}")
    print(f"Standard Deviation of RMSE: {std_rmse:.4f}")

    Notes:
    - Ensure that the input data `X` and `y` are properly preprocessed and do not contain any
      missing values.
    - The function uses CatBoost for regression with optional data preprocessing via the `pipeline`.
    - RMSE is a common metric for regression tasks, and lower values indicate better model
      performance.
    """
    results = []

    # Extract feature names and data types
    features = X.columns[~X.columns.str.contains("price")]
    numerical_features = X.select_dtypes("number").columns.to_list()
    categorical_features = X.select_dtypes("object").columns.to_list()

    # Create a K-Fold cross-validator
    CV = model_selection.RepeatedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=utils.Configuration.seed
    )

    for train_fold_index, val_fold_index in tqdm(CV.split(X)):
        X_train_fold, X_val_fold = X.loc[train_fold_index], X.loc[val_fold_index]
        y_train_fold, y_val_fold = y.loc[train_fold_index], y.loc[val_fold_index]

        # Apply optional data preprocessing pipeline
        if pipeline is not None:
            X_train_fold = pipeline.fit_transform(X_train_fold)
            X_val_fold = pipeline.transform(X_val_fold)

        # Create CatBoost datasets
        catboost_train = catboost.Pool(
            X_train_fold,
            y_train_fold,
            cat_features=categorical_features,
        )
        catboost_valid = catboost.Pool(
            X_val_fold,
            y_val_fold,
            cat_features=categorical_features,
        )

        # Initialize and train the CatBoost model
        model = catboost.CatBoostRegressor(**utils.Configuration.catboost_params)
        model.fit(
            catboost_train,
            eval_set=[catboost_valid],
            early_stopping_rounds=utils.Configuration.early_stopping_round,
            verbose=utils.Configuration.verbose,
            use_best_model=True,
        )

        # Calculate OOF validation predictions
        valid_pred = model.predict(X_val_fold)

        RMSE_score = metrics.root_mean_squared_error(y_val_fold, valid_pred)

        results.append(RMSE_score)

    return np.mean(results), np.std(results)

Now, let’s proceed to train our model with the updated settings:

Code
train_model.run_catboost_CV(X, y)
(0.11251233080551612, 0.004459362099695207)
Note

Note that we’ve reduced the number of iterations in Notebook 6 compared to Notebook 5 to minimize the training duration:

  • Notebook 5: iterations = 1000, default learning rate = 0.03
  • Notebook 6: iterations = 100, learning rate = 0.2

The cross-validated RMSE of this baseline model: 0.1125

Outlier detection

An outlier is a data point that significantly differs from the rest of the data. One common way to define an outlier is a data point that falls more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. Detecting and removing outliers from the dataset is crucial for building a stable model that can effectively generalize to new data.
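
To make the 1.5 × IQR fence concrete, here is a minimal sketch applied to a single column of our feature matrix (living_area is chosen purely for illustration; the outlier handling we actually use below relies on LocalOutlierFactor instead):

Code
# Minimal sketch of the 1.5 * IQR rule on one numerical column.
q1, q3 = X["living_area"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_outliers = (X["living_area"] < lower) | (X["living_area"] > upper)
print(f"Rows flagged by the IQR rule: {iqr_outliers.sum()}")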

When we create a scatter plot of our features (as shown in Figure 1), such as cadastral income against living area, and scale the points’ color and size by price, we can spot at least two data points that deviate markedly from the expected range of values. One suggests a 300 m2 property with a cadastral income exceeding 320,000 EUR, while the other indicates a cadastral income of 2,500 EUR for an 11,000 m2 property. Both observations seem implausible compared to the majority of data points on the graph.

Code
pd.concat([X, y], axis=1).pipe(
    lambda df: ggplot(
        df, aes("cadastral_income", "living_area", fill="price", size="price")
    )
    + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
    + scale_fill_continuous(low="#1a9641", high="#d7191c")
    + labs(
        title="Assessing Potential Outliers",
        subtitle=""" Outliers pose a challenge for gradient boosting methods since boosting constructs each tree based on the errors of the previous trees. 
        Outliers, having significantly larger errors than non-outliers, can excessively divert the model's attention toward these data points.
            """,
        x="Cadastral income (EUR)",
        y="Living area (m2)",
    )
    + theme(
        plot_subtitle=element_text(
            size=12, face="italic"
        ),  # Customize subtitle appearance
        plot_title=element_text(size=15, face="bold"),  # Customize title appearance
    )
    + ggsize(800, 600)
)
Figure 1: Assessing Potential Outliers

For identifying potential outliers within our data, we can employ Scikit-learn’s LocalOutlierFactor. This algorithm, known as the Local Outlier Factor (LOF), is an unsupervised technique for anomaly detection. It assesses the local density deviation of a data point relative to its neighboring points. LOF identifies outliers as those data points demonstrating notably lower density in comparison to their neighbors.
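
To illustrate the labelling convention LOF uses (and that identify_outliers relies on below), here is a small, hedged sketch run on the median-imputed numerical columns only; the actual function also preprocesses the categorical features:

Code
# Standalone sketch: fit_predict returns -1 for points flagged as outliers and 1 for inliers.
num_only = impute.SimpleImputer(strategy="median").fit_transform(
    X.select_dtypes("number")
)
labels = neighbors.LocalOutlierFactor().fit_predict(num_only)
print(pd.Series(labels).value_counts())  # -1 = candidate outlier, 1 = inlier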

In the provided code, we’ve created a function called identify_outliers. This function generates a mask that we can use to filter out data points potentially flagged as outliers.

Code
def identify_outliers(df: pd.DataFrame) -> pd.Series:
    """
    Identify outliers in a DataFrame.

    This function uses a Local Outlier Factor (LOF) algorithm to identify outliers in a given
    DataFrame. It operates on both numerical and categorical features, and it returns a Boolean
    Series where `True` marks a non-outlier (a row to keep) and `False` marks an outlier.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing features for outlier identification.

    Returns:
    - pd.Series: A Boolean Series indicating non-outliers (True) and outliers (False).

    Example:
    # Load your DataFrame with features (df)
    df = load_data()

    # Identify outliers using the function
    outlier_mask = identify_outliers(df)

    # Use the outlier mask to filter your DataFrame
    filtered_df = df[outlier_mask]  # Keep non-outliers

    Notes:
    - The function uses Local Outlier Factor (LOF) with default parameters for identifying outliers.
    - Numerical features are imputed using median values, and categorical features are one-hot encoded
      and imputed with median values.
    - The resulting Boolean Series is `True` for non-outliers and `False` for outliers.
    """
    # Extract numerical and categorical feature names
    NUMERICAL_FEATURES = df.select_dtypes("number").columns.tolist()
    CATEGORICAL_FEATURES = df.select_dtypes("object").columns.tolist()

    # Define transformers for preprocessing
    numeric_transformer = pipeline.Pipeline(
        steps=[("imputer", impute.SimpleImputer(strategy="median"))]
    )

    categorical_transformer = pipeline.Pipeline(
        steps=[
            ("encoder", preprocessing.OneHotEncoder(handle_unknown="ignore")),
            ("imputer", impute.SimpleImputer(strategy="median")),
        ]
    )

    # Create a ColumnTransformer to handle both numerical and categorical features
    preprocessor = compose.ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUMERICAL_FEATURES),
            ("cat", categorical_transformer, CATEGORICAL_FEATURES),
        ]
    )

    # Initialize the LocalOutlierFactor model
    clf = neighbors.LocalOutlierFactor()

    # Fit LOF to preprocessed data and make predictions
    y_pred = clf.fit_predict(preprocessor.fit_transform(df))

    # Adjust LOF predictions to create a binary outlier mask
    y_pred_adjusted = [1 if x == -1 else 0 for x in y_pred]
    outlier_mask = pd.Series(y_pred_adjusted) == 0

    return outlier_mask

As a comparison, here’s the scatter plot after removing outliers. It appears that the LocalOutlierFactor method was effective in addressing the outlier data points.

Code
outlier_mask = pre_process.identify_outliers(X)

X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)


(
    pd.concat([X_wo_outliers, y_wo_outliers], axis=1)
    # .loc[lambda df: pre_process.identify_outliers(df.loc[:, :"living_area"])]
    .pipe(
        lambda df: ggplot(
            df, aes("cadastral_income", "living_area", fill="price", size="price")
        )
        + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
        + scale_fill_continuous(low="#1a9641", high="#d7191c")
        + labs(
            title="Assessing Potential Outliers",
            subtitle=""" By employing the default parameters of LocalOutlierFactor, we've reduced our training set from 3660 instances to 3427.
            This is expected to enhance our model's performance and its ability to generalize well to new data.
            """,
            x="Cadastral income (EUR)",
            y="Living area (m2)",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 600)
    )
)

Now, let’s assess whether our efforts to improve the model by addressing outliers have enhanced its predictive capabilities:

Code
train_model.run_catboost_CV(X_wo_outliers, y_wo_outliers)
(0.11052917780860605, 0.004569457889717371)
Note

By removing the outliers, our cross-validation RMSE score decreased from 0.1125 to 0.1105.

Feature Engineering

Feature engineering is vital in machine learning as it directly influences a model’s performance and predictive capabilities. By crafting and selecting relevant features, it allows the model to capture meaningful patterns and relationships within the data. Effective feature engineering helps improve model accuracy, enhances its ability to generalize to new data, and enables the extraction of valuable insights, ultimately driving the success and efficacy of machine learning algorithms.

Feature Engineering ideas we will test in this section:

  • Utilize categorical columns for grouping and transform each numerical variable based on the median.
  • Generate bins from the continuous variables and apply the same process as described above.
  • Introduce polynomial features, either individually with a single feature or in combinations of two features.
  • Form clusters of instances using k-means clustering to capture data similarities and use these clusters as additional features.
  • Implement other ideas derived from empirical observations or assumptions.

Utilize categorical columns for grouping and transform each numerical variable based on the median

The idea behind this feature engineering step is to leverage categorical columns as grouping criteria and then calculate the median value for each numerical variable within each group. By doing so, it aims to create new features that capture the central tendency of the numerical data for different categories, allowing the model to better understand and utilize the inherent patterns and variations within the data.
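
At its core this is a single groupby/transform call. Here is a minimal sketch (the new column name is illustrative); the same transformation later becomes the city_group feature in our final pipeline:

Code
# Minimal sketch: broadcast a per-group statistic back to every row.
demo = X_wo_outliers.copy()
demo["city_median_cadastral_income"] = demo.groupby("city")[
    "cadastral_income"
].transform("median")
demo[["city", "cadastral_income", "city_median_cadastral_income"]].head()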

Code
# Number of unique categories per categorical variables:

X_wo_outliers.select_dtypes("object").nunique()
state                   9
kitchen_type            9
street                456
building_condition      7
city                  230
dtype: int64
Code
def FE_categorical_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform categorical features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming categorical features using CatBoost
    Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
    Squared Error (RMSE) scores for various combinations of categorical and numerical features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both categorical and numerical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The aggregation applied in the groupby/transform step, such as
      "mean" or "median". Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "categorical," and "numerical,"
      sorted by "mean_OOFs" in ascending order.

    Example:
    # Load your DataFrame with features (X) and target variable (y)
    X, y = load_data()

    # Perform feature engineering
    result_df = FE_categorical_transform(X, y, transform_type="mean")

    # View the DataFrame with sorted results
    print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of transformations for
      various combinations of categorical and numerical features.
    - The resulting DataFrame provides insights into the effectiveness of different transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of categorical and numerical columns
    categorical_columns = X.select_dtypes("object").columns
    numerical_columns = X.select_dtypes("number").columns

    # Combine the loops to have a single progress bar
    for categorical in tqdm(categorical_columns, desc="Progress"):
        for numerical in tqdm(numerical_columns):
            # Create a deep copy of the input data
            temp = X.copy(deep=True)

            # Calculate the transformation for each group within the categorical column
            temp["new_column"] = temp.groupby(categorical)[numerical].transform(
                transform_type
            )

            # Run CatBoost Cross-Validation with the transformed data
            mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

            # Store the results as a tuple
            result = (mean_OOF, std_OOF, categorical, numerical)
            results.append(result)

            del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results, columns=["mean_OOFs", "std_OOFs", "categorical", "numerical"]
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
Note

Please bear in mind that these feature engineering steps were precomputed because of the considerable computation time they require. The cells below are skipped during notebook rendering and the saved outcomes are loaded instead; the results themselves are unchanged.

Code
%%script echo skipping

FE_categorical_transform_mean = feature_engineering.FE_categorical_transform(
    X_wo_outliers, y_wo_outliers
)
Code
%%script echo skipping

FE_categorical_transform_mean.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_categorical_transform_mean")}.parquet.gzip',
    compression="gzip",
)

FE_categorical_transform_mean.head(15)

As evident, the best result was obtained by treating the city feature as a categorical variable and calculating the median of cadastral_income based on this categorization. This result aligns logically with the feature importances seen in Part 3.

Code
pd.read_parquet(
    Path.cwd().joinpath("data").joinpath("FE_categorical_transform_mean.parquet.gzip")
).head()
mean_OOFs std_OOFs categorical numerical
53 0.108973 0.006262 city cadastral_income
39 0.108985 0.004980 building_condition yearly_theoretical_total_energy_consumption
33 0.109381 0.005434 building_condition bedrooms
31 0.109478 0.004887 street cadastral_income
43 0.109540 0.004944 building_condition living_area

Generate bins from the continuous variables

The idea behind this feature engineering step is to discretize continuous variables by creating bins or categories from their values. These bins then serve as categorical columns. By using these new categorical columns for grouping, we can transform each numerical variable by replacing its values with the median of the respective category it belongs to, just like the feature engineering method we demonstrated above.
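
A minimal sketch of this two-step idea (the column pair and the new column names are illustrative; the bin count follows the same Sturges-style rule used in the function below):

Code
# Minimal sketch: discretize one continuous column, then broadcast a grouped
# statistic of another column over the resulting bins.
n_bins = int(np.floor(np.log2(X_wo_outliers.shape[0])) + 1)

binner = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="median"),
    preprocessing.KBinsDiscretizer(n_bins=n_bins, encode="ordinal"),
)

demo = X_wo_outliers.copy()
demo["living_area_bin"] = binner.fit_transform(demo[["living_area"]]).ravel()
demo["bin_mean_cadastral_income"] = demo.groupby("living_area_bin")[
    "cadastral_income"
].transform("mean")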

Code
def FE_continuous_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform continuous features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming continuous features using CatBoost
    Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
    Squared Error (RMSE) scores for various combinations of discretized and transformed continuous
    features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both continuous and categorical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The aggregation applied in the groupby/transform step, such as
      "mean" or "median". Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "discretized_continuous,"
      and "transformed_continuous," sorted by "mean_OOFs" in ascending order.

    Example:
    # Load your DataFrame with features (X) and target variable (y)
    X, y = load_data()

    # Perform feature engineering
    result_df = FE_continuous_transform(X, y, transform_type="mean")

    # View the DataFrame with sorted results
    print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of transformations for
      various combinations of discretized and transformed continuous features.
    - The number of bins for discretization is determined using Sturges' rule.
    - The resulting DataFrame provides insights into the effectiveness of different transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    continuous_columns = X.select_dtypes("number").columns
    optimal_bins = int(np.floor(np.log2(X.shape[0])) + 1)

    # Combine the loops to have a single progress bar
    for discretized_continuous in tqdm(continuous_columns, desc="Progress:"):
        for transformed_continuous in tqdm(continuous_columns):
            if discretized_continuous != transformed_continuous:
                # Create a deep copy of the input data
                temp = X.copy(deep=True)

                discretizer = pipeline.Pipeline(
                    steps=[
                        ("imputer", impute.SimpleImputer(strategy="median")),
                        (
                            "add_bins",
                            preprocessing.KBinsDiscretizer(
                                encode="ordinal", n_bins=optimal_bins
                            ),
                        ),
                    ]
                )

                temp[discretized_continuous] = discretizer.fit_transform(
                    X[[discretized_continuous]]
                )

                # Calculate the transformation for each group within the categorical column
                temp["new_column"] = temp.groupby(discretized_continuous)[
                    transformed_continuous
                ].transform(transform_type)

                # Run CatBoost Cross-Validation with the transformed data
                mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

                # Store the results as a tuple
                result = (
                    mean_OOF,
                    std_OOF,
                    discretized_continuous,
                    transformed_continuous,
                )
                results.append(result)

                del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "discretized_continuous",
            "transformed_continuous",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
Code
%%script echo skipping

FE_continuous_transform_mean = feature_engineering.FE_continuous_transform(
    X_wo_outliers, y_wo_outliers
)

FE_continuous_transform_mean.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_continuous_transform_mean")}.parquet.gzip',
    compression="gzip",
)

FE_continuous_transform_mean.head(15)

This approach was not as effective as our prior method. However, combining bathrooms with yearly_theoretical_total_energy_consumption yielded the best outcome.

Code
pd.read_parquet(
    Path.cwd().joinpath("data").joinpath("FE_continuous_transform_mean.parquet.gzip")
).head(10)
mean_OOFs std_OOFs discretized_continuous transformed_continuous
55 0.109328 0.005094 bathrooms yearly_theoretical_total_energy_consumption
59 0.109328 0.005094 bathrooms living_area
58 0.109328 0.005094 bathrooms cadastral_income
57 0.109328 0.005094 bathrooms lat
56 0.109328 0.005094 bathrooms surface_of_the_plot
50 0.109328 0.005094 bathrooms bedrooms
52 0.109328 0.005094 bathrooms toilets
39 0.109417 0.004831 lng living_area
1 0.109426 0.004587 bedrooms toilets
4 0.109426 0.004587 bedrooms bathrooms

Introduce polynomial features

The idea behind introducing polynomial features is to capture non-linear relationships within the data. By raising individual features to higher powers or considering interactions between pairs of features, this step allows the model to better represent complex patterns that cannot be adequately expressed with linear relationships alone. It enhances the model’s ability to learn and predict outcomes that exhibit curvilinear or interactive behavior.
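
To make the expansion concrete, here is a minimal sketch of what PolynomialFeatures produces for a pair of columns (the column choice is purely illustrative):

Code
# Minimal sketch: degree-2 expansion of two illustrative columns yields six terms:
# 1, x1, x2, x1^2, x1*x2, x2^2.
poly = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="median"),
    preprocessing.PolynomialFeatures(degree=2, interaction_only=False),
)
expanded = poly.fit_transform(X_wo_outliers[["living_area", "bedrooms"]])
print(expanded.shape)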

Code
def FE_polynomial_features(
    X: pd.DataFrame, y: pd.Series, combinations: int = 1
) -> pd.DataFrame:
    """
    Generate polynomial features for combinations of numerical columns and train a CatBoost model.

    Parameters:
        X (pd.DataFrame): The input DataFrame with features.
        y (pd.Series): The target variable.
        combinations (int, optional): The number of combinations of numerical columns. Default is 1.

    Returns:
        pd.DataFrame: A DataFrame containing results sorted by mean OOF scores.

    Example:
        X_wo_outliers = pd.DataFrame(...)  # Your input data
        y_wo_outliers = pd.Series(...)  # Your target variable
        result = FE_polynomial_features(X_wo_outliers, y_wo_outliers)

    Transformations:
        - Imputes missing values in numerical columns using the median.
        - Generates polynomial features, including interaction terms, for selected numerical columns.
        - Trains a CatBoost model and calculates mean and standard deviation of out-of-fold (OOF) scores.
    """

    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Combine the loops to have a single progress bar
    for numerical_col in tqdm(
        list(itertools.combinations(numerical_columns, r=combinations))
    ):
        polyfeatures = compose.make_column_transformer(
            (
                pipeline.make_pipeline(
                    impute.SimpleImputer(strategy="median"),
                    preprocessing.PolynomialFeatures(interaction_only=False),
                ),
                list(numerical_col),
            ),
            remainder="passthrough",
        ).set_output(transform="pandas")

        temp = polyfeatures.fit_transform(X)
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            numerical_col,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "numerical_col",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

n=1

Let’s see the impact of applying polynomial feature engineering to a single feature.

Code
%%script echo skipping

FE_polynomial_features_combinations_1 = FE_polynomial_features(
    X_wo_outliers, y_wo_outliers
)

FE_polynomial_features_combinations_1.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_1")}.parquet.gzip',
    compression="gzip",
)

FE_polynomial_features_combinations_1.head(15)
Code
pd.read_parquet(
    Path.cwd()
    .joinpath("data")
    .joinpath("FE_polynomial_features_combinations_1.parquet.gzip")
).head(10)
mean_OOFs std_OOFs numerical_col
10 0.109938 0.005047 [living_area]
9 0.110339 0.004012 [cadastral_income]
3 0.110628 0.004018 [lng]
0 0.111066 0.004765 [bedrooms]
8 0.111099 0.005039 [lat]
4 0.111166 0.004879 [primary_energy_consumption]
6 0.111271 0.004908 [yearly_theoretical_total_energy_consumption]
1 0.111276 0.005359 [number_of_frontages]
7 0.111332 0.004815 [surface_of_the_plot]
2 0.111782 0.004741 [toilets]

n=2

How about two features combined…

Code
%%script echo skipping

FE_polynomial_features_combinations_2 = FE_polynomial_features(
    X_wo_outliers, y_wo_outliers, combinations=2
)

FE_polynomial_features_combinations_2.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_2")}.parquet.gzip',
    compression="gzip",
)

FE_polynomial_features_combinations_2.head(15)
Code
pd.read_parquet(
    Path.cwd()
    .joinpath("data")
    .joinpath("FE_polynomial_features_combinations_2.parquet.gzip")
).head(10)
mean_OOFs std_OOFs numerical_col
31 0.109413 0.004690 [lng, lat]
9 0.109625 0.005558 [bedrooms, living_area]
53 0.109809 0.005321 [lat, living_area]
52 0.109814 0.003485 [lat, cadastral_income]
7 0.109847 0.005399 [bedrooms, lat]
19 0.109962 0.004999 [toilets, lng]
17 0.110057 0.004554 [number_of_frontages, cadastral_income]
42 0.110124 0.004511 [bathrooms, lat]
46 0.110128 0.004944 [yearly_theoretical_total_energy_consumption, ...
28 0.110154 0.004644 [lng, bathrooms]

Form clusters of instances using k-means clustering

The idea behind using k-means clustering in feature engineering is to group data points into clusters based on their similarity. By doing so, we create a new set of features that represent these clusters, which can capture patterns or relationships within the data that might be less apparent in the original features. These cluster features can be valuable for machine learning models, as they provide a more compact and informative representation of the data, potentially improving predictive performance.
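
As a minimal sketch of the idea (clustering only the median-imputed numerical columns, with n_clusters=5 chosen purely for illustration; the full function below also one-hot encodes the categorical columns first):

Code
# Minimal sketch: append the k-means cluster label as an extra, illustrative feature.
num_imputed = impute.SimpleImputer(strategy="median").fit_transform(
    X_wo_outliers.select_dtypes("number")
)
km = cluster.KMeans(n_clusters=5, n_init=10, random_state=utils.Configuration.seed)
demo = X_wo_outliers.assign(kmeans_group=km.fit_predict(num_imputed))
demo["kmeans_group"].value_counts()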

Code
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
Code
def FE_KMeans(
    X: pd.DataFrame,
    y: pd.Series,
    n_clusters_min: int = 1,
    n_clusters_max: int = 8,
) -> pd.DataFrame:
    """Performs K-Means clustering-based feature engineering followed by model training.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.
        n_clusters_min (int, optional): The minimum number of clusters to consider. Defaults to 1.
        n_clusters_max (int, optional): The maximum number of clusters to consider. Defaults to 8.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering with K-Means clustering.

    Example:
        >>> results_df = FE_KMeans(X_wo_outliers, y_wo_outliers)
    """
    # Initialize a list to store results
    results = []

    # Get lists of numerical and categorical columns
    numerical_columns = X.head().select_dtypes("number").columns.to_list()
    categorical_columns = X.head().select_dtypes("object").columns.to_list()

    for n_cluster in tqdm(range(n_clusters_min, n_clusters_max)):
        # Prepare pipelines for corresponding columns:
        numerical_pipeline = pipeline.Pipeline(
            steps=[
                ("num_selector", FeatureSelector(numerical_columns)),
                ("imputer", impute.SimpleImputer(strategy="median")),
            ]
        )

        categorical_pipeline = pipeline.Pipeline(
            steps=[
                ("cat_selector", FeatureSelector(categorical_columns)),
                ("imputer", impute.SimpleImputer(strategy="most_frequent")),
                (
                    "onehot",
                    preprocessing.OneHotEncoder(
                        handle_unknown="ignore", sparse_output=False
                    ),
                ),
            ]
        )

        # Put all the pipelines inside a FeatureUnion:
        data_preprocessing_pipeline = pipeline.FeatureUnion(
            n_jobs=-1,
            transformer_list=[
                ("numerical_pipeline", numerical_pipeline),
                ("categorical_pipeline", categorical_pipeline),
            ],
        )

        temp = pd.DataFrame(data_preprocessing_pipeline.fit_transform(X))

        KMeans = cluster.KMeans(n_init=10, n_clusters=n_cluster)
        KMeans.fit_transform(temp)

        groups = pd.Series(KMeans.labels_, name="groups")

        concatenated_df = pd.concat([temp, groups], axis="columns")

        mean_OOF, std_OOF = train_model.run_catboost_CV(X=concatenated_df, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            n_cluster,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF, KMeans, groups, concatenated_df, result

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "n_cluster",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
Code
%%script echo skipping

FE_KMeans_df = FE_KMeans(
    X_wo_outliers,
    y_wo_outliers,
    n_clusters_min=1,
    n_clusters_max=101,
)

FE_KMeans_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KNN_df")}.parquet.gzip',
    compression="gzip",
)

FE_KMeans_df.head(15)

As k-means clustering is an unsupervised algorithm, determining an appropriate k requires testing a range of values and assessing their impact on our validation scores. As observed, this approach didn’t yield meaningful gains in our case: the best validation score was obtained with n_clusters = 1, i.e., effectively no clustering at all.

Code
pd.read_parquet(Path.cwd().joinpath("data").joinpath("FE_KNN_df.parquet.gzip")).head(10)
mean_OOFs std_OOFs n_cluster
0 0.111884 0.005143 1
1 0.112153 0.006174 2
5 0.112264 0.005858 6
24 0.112313 0.005602 25
7 0.112326 0.005124 8
43 0.112333 0.005819 44
31 0.112352 0.005588 32
11 0.112366 0.005206 12
95 0.112454 0.006621 96
26 0.112472 0.006096 27

Implement other ideas derived from empirical observations or assumptions

Though new features can be generated through systematic methods, domain knowledge can also inspire their creation. The idea behind this is to allow for the incorporation of unconventional or domain-specific insights that may not fit standard feature engineering techniques. It encourages the exploration of novel features or transformations based on practical experiences or theoretical assumptions to potentially uncover hidden patterns or relationships within the data. This open-ended approach can lead to creative and tailored feature engineering solutions.

Here are some ideas to consider:

  1. Geospatial Features:
    • Create clusters or neighborhoods based on features to capture similarities.
  2. Area-related Features:
    • Calculate the ratio of “living_area” to “surface_of_the_plot” to get an idea of the density or spaciousness of the property.
  3. Energy Efficiency Features:
    • Compute the energy efficiency ratio by dividing “yearly_theoretical_total_energy_consumption” by “primary_energy_consumption.”
    • Compute energy efficiency by dividing primary_energy_consumption by living_area.
  4. Toilet and Bathroom Features:
    • Combine “toilets” and “bathrooms” into a single “total_bathrooms” feature to simplify the model.
    • Calculate total number of rooms by adding up bedrooms + toilets + bathrooms
  5. Taxation Features:
    • Incorporate “cadastral_income” as a measure of property value for taxation. You can create bins or categories for this variable.
  6. Value for Money:
    • Divide cadastral_income by bedrooms to see if the property is a good bargain
    • Similarly, divide cadastral_income by living_area.
Code
def FE_ideas(X):
    """Performs additional feature engineering on the input DataFrame.

    Args:
        X (pd.DataFrame): The input DataFrame containing the original features.

    Returns:
        pd.DataFrame: A DataFrame with additional engineered features.

    Example:
        >>> engineered_data = FE_ideas(original_data)
    """
    temp = X.assign(
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )
    return temp.loc[:, "energy_efficiency_1":]
Code
def FE_try_ideas(
    X: pd.DataFrame,
    y: pd.Series,
) -> pd.DataFrame:
    """Performs feature engineering experiments by adding new features and evaluating their impact on model performance.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering experiments.

    Example:
        >>> results_df = FE_try_ideas(X, y)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Apply additional feature engineering ideas
    feature_df = FE_ideas(X)

    for feature in tqdm(feature_df.columns):
        # Concatenate the original features with the newly engineered feature
        temp = pd.concat([X, feature_df[feature]], axis="columns")

        # Train the model with the augmented features and get the mean and standard deviation of OOF scores
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            feature,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "feature",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
Code
%%script echo skipping

FE_try_ideas_df = FE_try_ideas(X_wo_outliers, y_wo_outliers)

FE_try_ideas_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas")}.parquet.gzip',
    compression="gzip",
)

FE_try_ideas_df

As can be seen below, the best feature this time was spaciousness_1, representing df.living_area divided by df.surface_of_the_plot.

Code
pd.read_parquet(Path.cwd().joinpath("data").joinpath("FE_try_ideas.parquet.gzip"))
mean_OOFs std_OOFs feature
4 0.109305 0.004652 spaciousness_1
7 0.109558 0.004372 bargain_2
3 0.109969 0.005117 total_number_rooms
6 0.109976 0.004303 bargain_1
2 0.110545 0.004884 total_bathrooms
5 0.110603 0.004715 spaciousness_2
1 0.110666 0.005749 energy_efficiency_2
0 0.110722 0.005120 energy_efficiency_1

Summary table of the tested conditions

The original model achieved a best mean out-of-fold score of 0.1107. To expedite training we then reduced the iterations to 100 and raised the learning rate to 0.2, giving a sped-up model with a score of 0.1125; after outlier removal this baseline improved to 0.1105. That value serves as our reference point for assessing the impact of the various feature engineering techniques.

Subsequent feature engineering approaches, including utilizing categorical columns for groupby and transformation, creating bins from continuous data, and implementing the other ideas, led to marginal improvements, with the best (lowest) score at 0.1089. Polynomial features with n=2 and n=1 gave slightly higher scores of 0.1094 and 0.1099, respectively.

Now, we will proceed to assess the efficacy of two of the best approaches, namely: utilizing categorical columns for groupby and transformation and implementing additional ideas. We will conduct this evaluation using CatBoost’s built-in select_features as outlined in part 3. Let’s dive in…

Condition Best mean OOFs std OOFs
Original 0.1107 NA
Use categorical columns for groupby/transform 0.1089 0.0062
Create bins from continuous data and use groupby/transform 0.1093 0.0050
Implementing the rest of the ideas 0.1093 0.0046
Polynomial features (n=2) 0.1094 0.0046
Polynomial features (n=1) 0.1099 0.0050
After Outlier filter 0.1105 0.0045
k-means clustering 0.1118 0.0051
Sped up version 0.1125 0.0044

Final feature selection

Code
def prepare_df_for_final_feature_selection(X):
    return X.assign(
        city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
            "median"
        ),
        building_condition_group=lambda df: df.groupby("building_condition")[
            "yearly_theoretical_total_energy_consumption"
        ].transform("median"),
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )


X_final_feature_selection = prepare_df_for_final_feature_selection(X_wo_outliers)
Code
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_final_feature_selection,
    y_wo_outliers,
    test_size=0.2,
    random_state=utils.Configuration.seed,
)
Code
regressor = catboost.CatBoostRegressor(
    iterations=1000,
    cat_features=X_final_feature_selection.select_dtypes("object").columns.to_list(),
    random_seed=utils.Configuration.seed,
    loss_function="RMSE",
)

rfe_dict = regressor.select_features(
    algorithm="RecursiveByShapValues",
    shap_calc_type="Exact",
    X=X_train,
    y=y_train,
    eval_set=(X_val, y_val),
    features_for_select="0-25",
    num_features_to_select=1,
    steps=20,
    verbose=2000,
    train_final_model=False,
    plot=False,
)
Learning rate set to 0.059655
Step #1 out of 20
0:  learn: 0.3071998    test: 0.3029816 best: 0.3029816 (0) total: 13.6ms   remaining: 13.6s
999:    learn: 0.0482570    test: 0.1076466 best: 0.1075005 (964)   total: 14.9s    remaining: 0us

bestTest = 0.1075004858
bestIteration = 964

Shrink model to first 965 iterations.
Feature #21 eliminated
Feature #20 eliminated
Feature #23 eliminated
Feature #2 eliminated
Step #2 out of 20
0:  learn: 0.3070684    test: 0.3029727 best: 0.3029727 (0) total: 14.1ms   remaining: 14.1s
999:    learn: 0.0497732    test: 0.1094414 best: 0.1092710 (833)   total: 14.1s    remaining: 0us

bestTest = 0.1092709663
bestIteration = 833

Shrink model to first 834 iterations.
Feature #4 eliminated
Feature #22 eliminated
Feature #24 eliminated
Step #3 out of 20
0:  learn: 0.3058962    test: 0.3017255 best: 0.3017255 (0) total: 14.2ms   remaining: 14.1s
999:    learn: 0.0515838    test: 0.1074155 best: 0.1074083 (998)   total: 13.9s    remaining: 0us

bestTest = 0.1074082848
bestIteration = 998

Shrink model to first 999 iterations.
Feature #12 eliminated
Feature #0 eliminated
Feature #7 eliminated
Step #4 out of 20
0:  learn: 0.3062525    test: 0.3017927 best: 0.3017927 (0) total: 14.5ms   remaining: 14.4s
999:    learn: 0.0549456    test: 0.1092524 best: 0.1092519 (996)   total: 12.8s    remaining: 0us

bestTest = 0.1092518693
bestIteration = 996

Shrink model to first 997 iterations.
Feature #9 eliminated
Feature #3 eliminated
Step #5 out of 20
0:  learn: 0.3069279    test: 0.3031549 best: 0.3031549 (0) total: 20.9ms   remaining: 20.9s
999:    learn: 0.0566761    test: 0.1092288 best: 0.1091824 (846)   total: 12.5s    remaining: 0us

bestTest = 0.1091824436
bestIteration = 846

Shrink model to first 847 iterations.
Feature #6 eliminated
Feature #11 eliminated
Step #6 out of 20
0:  learn: 0.3065169    test: 0.3023901 best: 0.3023901 (0) total: 16.8ms   remaining: 16.8s
999:    learn: 0.0567811    test: 0.1114134 best: 0.1113041 (981)   total: 14.9s    remaining: 0us

bestTest = 0.1113040714
bestIteration = 981

Shrink model to first 982 iterations.
Feature #1 eliminated
Feature #25 eliminated
Step #7 out of 20
0:  learn: 0.3063087    test: 0.3019867 best: 0.3019867 (0) total: 15.1ms   remaining: 15.1s
999:    learn: 0.0554371    test: 0.1123259 best: 0.1122143 (920)   total: 31s  remaining: 0us

bestTest = 0.1122143004
bestIteration = 920

Shrink model to first 921 iterations.
Feature #5 eliminated
Feature #8 eliminated
Step #8 out of 20
0:  learn: 0.3068859    test: 0.3026476 best: 0.3026476 (0) total: 3.82ms   remaining: 3.81s
999:    learn: 0.0574991    test: 0.1152245 best: 0.1152077 (992)   total: 2.1s remaining: 0us

bestTest = 0.1152077068
bestIteration = 992

Shrink model to first 993 iterations.
Feature #13 eliminated
Step #9 out of 20
0:  learn: 0.3068033    test: 0.3025327 best: 0.3025327 (0) total: 2.82ms   remaining: 2.81s
999:    learn: 0.0640641    test: 0.1201366 best: 0.1201130 (689)   total: 2.13s    remaining: 0us

bestTest = 0.1201130073
bestIteration = 689

Shrink model to first 690 iterations.
Feature #19 eliminated
Step #10 out of 20
0:  learn: 0.3069152    test: 0.3030754 best: 0.3030754 (0) total: 1.96ms   remaining: 1.96s
999:    learn: 0.0703171    test: 0.1242382 best: 0.1236966 (778)   total: 2.13s    remaining: 0us

bestTest = 0.1236965761
bestIteration = 778

Shrink model to first 779 iterations.
Feature #18 eliminated
Step #11 out of 20
0:  learn: 0.3073732    test: 0.3033657 best: 0.3033657 (0) total: 2.05ms   remaining: 2.05s
999:    learn: 0.0750814    test: 0.1247368 best: 0.1246789 (969)   total: 2.14s    remaining: 0us

bestTest = 0.1246789365
bestIteration = 969

Shrink model to first 970 iterations.
Feature #10 eliminated
Step #12 out of 20
0:  learn: 0.3073693    test: 0.3032811 best: 0.3032811 (0) total: 1.83ms   remaining: 1.83s
999:    learn: 0.0914355    test: 0.1353158 best: 0.1351822 (880)   total: 2.25s    remaining: 0us

bestTest = 0.1351822259
bestIteration = 880

Shrink model to first 881 iterations.
Step #13 out of 20
0:  learn: 0.3073693    test: 0.3032811 best: 0.3032811 (0) total: 1.74ms   remaining: 1.74s
999:    learn: 0.0914355    test: 0.1353158 best: 0.1351822 (880)   total: 2.4s remaining: 0us

bestTest = 0.1351822259
bestIteration = 880

Shrink model to first 881 iterations.
Feature #17 eliminated
Step #14 out of 20
0:  learn: 0.3071755    test: 0.3033561 best: 0.3033561 (0) total: 1.86ms   remaining: 1.86s
999:    learn: 0.1071168    test: 0.1503970 best: 0.1496295 (720)   total: 2.19s    remaining: 0us

bestTest = 0.1496294939
bestIteration = 720

Shrink model to first 721 iterations.
Step #15 out of 20
0:  learn: 0.3071755    test: 0.3033561 best: 0.3033561 (0) total: 2.74ms   remaining: 2.74s
999:    learn: 0.1071168    test: 0.1503970 best: 0.1496295 (720)   total: 2.27s    remaining: 0us

bestTest = 0.1496294939
bestIteration = 720

Shrink model to first 721 iterations.
Feature #14 eliminated
Step #16 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 6.46ms   remaining: 6.45s
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 2.15s    remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Step #17 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 2.92ms   remaining: 2.92s
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 2.2s remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Step #18 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 1.77ms   remaining: 1.77s
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 2.15s    remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Feature #16 eliminated
Step #19 out of 20
0:  learn: 0.3099541    test: 0.3058748 best: 0.3058748 (0) total: 4.3ms    remaining: 4.3s
999:    learn: 0.2075860    test: 0.2206064 best: 0.2148488 (114)   total: 2.06s    remaining: 0us

bestTest = 0.2148488447
bestIteration = 114

Shrink model to first 115 iterations.
Step #20 out of 20
0:  learn: 0.3099541    test: 0.3058748 best: 0.3058748 (0) total: 4.43ms   remaining: 4.43s
999:    learn: 0.2075860    test: 0.2206064 best: 0.2148488 (114)   total: 2.13s    remaining: 0us

bestTest = 0.2148488447
bestIteration = 114

Shrink model to first 115 iterations.

Based on our evaluation, it is recommended to retain the following features:

  • ‘city_group’
  • ‘building_condition_group’
  • ‘energy_efficiency_1’
  • ‘energy_efficiency_2’
  • ‘bargain_1’
  • ‘bargain_2’

However, we should remove the following two features since our analysis indicates that better features have been incorporated:

  • ‘kitchen_type’
  • ‘toilets’

These feature selections should help optimize our model even further.

Based on these insights, we crafted the prepare_data_for_modelling function, which is stored in the pre_process.py file. This function includes the feature engineering steps we discussed, setting the stage for effective modeling performance.

Code
def prepare_data_for_modelling(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Fills missing values in categorical variables with 'missing value'.
    4. Separates the features (X) and the target (y).
    5. Identifies and filters out outlier values based on LocalOutlierFactor.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
    - Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).

    Example use case:
    # Load your dataset into a DataFrame (e.g., df)
    df = load_data()

    # Prepare the data for modeling
    X, y = prepare_data_for_modelling(df)

    # Now you can use X and y for machine learning tasks.

    Args:
        df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
        Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).
    """

    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(
            price=lambda df: np.log10(df.price),
            city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
                "median"
            ),
            building_condition_group=lambda df: df.groupby("building_condition")[
                "yearly_theoretical_total_energy_consumption"
            ].transform("median"),
            energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
            / df.primary_energy_consumption,
            energy_efficiency_2=lambda df: df.primary_energy_consumption
            / df.living_area,
            bargain_1=lambda df: df.cadastral_income / df.bedrooms,
            bargain_2=lambda df: df.cadastral_income / df.living_area,
        )
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v2]
    y = processed_df[utils.Configuration.target_col]

    outlier_mask = pre_process.identify_outliers(X)

    X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
    y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)

    print(f"Shape of X and y with outliers: {X.shape}, {y.shape}")
    print(
        f"Shape of X and y without outliers: {X_wo_outliers.shape}, {y_wo_outliers.shape}"
    )

    return X_wo_outliers, y_wo_outliers
Code
X, y = pre_process.prepare_data_for_modelling(df)
Shape of X and y with outliers: (3660, 14), (3660,)
Shape of X and y without outliers: (3427, 14), (3427,)

In Part 4, we began with an initial selection of 16 features based on the work in Part 3. As we conclude this article, we’ve found that we can streamline the feature set even further by removing kitchen_type and toilets, which, together with the newly engineered features, improves performance. While there’s potential for further optimization, such as dimensionality reduction, we are happy with our progress for now. In the next, and final, part, we will focus on fine-tuning our model for optimal predictive performance. See you there!