Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

In Part 5, we looked at the significance of features in the initial scraped dataset using both the feature_importances_ method of CatBoostRegressor and SHAP values. We conducted feature elimination based on their importance and predictive capability.

In this upcoming phase, we’ll implement a robust cross-validation strategy to accurately and consistently evaluate our model’s performance across multiple folds of the data. We will also identify and address potential outliers within our dataset, which is crucial to prevent their undue influence on the model’s predictions.

Additionally, we’ll further refine and expand our feature engineering efforts by exploring new methodologies to create informative features that bolster our model’s predictive capabilities. Looking forward to these pivotal steps!

Note

You can access the project’s app through its Streamlit website.

Import data

Code

import gc
import itertools
from pathlib import Path
from typing import List, Optional, Tuple

import catboost
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from data import pre_process, utils
from features import feature_engineering
from IPython.display import clear_output
from lets_plot import *
from lets_plot.mapping import as_discrete
from models import train_model
from sklearn import (
    cluster,
    compose,
    ensemble,
    impute,
    metrics,
    model_selection,
    neighbors,
    pipeline,
    preprocessing,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm.notebook import tqdm

LetsPlot.setup_html()

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)

Prepare dataframe before modelling

Read in dataframe

Drawing on our findings from notebook 4, particularly our initial feature reduction efforts, we’ve developed a function named prepare_data. This function resides in the pre_process.py file, which makes it reusable. It performs the following preprocessing steps:

Randomly shuffling the rows in the DataFrame.
Transforming the ‘price’ column by taking the base 10 logarithm.
Handling missing values in categorical variables by replacing them with ‘missing value.’
Separating the dataset into features (X) and the target variable (y).

Let’s dive into the details of this function and prepare our X and y for the subsequent processing pipeline.

Code

df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)

Code

def prepare_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Fills missing values in categorical variables with 'missing value'.
    4. Separates the features (X) and the target (y).

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
    - Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).

    Example use case:
    # Load your dataset into a DataFrame (e.g., df)
    df = load_data()

    # Prepare the data for modeling
    X, y = prepare_data(df)

    # Now you can use X and y for machine learning tasks.
    """

    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(price=lambda df: np.log10(df.price))
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v1]
    y = processed_df[utils.Configuration.target_col]

    print(f"Shape of X and y: {X.shape}, {y.shape}")

    return X, y

Code

X, y = prepare_data(df)

Shape of X and y: (3660, 16), (3660,)

Cross-validation strategy

Our next critical step is to establish a well-structured cross-validation strategy. This step is imperative as it enables us to assess the effectiveness of various feature engineering approaches without risking overfitting our model. A robust cross-validation strategy ensures that our model’s performance evaluations are reliable and that the insights gained are generalizable to new data. To accomplish this, we will employ RepeatedKFold validation, setting the parameters with n_splits as 10 and n_repeats as 1.

In essence, this configuration performs 10-fold cross-validation a single time (n_repeats = 1 means the 10 folds are run once, with no additional repetitions). Because the function is modular, we retain the flexibility to change the design of our cross-validation strategy as needed.
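As a quick sanity check on what this configuration produces, the sketch below counts the folds that RepeatedKFold generates. It uses a toy array of 100 rows and an assumed seed of 42 (the project reads its seed from utils.Configuration.seed), so treat it as an illustration rather than part of the project's code.

```python
import numpy as np
from sklearn import model_selection

# Toy feature matrix: 100 rows, 2 columns (placeholder data, not the real dataset)
X_toy = np.arange(200).reshape(100, 2)

# Same configuration as in the text; the seed value 42 is an assumption
cv = model_selection.RepeatedKFold(n_splits=10, n_repeats=1, random_state=42)

folds = list(cv.split(X_toy))
n_folds = len(folds)  # n_splits * n_repeats train/validation pairs

# Collect all validation indices: each row appears in exactly one validation fold per repeat
val_rows = np.sort(np.concatenate([val_idx for _, val_idx in folds]))

print(f"{n_folds} folds, validation fold sizes: {[len(v) for _, v in folds]}")
```

Raising n_repeats to 2 would yield 20 train/validation pairs, with a fresh shuffle for the second pass.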

Code

def run_catboost_CV(
    X: pd.DataFrame,
    y: pd.Series,
    n_splits: int = 10,
    n_repeats: int = 1,
    pipeline: Optional[object] = None,
) -> Tuple[float, float]:
    """
    Perform Cross-Validation with CatBoost for regression.

    This function conducts Cross-Validation using CatBoost for regression tasks. It iterates
    through folds, trains CatBoost models, and computes the mean and standard deviation of the
    Root Mean Squared Error (RMSE) scores across folds.

    Parameters:
    - X (pd.DataFrame): The feature matrix.
    - y (pd.Series): The target variable.
    - n_splits (int, optional): The number of splits in K-Fold cross-validation.
      Defaults to 10.
    - n_repeats (int, optional): The number of times the K-Fold cross-validation is repeated.
      Defaults to 1.
    - pipeline (object, optional): Optional data preprocessing pipeline. If provided,
      it's applied to the data before training the model. Defaults to None.

    Returns:
    - Tuple[float, float]: A tuple containing the mean RMSE and standard deviation of RMSE
      scores across cross-validation folds.

    Example:
    # Load your feature matrix (X) and target variable (y)
    X, y = load_data()

    # Perform Cross-Validation with CatBoost
    mean_rmse, std_rmse = run_catboost_CV(X, y, n_splits=5, n_repeats=2, pipeline=data_pipeline)

    print(f"Mean RMSE: {mean_rmse:.4f}")
    print(f"Standard Deviation of RMSE: {std_rmse:.4f}")

    Notes:
    - Ensure that the input data `X` and `y` are properly preprocessed and do not contain any
      missing values.
    - The function uses CatBoost for regression with optional data preprocessing via the `pipeline`.
    - RMSE is a common metric for regression tasks, and lower values indicate better model
      performance.
    """
    results = []

    # Extract feature names and data types
    features = X.columns[~X.columns.str.contains("price")]
    numerical_features = X.select_dtypes("number").columns.to_list()
    categorical_features = X.select_dtypes("object").columns.to_list()

    # Create a K-Fold cross-validator
    CV = model_selection.RepeatedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=utils.Configuration.seed
    )

    for train_fold_index, val_fold_index in tqdm(CV.split(X)):
        X_train_fold, X_val_fold = X.loc[train_fold_index], X.loc[val_fold_index]
        y_train_fold, y_val_fold = y.loc[train_fold_index], y.loc[val_fold_index]

        # Apply optional data preprocessing pipeline
        if pipeline is not None:
            X_train_fold = pipeline.fit_transform(X_train_fold)
            X_val_fold = pipeline.transform(X_val_fold)

        # Create CatBoost datasets (Pool is accessed via the imported catboost module)
        catboost_train = catboost.Pool(
            X_train_fold,
            y_train_fold,
            cat_features=categorical_features,
        )
        catboost_valid = catboost.Pool(
            X_val_fold,
            y_val_fold,
            cat_features=categorical_features,
        )

        # Initialize and train the CatBoost model
        model = catboost.CatBoostRegressor(**utils.Configuration.catboost_params)
        model.fit(
            catboost_train,
            eval_set=[catboost_valid],
            early_stopping_rounds=utils.Configuration.early_stopping_round,
            verbose=utils.Configuration.verbose,
            use_best_model=True,
        )

        # Calculate OOF validation predictions
        valid_pred = model.predict(X_val_fold)

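        # Take the square root explicitly so this also runs on scikit-learn
        # versions that no longer accept the `squared` argument
        RMSE_score = np.sqrt(metrics.mean_squared_error(y_val_fold, valid_pred))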

        results.append(RMSE_score)

    return np.mean(results), np.std(results)

Now, let’s proceed to train our model with the updated settings:

Code

train_model.run_catboost_CV(X, y)

(0.11251233080551612, 0.004459362099695207)

Note

Note that we’ve reduced the number of iterations in Notebook 6 compared to Notebook 5 to minimize the training duration.
In Notebook 5: iterations = 1000, default learning rate = 0.03
In Notebook 6: iterations = 100, learning rate = 0.2

The performance of the baseline model: 0.1125
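Because the target is log10(price), this RMSE is measured in log space. One rough way to read it, assuming errors are roughly symmetric in log space, is as a multiplicative factor on the original price scale. The helper name below is hypothetical and only serves the illustration.

```python
def log10_rmse_to_factor(rmse_log10: float) -> float:
    """Convert an RMSE measured on log10(price) into a multiplicative error factor."""
    return 10.0 ** rmse_log10


baseline_rmse = 0.1125  # baseline score reported above
factor = log10_rmse_to_factor(baseline_rmse)

# A factor of about 1.30 means a typical prediction is off by roughly
# 30% above (or about 23% below) the true price.
print(f"Typical multiplicative error: x{factor:.3f}")
```

This reading is approximate, but it gives small changes in the third decimal of the RMSE a more tangible meaning.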

Outlier detection

An outlier is a data point that significantly differs from the rest of the data. One common way to define an outlier is a data point that falls more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. Detecting and removing outliers from the dataset is crucial for building a stable model that can effectively generalize to new data.
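The 1.5 × IQR rule mentioned above can be sketched in a few lines. The function name and the toy values are made up for illustration; the project itself relies on a different technique, described further down.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Toy data: nine ordinary values and one extreme one
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mask = iqr_outlier_mask(sample)

print(np.asarray(sample)[mask])
```

This rule looks at one variable at a time, so it cannot catch points that are only unusual in combination, such as a small living area paired with a very high cadastral income.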

When we create a scatter plot of our features (as shown in Figure 1), such as cadastral income against living area, and adjust the points’ color and size based on price, we can identify at least two data points that notably deviate from the expected range of values. One data point suggests that a 300 m2 property is associated with a cadastral income exceeding 320,000 EUR, while the other indicates a cadastral income of 2,500 EUR for an 11,000 m2 property. Both observations seem implausible when compared to the majority of data points on the graph.

Code

pd.concat([X, y], axis=1).pipe(
    lambda df: ggplot(
        df, aes("cadastral_income", "living_area", fill="price", size="price")
    )
    + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
    + scale_fill_continuous(low="#1a9641", high="#d7191c")
    + labs(
        title="Assessing Potential Outliers",
        subtitle=""" Outliers pose a challenge for gradient boosting methods since boosting constructs each tree based on the errors of the previous trees. 
        Outliers, having significantly larger errors than non-outliers, can excessively divert the model's attention toward these data points.
            """,
        x="Cadastral income (EUR)",
        y="Living area (m2)",
        caption="https://www.immoweb.be/",
    )
    + theme(
        plot_subtitle=element_text(
            size=12, face="italic"
        ),  # Customize subtitle appearance
        plot_title=element_text(size=15, face="bold"),  # Customize title appearance
    )
    + ggsize(800, 600)
)

Figure 1: Assessing Potential Outliers: Cadastral Income versus Living Area

For identifying potential outliers within our data, we can employ Scikit-learn’s LocalOutlierFactor. This algorithm, known as the Local Outlier Factor (LOF), is an unsupervised technique for anomaly detection. It assesses the local density deviation of a data point relative to its neighboring points. LOF identifies outliers as those data points demonstrating notably lower density in comparison to their neighbors.

In the provided code, we’ve created a function called identify_outliers. This function generates a mask that we can use to filter out data points potentially flagged as outliers.

Code

def identify_outliers(df: pd.DataFrame) -> pd.Series:
    """
    Identify outliers in a DataFrame.

    This function uses a Local Outlier Factor (LOF) algorithm to identify outliers in a given
    DataFrame. It operates on both numerical and categorical features, and it returns a Boolean
    Series where `True` marks a non-outlier (a row to keep) and `False` marks an outlier.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing features for outlier identification.

    Returns:
    - pd.Series: A Boolean Series that is True for non-outliers and False for outliers.

    Example:
    # Load your DataFrame with features (df)
    df = load_data()

    # Identify outliers using the function
    outlier_mask = identify_outliers(df)

    # Use the mask to filter your DataFrame
    filtered_df = df[outlier_mask]  # Keep non-outliers

    Notes:
    - The function uses Local Outlier Factor (LOF) with default parameters for identifying outliers.
    - Numerical features are imputed using median values, and categorical features are one-hot encoded
      and imputed with median values.
    - The resulting Boolean Series is `True` for non-outliers and `False` for outliers.
    """
    # Extract numerical and categorical feature names
    NUMERICAL_FEATURES = df.select_dtypes("number").columns.tolist()
    CATEGORICAL_FEATURES = df.select_dtypes("object").columns.tolist()

    # Define transformers for preprocessing
    numeric_transformer = pipeline.Pipeline(
        steps=[("imputer", impute.SimpleImputer(strategy="median"))]
    )

    categorical_transformer = pipeline.Pipeline(
        steps=[
            ("encoder", preprocessing.OneHotEncoder(handle_unknown="ignore")),
            ("imputer", impute.SimpleImputer(strategy="median")),
        ]
    )

    # Create a ColumnTransformer to handle both numerical and categorical features
    preprocessor = compose.ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUMERICAL_FEATURES),
            ("cat", categorical_transformer, CATEGORICAL_FEATURES),
        ]
    )

    # Initialize the LocalOutlierFactor model
    clf = neighbors.LocalOutlierFactor()

    # Fit LOF to preprocessed data and make predictions
    y_pred = clf.fit_predict(preprocessor.fit_transform(df))

    # LOF labels inliers as 1 and outliers as -1; the mask is True for rows to keep
    outlier_mask = pd.Series(y_pred == 1)

    return outlier_mask
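To make the label convention concrete, here is a minimal sketch on synthetic data. The points are randomly generated and the far-away point is planted on purpose, so this illustrates how fit_predict behaves rather than reproducing the project's results.

```python
import numpy as np
from sklearn import neighbors

rng = np.random.default_rng(0)

# Toy data: 50 points in a tight cluster plus one point far away
cluster_points = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
far_point = np.array([[50.0, 50.0]])
points = np.vstack([cluster_points, far_point])

# fit_predict labels inliers as 1 and outliers as -1
labels = neighbors.LocalOutlierFactor().fit_predict(points)

# Keep-mask: True for rows that LOF considers inliers
keep_mask = labels == 1
filtered = points[keep_mask]

print(f"Far point label: {labels[-1]}, rows kept: {keep_mask.sum()} of {len(points)}")
```

With the default settings, LOF may also flag a few points on the edge of the cluster, which is why the number of removed rows is usually larger than the number of obviously implausible ones.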

As a comparison, here’s the scatter plot after removing outliers. It appears that the LocalOutlierFactor method was effective in addressing the outlier data points.

Code

outlier_mask = pre_process.identify_outliers(X)

X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)


(
    pd.concat([X_wo_outliers, y_wo_outliers], axis=1)
    # .loc[lambda df: pre_process.identify_outliers(df.loc[:, :"living_area"])]
    .pipe(
        lambda df: ggplot(
            df, aes("cadastral_income", "living_area", fill="price", size="price")
        )
        + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
        + scale_fill_continuous(low="#1a9641", high="#d7191c")
        + labs(
            title="Assessing Potential Outliers",
            subtitle=""" By employing the default parameters of LocalOutlierFactor, we've reduced our training set from 3660 instances to 3427.
            This is expected to enhance our model's performance and its ability to generalize well to new data.
            """,
            x="Cadastral income (EUR)",
            y="Living area (m2)",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 600)
    )
)

Now, let’s assess whether our efforts to improve the model by addressing outliers have enhanced its predictive capabilities:

Code

train_model.run_catboost_CV(X_wo_outliers, y_wo_outliers)

(0.11052917780860605, 0.004569457889717371)

Note

By removing the outliers, our cross-validation RMSE score decreased from 0.1125 to 0.1105.

Feature Engineering

Feature engineering is vital in machine learning as it directly influences a model’s performance and predictive capabilities. By crafting and selecting pertinent features, it allows the model to capture meaningful patterns and relationships within the data. Effective feature engineering helps improve model accuracy, enhances its ability to generalize to new data, and enables the extraction of valuable insights, ultimately driving the success and efficacy of machine learning algorithms.

Feature engineering ideas we will test in this section:

Utilize categorical columns for grouping and transform each numerical variable based on the median.
Generate bins from the continuous variables and apply the same process as described above.
Introduce polynomial features, either individually with a single feature or in combinations of two features.
Form clusters of instances using k-means clustering to capture data similarities and use these clusters as additional features.
Implement other ideas derived from empirical observations or assumptions.

Utilize categorical columns for grouping and transform each numerical variable based on the median

The idea behind this feature engineering step is to leverage categorical columns as grouping criteria and then calculate the median value for each numerical variable within each group. By doing so, it aims to create new features that capture the central tendency of the numerical data for different categories, allowing the model to better understand and utilize the inherent patterns and variations within the data.
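A minimal sketch of this group-wise transformation, using a made-up toy DataFrame (the city names and incomes are placeholders, not values from the dataset):

```python
import pandas as pd

# Toy data: two cities with a few cadastral incomes each
toy = pd.DataFrame(
    {
        "city": ["Gent", "Gent", "Gent", "Leuven", "Leuven"],
        "cadastral_income": [800.0, 1000.0, 1500.0, 600.0, 900.0],
    }
)

# transform() returns one value per row, aligned with the original index,
# so the group statistic can be attached directly as a new column
toy["city_median_income"] = toy.groupby("city")["cadastral_income"].transform("median")

print(toy)
```

Every row now carries the median of its own group, which gives the model a direct signal about how a property compares with others in the same category.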

Code

# Number of unique categories per categorical variables:

X_wo_outliers.select_dtypes("object").nunique()

state                   9
kitchen_type            9
street                456
building_condition      7
city                  230
dtype: int64

Code

def FE_categorical_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform categorical features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming categorical features using CatBoost
    Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
    Squared Error (RMSE) scores for various combinations of categorical and numerical features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both categorical and numerical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The aggregation passed to pandas' groupby transform,
      such as "mean" or "median". Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "categorical," and "numerical,"
      sorted by "mean_OOFs" in ascending order.

    Example:
    # Load your DataFrame with features (X) and target variable (y)
    X, y = load_data()

    # Perform feature engineering
    result_df = FE_categorical_transform(X, y, transform_type="mean")

    # View the DataFrame with sorted results
    print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of transformations for
      various combinations of categorical and numerical features.
    - The resulting DataFrame provides insights into the effectiveness of different transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of categorical and numerical columns
    categorical_columns = X.select_dtypes("object").columns
    numerical_columns = X.select_dtypes("number").columns

    # Evaluate every categorical/numerical pair (nested progress bars)
    for categorical in tqdm(categorical_columns, desc="Progress"):
        for numerical in tqdm(numerical_columns):
            # Create a deep copy of the input data
            temp = X.copy(deep=True)

            # Calculate the transformation for each group within the categorical column
            temp["new_column"] = temp.groupby(categorical)[numerical].transform(
                transform_type
            )

            # Run CatBoost Cross-Validation with the transformed data
            mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

            # Store the results as a tuple
            result = (mean_OOF, std_OOF, categorical, numerical)
            results.append(result)

            del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results, columns=["mean_OOFs", "std_OOFs", "categorical", "numerical"]
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

Note

Please bear in mind that these feature engineering steps were precomputed because of the considerable computation time required. The outcomes were saved and are loaded from disk rather than recomputed during notebook rendering. The results should nevertheless remain unchanged.

Code

%%script echo skipping

FE_categorical_transform_mean = feature_engineering.FE_categorical_transform(
    X_wo_outliers, y_wo_outliers
)


Code

%%script echo skipping

FE_categorical_transform_mean.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_categorical_transform_mean")}.parquet.gzip',
    compression="gzip",
)

FE_categorical_transform_mean.head(15)


As the table below shows, the best result was obtained by grouping on the city feature and aggregating cadastral_income within each group (this run used the function’s default transform_type, “mean”). This result is consistent with the feature importances seen in Part 5.

Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "FE_categorical_transform_mean.parquet.gzip"
    )
).head()

	mean_OOFs	std_OOFs	categorical	numerical
53	0.108973	0.006262	city	cadastral_income
39	0.108985	0.004980	building_condition	yearly_theoretical_total_energy_consumption
33	0.109381	0.005434	building_condition	bedrooms
31	0.109478	0.004887	street	cadastral_income
43	0.109540	0.004944	building_condition	living_area

Generate bins from the continuous variables

The idea behind this feature engineering step is to discretize continuous variables by creating bins or categories from their values. These bins then serve as categorical columns. By using these new categorical columns for grouping, we can transform each numerical variable by replacing its values with the median of the respective category it belongs to, just like the feature engineering method we demonstrated above.
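The two ingredients of this step, choosing a bin count and aggregating within bins, can be sketched as follows. The toy data is made up, and pd.qcut is used here as a lightweight stand-in for the KBinsDiscretizer used in the project's function.

```python
import numpy as np
import pandas as pd


def sturges_bins(n_rows: int) -> int:
    """Number of bins computed as floor(log2(n)) + 1."""
    return int(np.floor(np.log2(n_rows)) + 1)


# Toy data: living area is binned, then bedrooms is averaged per bin
toy = pd.DataFrame(
    {
        "living_area": [10, 20, 30, 40, 50, 60, 70, 80],
        "bedrooms": [1, 1, 2, 2, 3, 3, 4, 5],
    }
)

# Two quantile bins keep the toy example readable
toy["living_area_bin"] = pd.qcut(toy["living_area"], q=2, labels=False)
toy["bedrooms_per_bin"] = toy.groupby("living_area_bin")["bedrooms"].transform("mean")

print(sturges_bins(3427), toy["bedrooms_per_bin"].tolist())
```

For the 3427 rows that remain after outlier removal, this formula gives 12 bins.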

Code

def FE_continuous_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform continuous features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming continuous features using CatBoost
    Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean
    Squared Error (RMSE) scores for various combinations of discretized and transformed continuous
    features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both continuous and categorical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The aggregation passed to pandas' groupby transform,
      such as "mean" or "median". Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "discretized_continuous,"
      and "transformed_continuous," sorted by "mean_OOFs" in ascending order.

    Example:
    # Load your DataFrame with features (X) and target variable (y)
    X, y = load_data()

    # Perform feature engineering
    result_df = FE_continuous_transform(X, y, transform_type="mean")

    # View the DataFrame with sorted results
    print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of transformations for
      various combinations of discretized and transformed continuous features.
    - The number of bins for discretization is determined using Sturges' rule.
    - The resulting DataFrame provides insights into the effectiveness of different transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    continuous_columns = X.select_dtypes("number").columns
    optimal_bins = int(np.floor(np.log2(X.shape[0])) + 1)

    # Combine the loops to have a single progress bar
    for discretized_continuous in tqdm(continuous_columns, desc="Progress:"):
        for transformed_continuous in tqdm(continuous_columns):
            if discretized_continuous != transformed_continuous:
                # Create a deep copy of the input data
                temp = X.copy(deep=True)

                discretizer = pipeline.Pipeline(
                    steps=[
                        ("imputer", impute.SimpleImputer(strategy="median")),
                        (
                            "add_bins",
                            preprocessing.KBinsDiscretizer(
                                encode="ordinal", n_bins=optimal_bins
                            ),
                        ),
                    ]
                )

                temp[discretized_continuous] = discretizer.fit_transform(
                    X[[discretized_continuous]]
                )

                # Calculate the transformation for each group within the categorical column
                temp["new_column"] = temp.groupby(discretized_continuous)[
                    transformed_continuous
                ].transform(transform_type)

                # Run CatBoost Cross-Validation with the transformed data
                mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

                # Store the results as a tuple
                result = (
                    mean_OOF,
                    std_OOF,
                    discretized_continuous,
                    transformed_continuous,
                )
                results.append(result)

                del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "discretized_continuous",
            "transformed_continuous",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

Code

%%script echo skipping

FE_continuous_transform_mean = feature_engineering.FE_continuous_transform(
    X_wo_outliers, y_wo_outliers
)

FE_continuous_transform_mean.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_continuous_transform_mean")}.parquet.gzip',
    compression="gzip",
)

FE_continuous_transform_mean.head(15)


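This approach was not as effective as our prior method. The best outcome came from discretizing bathrooms. Note that every bathrooms pairing shown in the table below, including the one with yearly_theoretical_total_energy_consumption, produced the same score.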

Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "FE_continuous_transform_mean.parquet.gzip"
    )
).head(10)

	mean_OOFs	std_OOFs	discretized_continuous	transformed_continuous
55	0.109328	0.005094	bathrooms	yearly_theoretical_total_energy_consumption
59	0.109328	0.005094	bathrooms	living_area
58	0.109328	0.005094	bathrooms	cadastral_income
57	0.109328	0.005094	bathrooms	lat
56	0.109328	0.005094	bathrooms	surface_of_the_plot
50	0.109328	0.005094	bathrooms	bedrooms
52	0.109328	0.005094	bathrooms	toilets
39	0.109417	0.004831	lng	living_area
1	0.109426	0.004587	bedrooms	toilets
4	0.109426	0.004587	bedrooms	bathrooms

Introduce polynomial features

The idea behind introducing polynomial features is to capture non-linear relationships within the data. By raising individual features to higher powers or considering interactions between pairs of features, this step allows the model to better represent complex patterns that cannot be adequately expressed with linear relationships alone. It enhances the model’s ability to learn and predict outcomes that exhibit curvilinear or interactive behavior.

Code

def FE_polynomial_features(
    X: pd.DataFrame, y: pd.Series, combinations: int = 1
) -> pd.DataFrame:
    """
    Generate polynomial features for combinations of numerical columns and train a CatBoost model.

    Parameters:
        X (pd.DataFrame): The input DataFrame with features.
        y (pd.Series): The target variable.
        combinations (int, optional): The number of combinations of numerical columns. Default is 1.

    Returns:
        pd.DataFrame: A DataFrame containing results sorted by mean OOF scores.

    Example:
        X_wo_outliers = pd.DataFrame(...)  # Your input data
        y_wo_outliers = pd.Series(...)  # Your target variable
        result = FE_polynomial_features(X_wo_outliers, y_wo_outliers)

    Transformations:
        - Imputes missing values in numerical columns using the median.
        - Generates polynomial features, including interaction terms, for selected numerical columns.
        - Trains a CatBoost model and calculates mean and standard deviation of out-of-fold (OOF) scores.
    """

    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Combine the loops to have a single progress bar
    for numerical_col in tqdm(
        list(itertools.combinations(numerical_columns, r=combinations))
    ):
        polyfeatures = compose.make_column_transformer(
            (
                pipeline.make_pipeline(
                    impute.SimpleImputer(strategy="median"),
                    preprocessing.PolynomialFeatures(interaction_only=False),
                ),
                list(numerical_col),
            ),
            remainder="passthrough",
        ).set_output(transform="pandas")

        temp = polyfeatures.fit_transform(X)
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            numerical_col,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "numerical_col",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

n=1

Let’s see the impact of applying polynomial feature engineering to a single feature.

Code

%%script echo skipping

FE_polynomial_features_combinations_1 = FE_polynomial_features(
    X_wo_outliers, y_wo_outliers
)

FE_polynomial_features_combinations_1.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_1")}.parquet.gzip',
    compression="gzip",
)

FE_polynomial_features_combinations_1.head(15)


Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "FE_polynomial_features_combinations_1.parquet.gzip"
    )
).head(10)

	mean_OOFs	std_OOFs	numerical_col
10	0.109938	0.005047	[living_area]
9	0.110339	0.004012	[cadastral_income]
3	0.110628	0.004018	[lng]
0	0.111066	0.004765	[bedrooms]
8	0.111099	0.005039	[lat]
4	0.111166	0.004879	[primary_energy_consumption]
6	0.111271	0.004908	[yearly_theoretical_total_energy_consumption]
1	0.111276	0.005359	[number_of_frontages]
7	0.111332	0.004815	[surface_of_the_plot]
2	0.111782	0.004741	[toilets]

n=2

How about two features combined…

Code

%%script echo skipping

FE_polynomial_features_combinations_2 = FE_polynomial_features(
    X_wo_outliers, y_wo_outliers, combinations=2
)

FE_polynomial_features_combinations_2.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_2")}.parquet.gzip',
    compression="gzip",
)

FE_polynomial_features_combinations_2.head(15)

Couldn't find program: 'echo'

Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "FE_polynomial_features_combinations_2.parquet.gzip"
    )
).head(10)

	mean_OOFs	std_OOFs	numerical_col
31	0.109413	0.004690	[lng, lat]
9	0.109625	0.005558	[bedrooms, living_area]
53	0.109809	0.005321	[lat, living_area]
52	0.109814	0.003485	[lat, cadastral_income]
7	0.109847	0.005399	[bedrooms, lat]
19	0.109962	0.004999	[toilets, lng]
17	0.110057	0.004554	[number_of_frontages, cadastral_income]
42	0.110124	0.004511	[bathrooms, lat]
46	0.110128	0.004944	[yearly_theoretical_total_energy_consumption, ...
28	0.110154	0.004644	[lng, bathrooms]

Form clusters of instances using k-means clustering

The idea behind using k-means clustering in feature engineering is to group data points into clusters based on their similarity. By doing so, we create a new set of features that represent these clusters, which can capture patterns or relationships within the data that might be less apparent in the original features. These cluster features can be valuable for machine learning models, as they provide a more compact and informative representation of the data, potentially improving predictive performance.

Code

class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)

Code

def FE_KMeans(
    X: pd.DataFrame,
    y: pd.Series,
    n_clusters_min: int = 1,
    n_clusters_max: int = 8,
) -> pd.DataFrame:
    """Performs K-Means clustering-based feature engineering followed by model training.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.
        n_clusters_min (int, optional): The minimum number of clusters to consider. Defaults to 1.
        n_clusters_max (int, optional): The upper bound (exclusive) on the number of clusters to consider. Defaults to 8.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering with K-Means clustering.

    Example:
        >>> results_df = FE_KMeans(X_wo_outliers, y_wo_outliers)
    """
    # Initialize a list to store results
    results = []

    # Get lists of numerical and categorical columns
    numerical_columns = X.head().select_dtypes("number").columns.to_list()
    categorical_columns = X.head().select_dtypes("object").columns.to_list()

    for n_cluster in tqdm(range(n_clusters_min, n_clusters_max)):
        # Prepare pipelines for corresponding columns:
        numerical_pipeline = pipeline.Pipeline(
            steps=[
                ("num_selector", FeatureSelector(numerical_columns)),
                ("imputer", impute.SimpleImputer(strategy="median")),
            ]
        )

        categorical_pipeline = pipeline.Pipeline(
            steps=[
                ("cat_selector", FeatureSelector(categorical_columns)),
                ("imputer", impute.SimpleImputer(strategy="most_frequent")),
                (
                    "onehot",
                    preprocessing.OneHotEncoder(
                        handle_unknown="ignore", sparse_output=False
                    ),
                ),
            ]
        )

        # Put all the pipelines inside a FeatureUnion:
        data_preprocessing_pipeline = pipeline.FeatureUnion(
            n_jobs=-1,
            transformer_list=[
                ("numerical_pipeline", numerical_pipeline),
                ("categorical_pipeline", categorical_pipeline),
            ],
        )

        temp = pd.DataFrame(data_preprocessing_pipeline.fit_transform(X))

        KMeans = cluster.KMeans(n_init=10, n_clusters=n_cluster)
        KMeans.fit_transform(temp)

        groups = pd.Series(KMeans.labels_, name="groups")

        concatanated_df = pd.concat([temp, groups], axis="columns")

        mean_OOF, std_OOF = train_model.run_catboost_CV(X=concatanated_df, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            n_cluster,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF, KMeans, groups, concatanated_df, result

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "n_cluster",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

Code

%%script echo skipping

FE_KMeans_df = FE_KMeans(
    X_wo_outliers,
    y_wo_outliers,
    n_clusters_min=1,
    n_clusters_max=101,
)

FE_KMeans_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KNN_df")}.parquet.gzip',
    compression="gzip",
)

FE_KMeans_df.head(15)

Couldn't find program: 'echo'

As k-means clustering is an unsupervised algorithm, finding a suitable k requires testing a range of values and checking their impact on our validation scores. As shown below, this approach didn’t help in our case: the best validation score was obtained with n_cluster=1, where every instance falls into the same cluster and the new feature therefore carries no information.

Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KNN_df.parquet.gzip")
).head(10)

	mean_OOFs	std_OOFs	n_cluster
0	0.111884	0.005143	1
1	0.112153	0.006174	2
5	0.112264	0.005858	6
24	0.112313	0.005602	25
7	0.112326	0.005124	8
43	0.112333	0.005819	44
31	0.112352	0.005588	32
11	0.112366	0.005206	12
95	0.112454	0.006621	96
26	0.112472	0.006096	27
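
One way to read the n_cluster=1 result is that a single cluster assigns the same label to every row, so the added column is constant. The sketch below illustrates this on random, made-up data (not our dataset); the variable names are only for illustration:

```python
import numpy as np
from sklearn import cluster

# Made-up data: 20 rows, 3 numerical features
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 3))

# With one cluster, every row receives the same label
labels_k1 = cluster.KMeans(n_init=10, n_clusters=1, random_state=0).fit_predict(toy)

# With more clusters, the label column actually varies between rows
labels_k3 = cluster.KMeans(n_init=10, n_clusters=3, random_state=0).fit_predict(toy)

print(np.unique(labels_k1))
print(np.unique(labels_k3))
```

In other words, the top row of the table is effectively the model without any cluster information.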

Implement other ideas derived from empirical observations or assumptions

Though new features can be generated through systematic methods, domain knowledge can also inspire their creation. The idea behind this is to allow for the incorporation of unconventional or domain-specific insights that may not fit standard feature engineering techniques. It encourages the exploration of novel features or transformations based on practical experiences or theoretical assumptions to potentially uncover hidden patterns or relationships within the data. This open-ended approach can lead to creative and tailored feature engineering solutions.

Here are some ideas to consider:

Geospatial Features:
- Create clusters or neighborhoods based on features to capture similarities.
Area-related Features:
- Calculate the ratio of “living_area” to “surface_of_the_plot” to get an idea of the density or spaciousness of the property.
Energy Efficiency Features:
- Compute the energy efficiency ratio by dividing “yearly_theoretical_total_energy_consumption” by “primary_energy_consumption.”
- Compute energy efficiency by dividing primary_energy_consumption by living_area
Toilet and Bathroom Features:
- Combine “toilets” and “bathrooms” into a single “total_bathrooms” feature to simplify the model.
- Calculate total number of rooms by adding up bedrooms + toilets + bathrooms
Taxation Features:
- Incorporate “cadastral_income” as a measure of property value for taxation. You can create bins or categories for this variable.
Value for Money:
- Divide cadastral_income by bedrooms to see if the property is a good bargain
- Similarly, divide cadastral_income by living_area

Code

def FE_ideas(X):
    """Performs additional feature engineering on the input DataFrame.

    Args:
        X (pd.DataFrame): The input DataFrame containing the original features.

    Returns:
        pd.DataFrame: A DataFrame with additional engineered features.

    Example:
        >>> engineered_data = FE_ideas(original_data)
    """
    temp = X.assign(
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )
    return temp.loc[:, "energy_efficiency_1":]
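
A caveat with ratio features like these is that a zero in the denominator produces an infinite value in pandas rather than an error. Whether zeros actually occur in our columns is not checked here; the following is a minimal sketch on made-up numbers showing the behaviour and one possible way to treat such values as missing:

```python
import numpy as np
import pandas as pd

# Made-up values; the second listing has zero bedrooms
toy = pd.DataFrame(
    {"cadastral_income": [1000.0, 800.0, 1200.0], "bedrooms": [2, 0, 3]}
)

# Same formula as bargain_1: division by zero silently yields inf
bargain_1 = toy.cadastral_income / toy.bedrooms
print(bargain_1.tolist())

# One option (an assumption, not part of the original pipeline):
# replace infinite ratios with NaN so they are handled as missing values
bargain_1_clean = bargain_1.replace([np.inf, -np.inf], np.nan)
print(bargain_1_clean.tolist())
```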

Code

def FE_try_ideas(
    X: pd.DataFrame,
    y: pd.Series,
) -> pd.DataFrame:
    """Performs feature engineering experiments by adding new features and evaluating their impact on model performance.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering experiments.

    Example:
        >>> results_df = FE_try_ideas(X, y)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Apply additional feature engineering ideas
    feature_df = FE_ideas(X)

    for feature in tqdm(feature_df.columns):
        # Concatenate the original features with the newly engineered feature
        temp = pd.concat([X, feature_df[feature]], axis="columns")

        # Train the model with the augmented features and get the mean and standard deviation of OOF scores
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            feature,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "feature",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df

Code

%%script echo skipping

# Use a separate name for the result so the FE_try_ideas function is not overwritten
FE_try_ideas_df = FE_try_ideas(X_wo_outliers, y_wo_outliers)

FE_try_ideas_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas")}.parquet.gzip',
    compression="gzip",
)

FE_try_ideas_df

Couldn't find program: 'echo'

As can be seen below, the best feature this time was spaciousness_1, representing df.living_area divided by df.surface_of_the_plot.

Code

pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas.parquet.gzip")
)

	mean_OOFs	std_OOFs	feature
4	0.109305	0.004652	spaciousness_1
7	0.109558	0.004372	bargain_2
3	0.109969	0.005117	total_number_rooms
6	0.109976	0.004303	bargain_1
2	0.110545	0.004884	total_bathrooms
5	0.110603	0.004715	spaciousness_2
1	0.110666	0.005749	energy_efficiency_2
0	0.110722	0.005120	energy_efficiency_1

Summary table of the tested conditions

The initial model achieved a best mean out-of-fold (OOF) score of 0.1107. To expedite training, we reduced the number of iterations to 100 and increased the learning rate to 0.2, which gave a score of 0.1125; after outlier removal, this faster model scored 0.1105. This is our reference point for assessing the impact of the various feature engineering techniques.

Several feature engineering approaches, including using categorical columns for groupby and transformation, creating bins from continuous data, and implementing other ideas, led to marginal improvements, with the best (lowest) score at 0.1089. Polynomial features with n=2 and n=1 scored slightly worse, at 0.1094 and 0.1099, respectively, while k-means clustering (0.1118) did not improve on the baseline.

Now, we will assess the efficacy of two of the best approaches: using categorical columns for groupby and transformation, and implementing additional ideas. We will conduct this evaluation using CatBoost’s built-in select_features, as outlined in Part 5. Let’s dive in…

Condition	Best mean OOFs	std OOFs
*Original*	*0.1107*	NA
Use categorical columns for groupby/transform	0.1089	0.0062
Create bins from continuous data and use groupby/transform	0.1093	0.0050
Implementing the rest of the ideas	0.1093	0.0046
Polynomial features (n=2)	0.1094	0.0046
Polynomial features (n=1)	0.1099	0.0050
After Outlier filter	0.1105	0.0045
k-means clustering	0.1118	0.0051
Sped up version	0.1125	0.0044
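
A rough way to put these numbers in perspective is to compare each change against the 0.1105 baseline with the fold-to-fold standard deviation. The sketch below re-enters the table values by hand; it is only a sanity check under that simple comparison, not a significance test:

```python
import pandas as pd

# Values copied from the summary table above
summary = pd.DataFrame(
    {
        "condition": [
            "groupby/transform on categorical columns",
            "bins + groupby/transform",
            "rest of the ideas",
            "polynomial features (n=2)",
            "polynomial features (n=1)",
            "k-means clustering",
        ],
        "mean_OOF": [0.1089, 0.1093, 0.1093, 0.1094, 0.1099, 0.1118],
        "std_OOF": [0.0062, 0.0050, 0.0046, 0.0046, 0.0050, 0.0051],
    }
)

baseline = 0.1105  # "After Outlier filter" row

summary = summary.assign(
    # Negative values mean an improvement over the baseline
    delta_vs_baseline=lambda df: (df.mean_OOF - baseline).round(4),
    # Is the change smaller than the spread between folds?
    within_one_std=lambda df: df.delta_vs_baseline.abs() < df.std_OOF,
)
print(summary)
```

Every change falls within one standard deviation of the fold scores, which is why we treat the improvements as marginal and rely on the feature selection step below rather than on the ranking alone.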

Final feature selection

Code

def prepare_df_for_final_feature_selection(X):
    return X.assign(
        city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
            "median"
        ),
        building_condition_group=lambda df: df.groupby("building_condition")[
            "yearly_theoretical_total_energy_consumption"
        ].transform("median"),
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )


X_final_feature_selection = prepare_df_for_final_feature_selection(X_wo_outliers)
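
To illustrate what the groupby/transform features contain, here is a minimal sketch on a made-up frame (city names and incomes are invented for the example). Each row receives the median of its own group, so the result keeps the original number of rows:

```python
import pandas as pd

# Invented example data
toy = pd.DataFrame(
    {
        "city": ["Ghent", "Ghent", "Ghent", "Leuven", "Leuven"],
        "cadastral_income": [900.0, 1100.0, 1000.0, 700.0, 900.0],
    }
)

# Same pattern as city_group: broadcast the per-city median back to every row
toy = toy.assign(
    city_group=lambda df: df.groupby("city")["cadastral_income"].transform("median")
)
print(toy)
```

Unlike agg, transform returns a result aligned with the original index, which is what allows it to be used directly inside assign.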

Code

X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_final_feature_selection,
    y_wo_outliers,
    test_size=0.2,
    random_state=utils.Configuration.seed,
)

Code

regressor = catboost.CatBoostRegressor(
    iterations=1000,
    cat_features=X_final_feature_selection.select_dtypes("object").columns.to_list(),
    random_seed=utils.Configuration.seed,
    loss_function="RMSE",
)

rfe_dict = regressor.select_features(
    algorithm="RecursiveByShapValues",
    shap_calc_type="Exact",
    X=X_train,
    y=y_train,
    eval_set=(X_val, y_val),
    features_for_select="0-25",
    num_features_to_select=1,
    steps=20,
    verbose=250,
    train_final_model=False,
    plot=True,
)

Learning rate set to 0.059655
Step #1 out of 20
0:  learn: 0.3071998    test: 0.3029816 best: 0.3029816 (0) total: 26.3ms   remaining: 26.2s
250:    learn: 0.0864288    test: 0.1123460 best: 0.1123460 (250)   total: 6.69s    remaining: 20s
500:    learn: 0.0680250    test: 0.1083075 best: 0.1082636 (493)   total: 13.8s    remaining: 13.7s
750:    learn: 0.0564192    test: 0.1079016 best: 0.1077757 (639)   total: 20.8s    remaining: 6.91s
999:    learn: 0.0482570    test: 0.1076466 best: 0.1075005 (964)   total: 28.2s    remaining: 0us

bestTest = 0.1075004858
bestIteration = 964

Shrink model to first 965 iterations.
Feature #21 eliminated
Feature #20 eliminated
Feature #23 eliminated
Feature #2 eliminated
Step #2 out of 20
0:  learn: 0.3070684    test: 0.3029727 best: 0.3029727 (0) total: 26.8ms   remaining: 26.8s
250:    learn: 0.0872651    test: 0.1136076 best: 0.1135979 (249)   total: 7.2s remaining: 21.5s
500:    learn: 0.0684021    test: 0.1105477 best: 0.1105477 (500)   total: 14.6s    remaining: 14.5s
750:    learn: 0.0574736    test: 0.1096400 best: 0.1096363 (721)   total: 22.7s    remaining: 7.51s
999:    learn: 0.0497732    test: 0.1094414 best: 0.1092710 (833)   total: 30.5s    remaining: 0us

bestTest = 0.1092709663
bestIteration = 833

Shrink model to first 834 iterations.
Feature #4 eliminated
Feature #22 eliminated
Feature #24 eliminated
Step #3 out of 20
0:  learn: 0.3058962    test: 0.3017255 best: 0.3017255 (0) total: 31.1ms   remaining: 31s
250:    learn: 0.0876574    test: 0.1136779 best: 0.1136779 (250)   total: 7.68s    remaining: 22.9s
500:    learn: 0.0699441    test: 0.1096189 best: 0.1096189 (500)   total: 15.4s    remaining: 15.4s
750:    learn: 0.0595959    test: 0.1080552 best: 0.1080552 (750)   total: 23s  remaining: 7.63s
999:    learn: 0.0515838    test: 0.1074155 best: 0.1074083 (998)   total: 30.5s    remaining: 0us

bestTest = 0.1074082848
bestIteration = 998

Shrink model to first 999 iterations.
Feature #12 eliminated
Feature #0 eliminated
Feature #7 eliminated
Step #4 out of 20
0:  learn: 0.3062525    test: 0.3017927 best: 0.3017927 (0) total: 25.3ms   remaining: 25.3s
250:    learn: 0.0903487    test: 0.1137526 best: 0.1137107 (249)   total: 7.68s    remaining: 22.9s
500:    learn: 0.0737682    test: 0.1106448 best: 0.1106448 (500)   total: 15.5s    remaining: 15.4s
750:    learn: 0.0629695    test: 0.1094495 best: 0.1094173 (739)   total: 23.6s    remaining: 7.84s
999:    learn: 0.0549456    test: 0.1092524 best: 0.1092519 (996)   total: 31.9s    remaining: 0us

bestTest = 0.1092518693
bestIteration = 996

Shrink model to first 997 iterations.
Feature #9 eliminated
Feature #3 eliminated
Step #5 out of 20
0:  learn: 0.3069279    test: 0.3031549 best: 0.3031549 (0) total: 23.2ms   remaining: 23.2s
250:    learn: 0.0909568    test: 0.1127742 best: 0.1127742 (250)   total: 8.44s    remaining: 25.2s
500:    learn: 0.0744859    test: 0.1100086 best: 0.1099750 (494)   total: 16.9s    remaining: 16.8s
750:    learn: 0.0645317    test: 0.1093485 best: 0.1092780 (733)   total: 25.7s    remaining: 8.51s
999:    learn: 0.0566761    test: 0.1092288 best: 0.1091824 (846)   total: 33.5s    remaining: 0us

bestTest = 0.1091824436
bestIteration = 846

Shrink model to first 847 iterations.
Feature #6 eliminated
Feature #11 eliminated
Step #6 out of 20
0:  learn: 0.3065169    test: 0.3023901 best: 0.3023901 (0) total: 34.2ms   remaining: 34.2s
250:    learn: 0.0908925    test: 0.1148633 best: 0.1148633 (250)   total: 8.37s    remaining: 25s
500:    learn: 0.0742620    test: 0.1125271 best: 0.1125271 (500)   total: 16.2s    remaining: 16.2s
750:    learn: 0.0645501    test: 0.1119094 best: 0.1118467 (745)   total: 24.1s    remaining: 8s
999:    learn: 0.0567811    test: 0.1114134 best: 0.1113041 (981)   total: 32.8s    remaining: 0us

bestTest = 0.1113040714
bestIteration = 981

Shrink model to first 982 iterations.
Feature #1 eliminated
Feature #25 eliminated
Step #7 out of 20
0:  learn: 0.3063087    test: 0.3019867 best: 0.3019867 (0) total: 33.3ms   remaining: 33.2s
250:    learn: 0.0901288    test: 0.1153040 best: 0.1153040 (250)   total: 8.26s    remaining: 24.7s
500:    learn: 0.0732513    test: 0.1131686 best: 0.1131049 (487)   total: 16.8s    remaining: 16.7s
750:    learn: 0.0631777    test: 0.1123835 best: 0.1122779 (743)   total: 25.1s    remaining: 8.31s
999:    learn: 0.0554371    test: 0.1123259 best: 0.1122143 (920)   total: 33.6s    remaining: 0us

bestTest = 0.1122143004
bestIteration = 920

Shrink model to first 921 iterations.
Feature #5 eliminated
Feature #8 eliminated
Step #8 out of 20
0:  learn: 0.3068859    test: 0.3026476 best: 0.3026476 (0) total: 3.81ms   remaining: 3.8s
250:    learn: 0.0944589    test: 0.1185560 best: 0.1185467 (248)   total: 819ms    remaining: 2.44s
500:    learn: 0.0768410    test: 0.1162574 best: 0.1160705 (474)   total: 1.77s    remaining: 1.76s
750:    learn: 0.0655894    test: 0.1157924 best: 0.1156926 (730)   total: 2.48s    remaining: 821ms
999:    learn: 0.0574991    test: 0.1152245 best: 0.1152077 (992)   total: 3.31s    remaining: 0us

bestTest = 0.1152077068
bestIteration = 992

Shrink model to first 993 iterations.
Feature #13 eliminated
Step #9 out of 20
0:  learn: 0.3068033    test: 0.3025327 best: 0.3025327 (0) total: 5.7ms    remaining: 5.7s
250:    learn: 0.1005112    test: 0.1226247 best: 0.1226247 (250)   total: 829ms    remaining: 2.47s
500:    learn: 0.0830670    test: 0.1205778 best: 0.1205238 (496)   total: 1.46s    remaining: 1.45s
750:    learn: 0.0720636    test: 0.1202553 best: 0.1201130 (689)   total: 2.24s    remaining: 744ms
999:    learn: 0.0640641    test: 0.1201366 best: 0.1201130 (689)   total: 3.14s    remaining: 0us

bestTest = 0.1201130073
bestIteration = 689

Shrink model to first 690 iterations.
Feature #19 eliminated
Step #10 out of 20
0:  learn: 0.3069152    test: 0.3030754 best: 0.3030754 (0) total: 5.21ms   remaining: 5.21s
250:    learn: 0.1032541    test: 0.1264019 best: 0.1264019 (250)   total: 915ms    remaining: 2.73s
500:    learn: 0.0881802    test: 0.1241960 best: 0.1241624 (469)   total: 1.86s    remaining: 1.85s
750:    learn: 0.0780515    test: 0.1238596 best: 0.1238062 (708)   total: 2.48s    remaining: 823ms
999:    learn: 0.0703171    test: 0.1242382 best: 0.1236966 (778)   total: 3.45s    remaining: 0us

bestTest = 0.1236965761
bestIteration = 778

Shrink model to first 779 iterations.
Feature #18 eliminated
Step #11 out of 20
0:  learn: 0.3073732    test: 0.3033657 best: 0.3033657 (0) total: 3.8ms    remaining: 3.79s
250:    learn: 0.1098708    test: 0.1286291 best: 0.1286073 (249)   total: 625ms    remaining: 1.87s
500:    learn: 0.0933699    test: 0.1259515 best: 0.1259515 (500)   total: 1.28s    remaining: 1.28s
750:    learn: 0.0831048    test: 0.1252235 best: 0.1252174 (749)   total: 1.91s    remaining: 634ms
999:    learn: 0.0750814    test: 0.1247368 best: 0.1246789 (969)   total: 2.76s    remaining: 0us

bestTest = 0.1246789365
bestIteration = 969

Shrink model to first 970 iterations.
Feature #10 eliminated
Step #12 out of 20
0:  learn: 0.3073693    test: 0.3032811 best: 0.3032811 (0) total: 67.5ms   remaining: 1m 7s
250:    learn: 0.1228026    test: 0.1393940 best: 0.1393940 (250)   total: 667ms    remaining: 1.99s
500:    learn: 0.1083876    test: 0.1359537 best: 0.1358055 (479)   total: 1.59s    remaining: 1.58s
750:    learn: 0.0987005    test: 0.1352919 best: 0.1352141 (731)   total: 2.17s    remaining: 721ms
999:    learn: 0.0914355    test: 0.1353158 best: 0.1351822 (880)   total: 3.22s    remaining: 0us

bestTest = 0.1351822259
bestIteration = 880

Shrink model to first 881 iterations.
Step #13 out of 20
0:  learn: 0.3073693    test: 0.3032811 best: 0.3032811 (0) total: 4.37ms   remaining: 4.36s
250:    learn: 0.1228026    test: 0.1393940 best: 0.1393940 (250)   total: 1s   remaining: 3s
500:    learn: 0.1083876    test: 0.1359537 best: 0.1358055 (479)   total: 1.9s remaining: 1.9s
750:    learn: 0.0987005    test: 0.1352919 best: 0.1352141 (731)   total: 2.5s remaining: 829ms
999:    learn: 0.0914355    test: 0.1353158 best: 0.1351822 (880)   total: 3.46s    remaining: 0us

bestTest = 0.1351822259
bestIteration = 880

Shrink model to first 881 iterations.
Feature #17 eliminated
Step #14 out of 20
0:  learn: 0.3071755    test: 0.3033561 best: 0.3033561 (0) total: 5.21ms   remaining: 5.2s
250:    learn: 0.1378278    test: 0.1523478 best: 0.1523478 (250)   total: 1.07s    remaining: 3.2s
500:    learn: 0.1235818    test: 0.1497932 best: 0.1497695 (496)   total: 1.99s    remaining: 1.98s
750:    learn: 0.1141955    test: 0.1498344 best: 0.1496295 (720)   total: 2.6s remaining: 861ms
999:    learn: 0.1071168    test: 0.1503970 best: 0.1496295 (720)   total: 3.6s remaining: 0us

bestTest = 0.1496294939
bestIteration = 720

Shrink model to first 721 iterations.
Step #15 out of 20
0:  learn: 0.3071755    test: 0.3033561 best: 0.3033561 (0) total: 5.07ms   remaining: 5.06s
250:    learn: 0.1378278    test: 0.1523478 best: 0.1523478 (250)   total: 1.14s    remaining: 3.39s
500:    learn: 0.1235818    test: 0.1497932 best: 0.1497695 (496)   total: 2.28s    remaining: 2.27s
750:    learn: 0.1141955    test: 0.1498344 best: 0.1496295 (720)   total: 2.98s    remaining: 989ms
999:    learn: 0.1071168    test: 0.1503970 best: 0.1496295 (720)   total: 3.87s    remaining: 0us

bestTest = 0.1496294939
bestIteration = 720

Shrink model to first 721 iterations.
Feature #14 eliminated
Step #16 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 4.26ms   remaining: 4.26s
250:    learn: 0.1519723    test: 0.1598175 best: 0.1598033 (248)   total: 1.04s    remaining: 3.09s
500:    learn: 0.1401927    test: 0.1580470 best: 0.1579744 (499)   total: 1.61s    remaining: 1.61s
750:    learn: 0.1327334    test: 0.1575546 best: 0.1575183 (731)   total: 2.7s remaining: 895ms
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 3.7s remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Step #17 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 2.43ms   remaining: 2.43s
250:    learn: 0.1519723    test: 0.1598175 best: 0.1598033 (248)   total: 581ms    remaining: 1.73s
500:    learn: 0.1401927    test: 0.1580470 best: 0.1579744 (499)   total: 1.35s    remaining: 1.35s
750:    learn: 0.1327334    test: 0.1575546 best: 0.1575183 (731)   total: 2.04s    remaining: 677ms
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 2.61s    remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Step #18 out of 20
0:  learn: 0.3069654    test: 0.3029158 best: 0.3029158 (0) total: 3ms  remaining: 3s
250:    learn: 0.1519723    test: 0.1598175 best: 0.1598033 (248)   total: 1.07s    remaining: 3.19s
500:    learn: 0.1401927    test: 0.1580470 best: 0.1579744 (499)   total: 1.99s    remaining: 1.99s
750:    learn: 0.1327334    test: 0.1575546 best: 0.1575183 (731)   total: 2.68s    remaining: 889ms
999:    learn: 0.1272584    test: 0.1579738 best: 0.1574747 (784)   total: 3.69s    remaining: 0us

bestTest = 0.1574747019
bestIteration = 784

Shrink model to first 785 iterations.
Feature #16 eliminated
Step #19 out of 20
0:  learn: 0.3099541    test: 0.3058748 best: 0.3058748 (0) total: 3.14ms   remaining: 3.13s
250:    learn: 0.2128800    test: 0.2156926 best: 0.2148488 (114)   total: 740ms    remaining: 2.22s
500:    learn: 0.2095531    test: 0.2179295 best: 0.2148488 (114)   total: 1.3s remaining: 1.3s
750:    learn: 0.2082375    test: 0.2193735 best: 0.2148488 (114)   total: 1.97s    remaining: 654ms
999:    learn: 0.2075860    test: 0.2206064 best: 0.2148488 (114)   total: 3.04s    remaining: 0us

bestTest = 0.2148488447
bestIteration = 114

Shrink model to first 115 iterations.
Step #20 out of 20
0:  learn: 0.3099541    test: 0.3058748 best: 0.3058748 (0) total: 2.15ms   remaining: 2.14s
250:    learn: 0.2128800    test: 0.2156926 best: 0.2148488 (114)   total: 583ms    remaining: 1.74s
500:    learn: 0.2095531    test: 0.2179295 best: 0.2148488 (114)   total: 1.75s    remaining: 1.74s
750:    learn: 0.2082375    test: 0.2193735 best: 0.2148488 (114)   total: 2.32s    remaining: 768ms
999:    learn: 0.2075860    test: 0.2206064 best: 0.2148488 (114)   total: 3.23s    remaining: 0us

bestTest = 0.2148488447
bestIteration = 114

Shrink model to first 115 iterations.
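
The log above reports eliminated features by index only. Assuming these indices refer to column positions in X_train, they can be mapped back to names with a small helper like the one sketched below. The column names here are placeholders, and the log excerpt is shortened; CatBoost’s documentation also describes the dictionary returned by select_features as listing the eliminated features, so parsing the log is just one option:

```python
import re

# Shortened excerpt of the training log shown above
log_excerpt = """\
Feature #21 eliminated
Feature #20 eliminated
Step #2 out of 20
Feature #4 eliminated
"""

# Placeholder names; in practice this would be list(X_train.columns)
columns = [f"col_{i}" for i in range(26)]

# Collect the indices in elimination order (least useful first)
eliminated_idx = [
    int(idx) for idx in re.findall(r"Feature #(\d+) eliminated", log_excerpt)
]

# Translate positions into column names
eliminated_names = [columns[i] for i in eliminated_idx]
print(eliminated_names)
```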

Based on our evaluation, it is recommended to retain the following features:

‘city_group’
‘building_condition_group’
‘energy_efficiency_1’
‘energy_efficiency_2’
‘bargain_1’
‘bargain_2’

However, we should remove the following two features since our analysis indicates that better features have been incorporated:

‘kitchen_type’
‘toilets’

These feature selections should help optimize our model even further.

Based on these insights, we crafted the prepare_data_for_modelling function, which is stored in the pre_process.py file. This function includes the feature engineering steps we discussed, setting the stage for effective modeling performance.

Code

def prepare_data_for_modelling(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Adds the engineered features (city_group, building_condition_group,
       energy_efficiency_1, energy_efficiency_2, bargain_1, bargain_2).
    4. Fills missing values in categorical variables with 'missing value'.
    5. Separates the features (X) and the target (y).
    6. Identifies and filters out outlier values based on LocalOutlierFactor.

    Args:
        df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
        Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).

    Example:
        >>> X, y = prepare_data_for_modelling(df)
    """

    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(
            price=lambda df: np.log10(df.price),
            city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
                "median"
            ),
            building_condition_group=lambda df: df.groupby("building_condition")[
                "yearly_theoretical_total_energy_consumption"
            ].transform("median"),
            energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
            / df.primary_energy_consumption,
            energy_efficiency_2=lambda df: df.primary_energy_consumption
            / df.living_area,
            bargain_1=lambda df: df.cadastral_income / df.bedrooms,
            bargain_2=lambda df: df.cadastral_income / df.living_area,
        )
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v2]
    y = processed_df[utils.Configuration.target_col]

    outlier_mask = pre_process.identify_outliers(X)

    X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
    y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)

    print(f"Shape of X and y with outliers: {X.shape}, {y.shape}")
    print(
        f"Shape of X and y without outliers: {X_wo_outliers.shape}, {y_wo_outliers.shape}"
    )

    return X_wo_outliers, y_wo_outliers

Code

X, y = pre_process.prepare_data_for_modelling(df)

Shape of X and y with outliers: (3660, 14), (3660,)
Shape of X and y without outliers: (3427, 14), (3427,)
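
Because the target returned by this function is the base 10 logarithm of the price, any prediction has to be converted back before it can be read as a price. The sketch below uses made-up prices and hypothetical predictions to show the round trip:

```python
import numpy as np

# Made-up prices and their log10-transformed targets
prices = np.array([250_000.0, 399_000.0, 1_200_000.0])
y_log = np.log10(prices)

# Hypothetical model predictions on the log10 scale
y_pred_log = y_log + np.array([0.01, -0.02, 0.0])

# Invert the transformation to get predictions in the original currency units
y_pred = 10**y_pred_log

# An error of e on the log10 scale is a multiplicative factor of 10**e
factor = 10**0.11
print(y_pred.round(0), round(factor, 2))
```

If the scores reported in this article are read as RMSE on the log10 scale, a value of about 0.11 corresponds to a typical multiplicative error of roughly 1.29 on the price itself.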

In this notebook, we began with an initial selection of 16 features based on the work in Part 5. As we conclude this article, we’ve found that we can streamline our feature set further by removing kitchen_type and toilets, which improved performance thanks to the addition of new features. While there’s potential for further optimization, such as dimensionality reduction, we are currently happy with our progress. In the next, and final, part, we will focus on fine-tuning our model for optimal predictive performance. See you there!