Predicting Belgian Real Estate Prices: Part 7: Fine-tuning our model

Predicting Belgian Real Estate Prices
Author

Adam Cseresznye

Published

November 9, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

In Part 6, we established a robust cross-validation strategy to consistently assess our model’s performance across multiple data folds. We also identified and managed potential outliers in our dataset. Additionally, we explored diverse feature engineering methods, creating and evaluating informative features to enhance our model’s predictive capabilities.

In this final segment, we’ll tune our hyperparameters with Optuna and finish by evaluating the final model’s performance on the held-out test set. Let’s dive in!

Note

You can access the project’s app through its Streamlit website.

Import data

Code
import gc
import os
from pathlib import Path
from typing import Any, List, Optional, Tuple

import catboost
import numpy as np
import pandas as pd
from data import pre_process, update_database, utils
from features import feature_engineering
from lets_plot import *
from lets_plot.mapping import as_discrete
from models import predict_model, train_model
from sklearn import metrics, model_selection
from tqdm.notebook import tqdm

LetsPlot.setup_html()
import pickle

import optuna

Up to this point, we’ve successfully obtained all the house advertisements from ImmoWeb. We’ve conducted data description, preselected features, established an effective data pre-processing pipeline, and identified the most suitable machine learning algorithm for this task. Additionally, we’ve engaged in feature engineering and carried out further feature selection to streamline our machine learning model. The final phase, which is yet to be done, involves fine-tuning the hyperparameters of our machine learning model, enhancing predictive accuracy while mitigating overfitting.

Prepare dataframe before modelling

Let’s get our data ready for modeling by applying the prepare_data_for_modelling function detailed in Part 6. A quick recap: this function performs the following steps to prepare a DataFrame for machine learning (a rough sketch follows the list):

  1. It randomly shuffles the DataFrame’s rows.
  2. The ‘price’ column is transformed to its base-10 logarithm.
  3. Missing values in categorical variables are replaced with ‘missing value’.
  4. It divides the data into features (X) and the target (y).
  5. Using LocalOutlierFactor, it identifies and removes outlier values.
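For orientation, here is a minimal sketch of what these steps could look like. This is not the actual prepare_data_for_modelling implementation from Part 6, just an illustrative approximation; in particular, it assumes the LocalOutlierFactor is fit on the numerical features only, with missing values filled in.

Code
from sklearn.neighbors import LocalOutlierFactor


def prepare_data_sketch(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    # 1. Shuffle the rows
    df = df.sample(frac=1, random_state=utils.Configuration.seed).reset_index(drop=True)
    # 2. Log10-transform the target
    df = df.assign(price=lambda d: np.log10(d.price))
    # 3. Replace missing categorical values
    cat_cols = df.select_dtypes("object").columns
    df[cat_cols] = df[cat_cols].fillna("missing value")
    # 4. Split into features (X) and target (y)
    X, y = df.drop(columns="price"), df["price"]
    # 5. Flag and drop outliers (assumption: LOF on numeric columns, NaNs filled)
    inliers = LocalOutlierFactor().fit_predict(X.select_dtypes("number").fillna(0)) == 1
    return X.loc[inliers].reset_index(drop=True), y.loc[inliers].reset_index(drop=True)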
Code
df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)

X, y = pre_process.prepare_data_for_modelling(df)
Shape of X and y with outliers: (3660, 14), (3660,)
Shape of X and y without outliers: (3427, 14), (3427,)

We’ll divide the data into training and testing sets. The training portion will be dedicated to hyperparameter tuning. To guard against overfitting during tuning, we’ll use cross-validation: the training set is repeatedly split into training and validation folds, and the validation fold drives early stopping so that training halts once validation performance stops improving. The test set will come into play later for evaluating our final model.

Code
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=utils.Configuration.seed,
)

print(f"Shape of X_train: {X_train.shape}, Shape of X_test: {X_test.shape}")
Shape of X_train: (2741, 14), Shape of X_test: (686, 14)

We’ll also create a handy helper function called dumper. This function enables us to save the best parameters discovered during tuning as a .pickle file, allowing us to load and utilize these parameters from the saved file at a later time.

Code
def dumper(file: Any, name: str) -> None:
    """
    Pickle and save an object to a file.

    This function takes an object and a name, then uses the Pickle library to serialize
    and save the object to a file with the given name. The file is saved in binary mode ('wb').

    Args:
        file (Any): The object to be pickled and saved.
        name (str): The name of the file, including the file extension, where the object will be saved.

    Returns:
        None: This function does not return a value.

    Example:
        To save an object to a file:
        >>> my_data = [1, 2, 3]
        >>> dumper(my_data, "my_data.pickle")

    Note:
        The file is saved in binary mode ('wb') to ensure compatibility and proper
        handling of binary data.
    """
    pickle.dump(file, open(f"{name}.pickle", "wb"))
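Loading the object back later is just as simple; a minimal counterpart could look like the sketch below (later in this notebook we use pd.read_pickle instead, which unpickles arbitrary objects just as well):

Code
def loader(name: str) -> Any:
    """Load an object previously saved with `dumper`."""
    with open(f"{name}.pickle", "rb") as f:
        return pickle.load(f)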

Hyperparameter tuning using Optuna

To identify the optimal settings, we’ll leverage the Optuna library. The key hyperparameters under consideration are iterations, depth, learning_rate, random_strength, bagging_temperature, l2_leaf_reg, and border_count.

Code
def objective(trial: optuna.Trial) -> float:
    """
    Optuna objective function for tuning CatBoost hyperparameters.

    This function takes an Optuna trial and explores hyperparameters for a CatBoost
    model to minimize the Root Mean Squared Error (RMSE) using K-Fold cross-validation.

    Parameters:
    - trial (optuna.Trial): Optuna trial object for hyperparameter optimization.

    Returns:
    - float: Mean RMSE across K-Fold cross-validation iterations.

    Example use case:
    # Create an Optuna study and optimize hyperparameters
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100)

    # Get the best hyperparameters
    best_params = study.best_params
    """
    catboost_params = {
        "iterations": trial.suggest_int("iterations", 10, 1000),
        "depth": trial.suggest_int("depth", 1, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 1),
        "random_strength": trial.suggest_float("random_strength", 1e-9, 10),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0, 1),
        "l2_leaf_reg": trial.suggest_int("l2_leaf_reg", 2, 30),
        "border_count": trial.suggest_int("border_count", 1, 255),
        "thread_count": os.cpu_count(),
    }

    results = []
    optuna.logging.set_verbosity(optuna.logging.WARNING)

    # Identify categorical features for CatBoost
    categorical_features = X_train.select_dtypes("object").columns.to_list()

    # Create a K-Fold cross-validator
    CV = model_selection.RepeatedKFold(
        n_splits=10, n_repeats=1, random_state=utils.Configuration.seed
    )

    # Tune on the training set only so the test set stays untouched;
    # iloc is used because KFold yields positional indices
    for train_fold_index, val_fold_index in CV.split(X_train):
        X_train_fold, X_val_fold = (
            X_train.iloc[train_fold_index],
            X_train.iloc[val_fold_index],
        )
        y_train_fold, y_val_fold = (
            y_train.iloc[train_fold_index],
            y_train.iloc[val_fold_index],
        )

        # Create CatBoost datasets
        catboost_train = catboost.Pool(
            X_train_fold,
            y_train_fold,
            cat_features=categorical_features,
        )
        catboost_valid = catboost.Pool(
            X_val_fold,
            y_val_fold,
            cat_features=categorical_features,
        )

        # Initialize and train the CatBoost model
        model = catboost.CatBoostRegressor(**catboost_params)
        model.fit(
            catboost_train,
            eval_set=[catboost_valid],
            early_stopping_rounds=utils.Configuration.early_stopping_round,
            verbose=utils.Configuration.verbose,
            use_best_model=True,
        )

        # Calculate OOF validation predictions
        valid_pred = model.predict(X_val_fold)

        RMSE_score = metrics.mean_squared_error(y_val_fold, valid_pred, squared=False)

        del (
            X_train_fold,
            y_train_fold,
            X_val_fold,
            y_val_fold,
            catboost_train,
            catboost_valid,
            model,
            valid_pred,
        )
        gc.collect()

        results.append(RMSE_score)
    return np.mean(results)
Note

Similar to Part 6, the hyperparameter optimization step was pre-computed because of the significant computational time it requires. The results were saved and are loaded here rather than recomputed during notebook rendering; the outcomes themselves are unchanged.

Code
%%script echo skipping

study = optuna.create_study(direction="minimize")
study.optimize(train_model.Optuna_Objective(X_train, y_train), n_trials=100, show_progress_bar=True)

dumper(study.best_params, "CatBoost_params")
dumper(study.best_value, "CatBoost_value")

As shown below, the best Out-Of-Fold (OOF) RMSE Optuna found with the selected parameters is 0.1060. Recall that in Part 6 our best OOF score was 0.1070, so this represents a slight improvement.

Code
catboost_params_optuna = pd.read_pickle("CatBoost_params.pickle")

print(
    f'The best OOF RMSE score of the hyperparameter tuning is {pd.read_pickle("CatBoost_value.pickle"):.4f}.'
)
print(f"The corresponding values: {catboost_params_optuna}")
The best OOF RMSE score of the hyperparameter tuning is 0.1060.
The corresponding values: {'iterations': 956, 'depth': 7, 'learning_rate': 0.050050595110243595, 'random_strength': 7.110744896133362, 'bagging_temperature': 0.024119607385698107, 'l2_leaf_reg': 3, 'border_count': 205}
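One caveat worth knowing: study.best_params contains only the hyperparameters that were actually searched via trial.suggest_*, so fixed settings passed alongside them in the objective (here, thread_count) are absent and can be re-added before retraining:

Code
# best_params holds only the searched hyperparameters;
# re-attach fixed settings such as thread_count before retraining
final_params = {**catboost_params_optuna, "thread_count": os.cpu_count()}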

Retrain using the best parameters and predict

After obtaining the optimal parameters, we can proceed to retrain our model on the entire training data, excluding the test portion, of course. For this, we can use the train_catboost function shown below.

Code
def train_catboost(
    X: pd.DataFrame, y: pd.Series, catboost_params: dict
) -> catboost.CatBoostRegressor:
    """
    Train a CatBoostRegressor model using the provided data and parameters.

    Parameters:
        X (pd.DataFrame): The feature dataset.
        y (pd.Series): The target variable.
        catboost_params (dict): CatBoost hyperparameters.

    Returns:
        CatBoostRegressor: The trained CatBoost model.

    This function takes the feature dataset `X`, the target variable `y`, and a dictionary of CatBoost
    hyperparameters. It automatically detects categorical features in the dataset and trains a CatBoostRegressor
    model with the specified parameters.

    Example:
        X, y = load_data()
        catboost_params = {
            'iterations': 1000,
            'learning_rate': 0.1,
            'depth': 6,
            # ... other hyperparameters ...
        }
        model = train_catboost(X, y, catboost_params)
    """
    categorical_features = X.select_dtypes("object").columns.to_list()

    catboost_train = catboost.Pool(
        X,
        y,
        cat_features=categorical_features,
    )

    model = catboost.CatBoostRegressor(**catboost_params)
    model.fit(
        catboost_train,
        verbose=utils.Configuration.verbose,
    )

    return model
Code
model = train_model.train_catboost(X_train, y_train, catboost_params_optuna)

Excellent! We’ve made good progress. Now it’s time for the final evaluation of our model on the test set. We can use the predict_catboost function for this.

Code
def predict_catboost(
    model: catboost.CatBoostRegressor,
    X: pd.DataFrame,
    thread_count: int = -1,
    verbose: Optional[int] = None,
) -> np.ndarray:
    """
    Make predictions using a CatBoost model on the provided dataset.

    Parameters:
        model (catboost.CatBoostRegressor): The trained CatBoost model.
        X (pd.DataFrame): The dataset for which predictions are to be made.
        thread_count (int, optional): The number of threads for prediction. Default is -1 (auto).
        verbose (int, optional): Verbosity level. Default is None.

    Returns:
        np.ndarray: Predicted values.

    This function takes a trained CatBoost model, a dataset `X`, and optional parameters for
    specifying the number of threads (`thread_count`) and verbosity (`verbose`) during prediction.
    It returns an array of predicted values.

    Example:
        model = load_catboost_model()
        X_new = load_new_data()
        predictions = predict_catboost(model, X_new, thread_count=4, verbose=2)
    """
    prediction = model.predict(data=X, thread_count=thread_count, verbose=verbose)
    return prediction

To assess the predictions, we’ll obtain both RMSE and R2 values.

Code
prediction = predict_model.predict_catboost(model=model, X=X_test)
Code
def score_prediction(y_true, y_pred):
    """
    Calculate regression evaluation metrics based on
    true and predicted values.

    Parameters:
        y_true (array-like): True target values.
        y_pred (array-like): Predicted values.

    Returns:
        tuple: A tuple containing Root Mean Squared Error (RMSE)
        and R-squared (R2).

    This function calculates RMSE and R2 to evaluate the goodness
    of fit between the true target values and predicted values.

    Example:
        y_true = [3, 5, 7, 9]
        y_pred = [2.8, 5.2, 7.1, 9.3]
        rmse, r2 = score_prediction(y_true, y_pred)
    """
    # Both y_true and y_pred are already on the log10 scale,
    # so RMSE can be computed directly
    RMSE = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
    R2 = metrics.r2_score(y_true, y_pred)

    return RMSE, R2

Superb! As you can see below, the test set yields an RMSE of 0.1101 and an R2 of 0.877. The test RMSE is slightly higher than the training (OOF) score, which is expected and indicates a small degree of overfitting. Despite our efforts, overfitting is hard to eliminate entirely; still, it appears that we’ve done well.

Code
predict_model.score_prediction(y_pred=prediction, y_true=y_test)
(0.1101569172024175, 0.8771378054646048)
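Since the targets are log10-transformed prices, the RMSE has a convenient multiplicative interpretation: an error of r in log10 space corresponds to a factor of 10^r in price. Here 10^0.1101 ≈ 1.29, so a typical prediction falls within a factor of about 1.29 of the true price.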

Let’s put the original values and predictions side by side in a DataFrame so that we can evaluate them visually as well.

Code
results = (
    pd.concat(
        [y_test.reset_index(drop=True), pd.Series(prediction)], axis="columns"
    ).rename(columns={"price": "original_values", 0: "predicted_values"})
    # .apply(lambda x: 10**x)
    .assign(residuals=lambda df: df.original_values - df.predicted_values)
)
results
original_values predicted_values residuals
0 5.201397 5.198018 0.003380
1 5.445604 5.352777 0.092827
2 6.040998 6.023609 0.017389
3 6.021189 6.137714 -0.116525
4 5.911158 5.862038 0.049119
... ... ... ...
681 5.812913 5.776833 0.036081
682 5.755875 5.678560 0.077315
683 5.290035 5.193649 0.096385
684 5.542825 5.545253 -0.002428
685 5.475671 5.543814 -0.068143

686 rows × 3 columns
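Keep in mind that all three columns are still on the log10 scale (the commented-out line in the cell above hints at the back-transformation). To inspect the values in euros, simply exponentiate:

Code
# Back-transform the log10 prices to EUR for easier inspection
results_eur = results[["original_values", "predicted_values"]].apply(lambda col: 10**col)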

As depicted below, our model generalizes well to unseen data, achieving a high R2 and a low RMSE. Examining the residuals reveals a roughly symmetric distribution centered on zero, indicating robust model performance.

Code
(
    results.pipe(
        lambda df: ggplot(df, aes("original_values", "predicted_values"))
        + geom_point()
        + geom_smooth()
        + geom_text(
            x=5,
            y=6.6,
            label=f"RMSE = {predict_model.score_prediction(y_pred=prediction, y_true=y_test)[0]:.4f}",
            fontface="bold",
        )
        + geom_text(
            x=4.965,
            y=6.5,
            label=f"R2 = {predict_model.score_prediction(y_pred=prediction, y_true=y_test)[1]:.4f}",
            fontface="bold",
        )
        + labs(
            title="Contrasting Predicted House Prices with Actual House Prices",
            subtitle=""" The plot suggests that the model makes accurate predictions on the test data. This is evident from the low RMSE values, 
            signifying a high level of accuracy. Additionally, the high R2 value indicates that the model effectively accounts for a 
            substantial portion of the data's variance, demonstrating a strong alignment between the model's predictions and the actual data.
            """,
            x="log10 True Prices (EUR)",
            y="log10 Predicted Prices (EUR)",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 600)
    )
)
Figure 1: Contrasting Predicted House Prices with Actual House Prices
Code
(
    results.pipe(lambda df: ggplot(df, aes("residuals")) + geom_histogram(stat="bin"))
    + labs(
        title="Assessing the Residuals from the Catboost Model",
        subtitle=""" Normally distributed residuals imply consistent and accurate model predictions, aligning with statistical assumptions.
            """,
        x="Distribution of Residuals",
        y="",
        caption="https://www.immoweb.be/",
    )
    + theme(
        plot_subtitle=element_text(
            size=12, face="italic"
        ),  # Customize subtitle appearance
        plot_title=element_text(size=15, face="bold"),  # Customize title appearance
    )
    + ggsize(800, 600)
)
Figure 2: Assessing the Residuals from the Catboost Model

And there you have it! We’ve reached the end of this series! 🥳🎆🎉🍾🍻🕺

Over these seven articles, we’ve shown how to build a reliable, high-performing machine learning model for real-time house price prediction. While there’s always room for improvement, such as geolocation-based feature engineering, blending, and stacking, our aim was to provide a comprehensive guide from start to finish. We hope you’ve enjoyed this journey and gained inspiration and insights for your own projects.

Until next time! 💻🐍🐼