In the preceding section, our emphasis was on establishing a fundamental understanding of our data through characterizing the cleaned scraped dataset. We dived into feature cardinality, distributions, and potential correlations with our target variable—property price. Moving on to Part 2, our agenda includes examining essential sample pre-processing steps before modeling. We will craft the necessary pipeline, assess multiple algorithms, and ultimately select a suitable baseline model. Let’s get started!

Note

You can explore the project’s app on its website. For more details, visit the GitHub repository.

Check out the series for a deeper dive: - Part 1: Characterizing the Data - Part 2: Building a Baseline Model - Part 3: Feature Selection - Part 4: Feature Engineering - Part 5: Fine-Tuning

Import data

Code

import time
import warnings
from pathlib import Path

import catboost
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost
from IPython.display import clear_output, display
from lets_plot import *
from lets_plot.mapping import as_discrete
from sklearn import (
    compose,
    dummy,
    ensemble,
    impute,
    linear_model,
    metrics,
    model_selection,
    pipeline,
    preprocessing,
    svm,
    tree,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm import tqdm

from helper import utils

LetsPlot.setup_html()

Prepare dataframe before modelling

Read in the processed file

Code

utils.seed_everything(utils.Configuration.seed)

df = (
    pd.read_parquet(
        Path.cwd()
        .joinpath("data")
        .joinpath("2023-10-01_Processed_dataset_for_NB_use.gzip")
    )
    .sample(frac=1, random_state=utils.Configuration.seed)
    .reset_index(drop=True)
    .assign(price=lambda df: np.log10(df.price))
)

print(f"Shape of dataframe after read-in a pre-processing: {df.shape}")
X = df.drop(columns=utils.Configuration.target_col)
y = df[utils.Configuration.target_col]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Shape of dataframe after read-in a pre-processing: (3660, 50)
Shape of X: (3660, 49)
Shape of y: (3660,)

Train-test split

The subsequent phase in our data preparation involves the partitioning of our dataset into training and testing subsets. To accomplish this, we’ll use the model_selection.train_test_split method. This step ensures that we have distinct sets for model training and evaluation, a fundamental practice in machine learning.

Code

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=utils.Configuration.seed
)

print(f"Shape of X-train: {X_train.shape}")
print(f"Shape of X-test: {X_test.shape}")

Shape of X-train: (2928, 49)
Shape of X-test: (732, 49)

Implementing the data-processing pipeline

In order to compare various machine learning algorithms effectively, our initial approach will involve constructing a straightforward pipeline. This pipeline’s primary objective is to segregate columns based on their data types, recognizing the need for distinct preprocessing steps for continuous (numerical) and categorical variables. To facilitate this process within our scikit-learn pipeline, we will begin by implementing a custom class named FeatureSelector.

The rationale behind this is to establish a structured approach to feature handling. The FeatureSelector class will provide us with a streamlined means to access and process columns based on their data typess.

Code

class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)

Code

# Selecting columns by dtypes

numerical_columns = X_train.head().select_dtypes("number").columns.to_list()
categorical_columns = X_train.head().select_dtypes("object").columns.to_list()

Addressing missing values is a crucial preliminary step in our machine learning pipeline, as certain algorithms are sensitive to data gaps. To handle this, we’ll employ imputation techniques tailored to the data types of the columns.

For numerical columns, we’ll adopt the “median” strategy for imputation. This approach involves replacing missing values with the median of the available data in the respective numerical column. It’s a robust choice for handling missing values in numerical data as it’s less sensitive to outliers.

Conversely, for categorical columns, we’ll opt for imputation using the most frequent values in each column. By filling in missing categorical data with the mode (most common value) for that column, we ensure that the imputed values align with the existing categorical distribution, preserving the integrity of the categorical features.

This systematic approach to imputation sets a solid foundation for subsequent machine learning algorithms, ensuring that our dataset is well-prepared for analysis and modeling.

Code

# Prepare pipelines for corresponding columns:
numerical_pipeline = pipeline.Pipeline(
    steps=[
        ("num_selector", FeatureSelector(numerical_columns)),
        ("imputer", impute.SimpleImputer(strategy="median")),
        ("std_scaler", preprocessing.MinMaxScaler()),
    ]
)

categorical_pipeline = pipeline.Pipeline(
    steps=[
        ("cat_selector", FeatureSelector(categorical_columns)),
        ("imputer", impute.SimpleImputer(strategy="most_frequent")),
        (
            "onehot",
            preprocessing.OneHotEncoder(handle_unknown="ignore", sparse_output=True),
        ),
    ]
)

Once we are satisfied with the individual pipelines designed for numerical and categorical feature processing, the next step involves merging them into a unified pipeline using the FeatureUnion method provided by scikit-learn.

Code

# Put all the pipelines inside a FeatureUnion:
data_preprocessing_pipeline = pipeline.FeatureUnion(
    n_jobs=-1,
    transformer_list=[
        ("numerical_pipeline", numerical_pipeline),
        ("categorical_pipeline", categorical_pipeline),
    ],
)

Compare the performance of several algorithms

Bringing all these components together in our machine learning pipeline is the culmination of our data preparation and model evaluation process.

Algorithm SelectionWe choose a set of machine learning algorithms that we want to evaluate.
Data Split: Here we use the ShuffleSplit method to generate randomized indices for our data into training and test sets. This ensures randomness in data selection and is crucial for unbiased evaluation.
Model Training and Evaluation: For each selected algorwe followfollow these steps:
- Fit the model on the training data.
- Evaluate the model using negative mean squared error (neg_mean_squared_error), root mean squared log error (mean_squared_log_error) and coefficient of determination (r2_score) as the scoring metric.
- Record the training and test scores, as well as the standard deviation of scores
- Measure the time taken to fit each model, which provides insights into computational peformance.formance.

Code

with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)

    MLA = [
        linear_model.LinearRegression(),
        linear_model.SGDRegressor(),
        linear_model.PassiveAggressiveRegressor(),
        linear_model.RANSACRegressor(),
        linear_model.Lasso(),
        svm.SVR(),
        ensemble.GradientBoostingRegressor(),
        tree.DecisionTreeRegressor(),
        ensemble.RandomForestRegressor(),
        ensemble.ExtraTreesRegressor(),
        ensemble.AdaBoostRegressor(),
        catboost.CatBoostRegressor(silent=True),
        lgb.LGBMRegressor(verbose=-1),
        xgboost.XGBRegressor(verbosity=0),
        dummy.DummyRegressor(),
    ]

    # note: this is an alternative to train_test_split
    cv_split = model_selection.ShuffleSplit(
        n_splits=10, test_size=0.3, train_size=0.6, random_state=0
    )  # run model 10x with 60/30 split intentionally leaving out 10%

    # create table to compare MLA metrics
    MLA_columns = [
        "MLA Name",
        "MLA Parameters",
        "MLA Train RMSE Mean",
        "MLA Test RMSE Mean",
        "MLA Train RMSLE Mean",
        "MLA Test RMSLE Mean",
        "MLA Train R2 Mean",
        "MLA Test R2 Mean",
        "MLA Time",
    ]
    MLA_compare = pd.DataFrame(columns=MLA_columns)

    RMSLE = {
        "RMSLE": metrics.make_scorer(metrics.mean_squared_log_error, squared=False)
    }

    # index through MLA and save performance to table
    row_index = 0
    for alg in tqdm(MLA):
        # set name and parameters
        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, "MLA Name"] = MLA_name
        MLA_compare.loc[row_index, "MLA Parameters"] = str(alg.get_params())

        model_pipeline = pipeline.Pipeline(
            steps=[
                ("data_preprocessing_pipeline", data_preprocessing_pipeline),
                ("model", alg),
            ]
        )

        cv_results = model_selection.cross_validate(
            model_pipeline,
            X_train,
            y_train,
            cv=cv_split,
            scoring={
                "RMSLE": RMSLE["RMSLE"],
                "r2": "r2",
                "neg_mean_squared_error": "neg_mean_squared_error",
            },
            return_train_score=True,
        )

        MLA_compare.loc[row_index, "MLA Time"] = cv_results["fit_time"].mean()
        MLA_compare.loc[row_index, "MLA Train RMSE Mean"] = cv_results[
            "train_neg_mean_squared_error"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSE Mean"] = cv_results[
            "test_neg_mean_squared_error"
        ].mean()

        MLA_compare.loc[row_index, "MLA Train RMSLE Mean"] = cv_results[
            "train_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSLE Mean"] = cv_results[
            "test_RMSLE"
        ].mean()

        MLA_compare.loc[row_index, "MLA Train R2 Mean"] = cv_results["train_r2"].mean()
        MLA_compare.loc[row_index, "MLA Test R2 Mean"] = cv_results["test_r2"].mean()

        row_index += 1

        clear_output(wait=True)
        # display(MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True))
(
    MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True)
    .drop(columns="MLA Parameters")
    .convert_dtypes()
    .set_index("MLA Name")
    .style.set_table_styles(
        [
            {
                "selector": "th.col_heading",
                "props": "text-align: center; font-size: 1.0em;",
            },
            {"selector": "td", "props": "text-align: center;"},
            {
                "selector": "td:hover",
                "props": "font-style: italic; color: black; font-weight:bold; background-color : #ffffb3;",
            },
        ],
        overwrite=False,
    )
    .format(precision=3, thousands=",", decimal=".")
    .background_gradient(cmap="coolwarm", axis=0)
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [10:47<00:00, 43.16s/it]

	MLA Train RMSE Mean	MLA Test RMSE Mean	MLA Train RMSLE Mean	MLA Test RMSLE Mean	MLA Train R2 Mean	MLA Test R2 Mean	MLA Time
MLA Name
CatBoostRegressor	-0.003	-0.013	0.009	0.017	0.970	0.875	3.235
LGBMRegressor	-0.002	-0.015	0.007	0.018	0.978	0.862	0.135
GradientBoostingRegressor	-0.010	-0.016	0.015	0.019	0.910	0.846	1.621
ExtraTreesRegressor	-0.000	-0.016	0.000	0.019	1.000	0.846	11.774
XGBRegressor	-0.001	-0.016	0.005	0.019	0.990	0.843	0.177
RandomForestRegressor	-0.002	-0.017	0.007	0.019	0.978	0.839	14.718
SVR	-0.008	-0.024	0.013	0.023	0.925	0.772	0.417
AdaBoostRegressor	-0.020	-0.024	0.021	0.023	0.813	0.770	0.382
DecisionTreeRegressor	-0.000	-0.035	0.000	0.028	1.000	0.668	0.289
PassiveAggressiveRegressor	-0.020	-0.076	0.021	0.033	0.809	0.278	0.077
SGDRegressor	-0.065	-0.093	0.039	0.045	0.390	0.114	0.070
Lasso	-0.106	-0.105	0.049	0.049	0.000	-0.001	0.038
DummyRegressor	-0.106	-0.105	0.049	0.049	0.000	-0.001	0.025
LinearRegression	-0.012	-0.392	0.016		0.887	-2.707	2.690
RANSACRegressor	-0.119	-0.213			-0.122	-1.040	21.394

The table above clearly shows that the CatBoostRegressor has performed exceptionally well, achieving the best scores in RMSE, RMSLE, and R2 on the test set. It has outperformed the LGBMRegressor, ExtraTreesRegressor, GradientBoostingRegressor, and even the XGBRegressor.

In the next section, we will dive deeper into optimizing our model. This will involve refining model settings, enhancing features, and employing techniques to improve our overall predictive accuracy. Looking forward to seeing you there!