Predicting Belgian Real Estate Prices: Part 4: Building a Baseline Model

Author

Adam Cseresznye

Published

November 6, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

In the preceding Part 3, we focused on building a fundamental understanding of our data by characterizing the cleaned, scraped dataset. We examined feature cardinality, distributions, and potential correlations with our target variable, property price. In Part 4, we turn to the essential pre-processing steps that precede modeling: we will craft the necessary pipeline, assess multiple algorithms, and ultimately select a suitable baseline model. Let’s get started!

Note

You can access the project’s app through its Streamlit website.

Import data

Code
import time
import warnings
from pathlib import Path

import catboost
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost
from data import utils
from IPython.display import clear_output, display
from lets_plot import *
from lets_plot.mapping import as_discrete
from sklearn import (
    compose,
    dummy,
    ensemble,
    impute,
    linear_model,
    metrics,
    model_selection,
    pipeline,
    preprocessing,
    svm,
    tree,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm import tqdm

LetsPlot.setup_html()

Prepare dataframe before modelling

Read in the processed file

After importing our preprocessed dataframe, we drop the columns labeled “external_reference,” “ad_url,” “day_of_retrieval,” “website,” “reference_number_of_the_epc_report,” and “housenumber,” since these identifiers are unlikely to contribute predictive signal. We also log10-transform the target variable, price, to tame its right-skewed distribution; keep in mind that all model predictions will consequently be on the log scale.

Code
utils.seed_everything(utils.Configuration.seed)

df = (
    pd.read_parquet(
        utils.Configuration.INTERIM_DATA_PATH.joinpath(
            "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
        )
    )
    .sample(frac=1, random_state=utils.Configuration.seed)
    .reset_index(drop=True)
    .assign(price=lambda df: np.log10(df.price))
    .drop(
        columns=[
            "external_reference",
            "ad_url",
            "day_of_retrieval",
            "website",
            "reference_number_of_the_epc_report",
            "housenumber",
        ]
    )
)

print(f"Shape of dataframe after read-in a pre-processing: {df.shape}")
X = df.drop(columns=utils.Configuration.target_col)
y = df[utils.Configuration.target_col]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
Shape of dataframe after read-in and pre-processing: (3660, 50)
Shape of X: (3660, 49)
Shape of y: (3660,)
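
Since the target is log10-transformed, predictions will be on the log scale. Here is a minimal sketch of converting a value back to a euro price; the 10 ** inversion simply undoes the transform, and the printed number depends on the data:

Code
# The target is log10(price), so model outputs must be back-transformed
# with 10 ** value to recover prices in euros.
example_log_price = y.iloc[0]
print(f"log10(price) = {example_log_price:.3f} -> price = {10 ** example_log_price:,.0f}")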

Train-test split

Next, we partition the dataset into training and testing subsets using the model_selection.train_test_split method. Keeping a held-out test set for final evaluation, separate from the data used for training, is a fundamental practice in machine learning.

Code
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=utils.Configuration.seed
)

print(f"Shape of X-train: {X_train.shape}")
print(f"Shape of X-test: {X_test.shape}")
Shape of X-train: (2928, 49)
Shape of X-test: (732, 49)

Implementing the data-processing pipeline

In order to compare various machine learning algorithms effectively, our initial approach will involve constructing a straightforward pipeline. This pipeline’s primary objective is to segregate columns based on their data types, recognizing the need for distinct preprocessing steps for continuous (numerical) and categorical variables. To facilitate this process within our scikit-learn pipeline, we will begin by implementing a custom class named FeatureSelector.

The rationale behind this is to establish a structured approach to feature handling: the FeatureSelector class gives us a streamlined way to access and process columns based on their data types.

Code
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
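
As a quick sanity check, here is a minimal usage sketch; the demo_cols variable is introduced purely for illustration. The transformer should hand back a copy of just the requested columns.

Code
# Minimal usage sketch: select the first three columns of X_train
demo_cols = X_train.columns[:3].to_list()
FeatureSelector(demo_cols).fit_transform(X_train).head()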
Code
# Selecting columns by dtypes
# (.head() is enough for dtype inference and keeps the call cheap)

numerical_columns = X_train.head().select_dtypes("number").columns.to_list()
categorical_columns = X_train.head().select_dtypes("object").columns.to_list()

Addressing missing values is a crucial preliminary step in our machine learning pipeline, as certain algorithms are sensitive to data gaps. To handle this, we’ll employ imputation techniques tailored to the data types of the columns.

For numerical columns, we’ll adopt the “median” strategy for imputation. This approach involves replacing missing values with the median of the available data in the respective numerical column. It’s a robust choice for handling missing values in numerical data as it’s less sensitive to outliers.

Conversely, for categorical columns, we’ll opt for imputation using the most frequent values in each column. By filling in missing categorical data with the mode (most common value) for that column, we ensure that the imputed values align with the existing categorical distribution, preserving the integrity of the categorical features.

This systematic approach to imputation sets a solid foundation for subsequent machine learning algorithms, ensuring that our dataset is well-prepared for analysis and modeling.
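
Before wiring these strategies into pipelines, here is a toy sketch of what the two imputers do; the area and state columns are invented for illustration.

Code
# Toy illustration of the two imputation strategies (columns are made up):
toy = pd.DataFrame({"area": [90.0, 120.0, np.nan], "state": ["good", np.nan, "good"]})

num_imputed = impute.SimpleImputer(strategy="median").fit_transform(toy[["area"]])
cat_imputed = impute.SimpleImputer(strategy="most_frequent").fit_transform(toy[["state"]])

print(num_imputed.ravel())  # NaN replaced by the median: [ 90. 120. 105.]
print(cat_imputed.ravel())  # NaN replaced by the mode: ['good' 'good' 'good']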

Code
# Prepare pipelines for corresponding columns:
numerical_pipeline = pipeline.Pipeline(
    steps=[
        ("num_selector", FeatureSelector(numerical_columns)),
        ("imputer", impute.SimpleImputer(strategy="median")),
        ("std_scaler", preprocessing.MinMaxScaler()),
    ]
)

categorical_pipeline = pipeline.Pipeline(
    steps=[
        ("cat_selector", FeatureSelector(categorical_columns)),
        ("imputer", impute.SimpleImputer(strategy="most_frequent")),
        (
            "onehot",
            preprocessing.OneHotEncoder(handle_unknown="ignore", sparse_output=True),
        ),
    ]
)

Once we are satisfied with the individual pipelines designed for numerical and categorical feature processing, the next step is to merge them into a unified transformer using scikit-learn’s FeatureUnion.

Code
# Put all the pipelines inside a FeatureUnion:
data_preprocessing_pipeline = pipeline.FeatureUnion(
    n_jobs=-1,
    transformer_list=[
        ("numerical_pipeline", numerical_pipeline),
        ("categorical_pipeline", categorical_pipeline),
    ],
)
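
As a quick sanity check, we can fit the combined pipeline on the training set and inspect the shape of its output; the exact number of columns depends on how many categories the one-hot encoder finds.

Code
# Sanity check: fit the combined pipeline and inspect the transformed shape.
# The result is sparse because the one-hot step emits a sparse matrix.
transformed = data_preprocessing_pipeline.fit_transform(X_train)
print(transformed.shape)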

Compare the performance of several algorithms

We can now bring all of these components together and compare a range of algorithms under identical conditions. The procedure is as follows:

  1. Algorithm Selection: Choose a set of machine learning algorithms that you want to evaluate.

  2. Data Split: Utilize the ShuffleSplit method to generate randomized train and test indices for each cross-validation round. This ensures randomness in data selection, which is crucial for an unbiased evaluation.

  3. Model Training and Evaluation: For each selected algorithm, follow these steps:

    • Fit the model on the training data.
    • Evaluate the model using negative mean squared error, root mean squared log error, and the coefficient of determination as scoring metrics.
    • Record the training and test scores, as well as the standard deviation of scores to assess the model’s performance.
    • Measure the time taken to fit each model, which provides insights into computational performance.
Code
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)

    MLA = [
        linear_model.LinearRegression(),
        linear_model.SGDRegressor(),
        linear_model.PassiveAggressiveRegressor(),
        linear_model.RANSACRegressor(),
        linear_model.Lasso(),
        svm.SVR(),
        ensemble.GradientBoostingRegressor(),
        tree.DecisionTreeRegressor(),
        ensemble.RandomForestRegressor(),
        ensemble.ExtraTreesRegressor(),
        ensemble.AdaBoostRegressor(),
        catboost.CatBoostRegressor(silent=True),
        lgb.LGBMRegressor(verbose=-1),
        xgboost.XGBRegressor(verbosity=0),
        dummy.DummyClassifier(),  # caveat: DummyRegressor would be the conventional baseline for a regression task
    ]

    # note: this is an alternative to train_test_split
    cv_split = model_selection.ShuffleSplit(
        n_splits=10, test_size=0.3, train_size=0.6, random_state=0
    )  # run model 10x with 60/30 split intentionally leaving out 10%

    # create table to compare MLA metrics
    MLA_columns = [
        "MLA Name",
        "MLA Parameters",
        "MLA Train RMSE Mean",
        "MLA Test RMSE Mean",
        "MLA Train RMSLE Mean",
        "MLA Test RMSLE Mean",
        "MLA Train R2 Mean",
        "MLA Test R2 Mean",
        "MLA Time",
    ]
    MLA_compare = pd.DataFrame(columns=MLA_columns)

    # squared=False turns mean_squared_log_error into RMSLE
    # (recorded for reporting only, so greater_is_better is left at its default)
    RMSLE = {
        "RMSLE": metrics.make_scorer(metrics.mean_squared_log_error, squared=False)
    }

    # index through MLA and save performance to table
    row_index = 0
    for alg in tqdm(MLA):
        # set name and parameters
        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, "MLA Name"] = MLA_name
        MLA_compare.loc[row_index, "MLA Parameters"] = str(alg.get_params())

        model_pipeline = pipeline.Pipeline(
            steps=[
                ("data_preprocessing_pipeline", data_preprocessing_pipeline),
                ("model", alg),
            ]
        )

        cv_results = model_selection.cross_validate(
            model_pipeline,
            X_train,
            y_train,
            cv=cv_split,
            scoring={
                "RMSLE": RMSLE["RMSLE"],
                "r2": "r2",
                "neg_mean_squared_error": "neg_mean_squared_error",
            },
            return_train_score=True,
        )

        MLA_compare.loc[row_index, "MLA Time"] = cv_results["fit_time"].mean()
        MLA_compare.loc[row_index, "MLA Train Neg MSE Mean"] = cv_results[
            "train_neg_mean_squared_error"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test Neg MSE Mean"] = cv_results[
            "test_neg_mean_squared_error"
        ].mean()

        MLA_compare.loc[row_index, "MLA Train RMSLE Mean"] = cv_results[
            "train_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSLE Mean"] = cv_results[
            "test_RMSLE"
        ].mean()

        MLA_compare.loc[row_index, "MLA Train R2 Mean"] = cv_results["train_r2"].mean()
        MLA_compare.loc[row_index, "MLA Test R2 Mean"] = cv_results["test_r2"].mean()

        row_index += 1

        clear_output(wait=True)
        # display(MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True))
(
    MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True)
    .drop(columns="MLA Parameters")
    .convert_dtypes()
    .set_index("MLA Name")
    .style.set_table_styles(
        [
            {
                "selector": "th.col_heading",
                "props": "text-align: center; font-size: 1.0em;",
            },
            {"selector": "td", "props": "text-align: center;"},
            {
                "selector": "td:hover",
                "props": "font-style: italic; color: black; font-weight:bold; background-color : #ffffb3;",
            },
        ],
        overwrite=False,
    )
    .format(precision=3, thousands=",", decimal=".")
    .background_gradient(cmap="coolwarm", axis=0)
)
100%|██████████| 15/15 [12:46<00:00, 51.07s/it]
| MLA Name | MLA Train Neg MSE Mean | MLA Test Neg MSE Mean | MLA Train RMSLE Mean | MLA Test RMSLE Mean | MLA Train R2 Mean | MLA Test R2 Mean | MLA Time |
|---|---|---|---|---|---|---|---|
| CatBoostRegressor | -0.003 | -0.013 | 0.009 | 0.017 | 0.970 | 0.875 | 11.123 |
| LGBMRegressor | -0.002 | -0.015 | 0.007 | 0.018 | 0.978 | 0.862 | 0.469 |
| ExtraTreesRegressor | -0.000 | -0.016 | 0.000 | 0.019 | 1.000 | 0.847 | 22.974 |
| GradientBoostingRegressor | -0.010 | -0.016 | 0.015 | 0.019 | 0.910 | 0.846 | 1.798 |
| XGBRegressor | -0.001 | -0.016 | 0.005 | 0.019 | 0.990 | 0.843 | 0.655 |
| RandomForestRegressor | -0.002 | -0.017 | 0.007 | 0.019 | 0.978 | 0.840 | 19.630 |
| SVR | -0.008 | -0.024 | 0.013 | 0.023 | 0.925 | 0.772 | 0.317 |
| AdaBoostRegressor | -0.020 | -0.024 | 0.021 | 0.023 | 0.814 | 0.768 | 0.757 |
| DecisionTreeRegressor | -0.000 | -0.035 | 0.000 | 0.028 | 1.000 | 0.664 | 0.348 |
| PassiveAggressiveRegressor | -0.020 | -0.076 | 0.021 | 0.033 | 0.809 | 0.278 | 0.066 |
| SGDRegressor | -0.065 | -0.093 | 0.039 | 0.045 | 0.390 | 0.114 | 0.060 |
| Lasso | -0.106 | -0.105 | 0.049 | 0.049 | 0.000 | -0.001 | 0.057 |
| DummyClassifier | -0.145 | -0.147 | 0.056 | 0.056 | -0.372 | -0.400 | 0.104 |
| LinearRegression | -0.012 | -0.392 | 0.016 | | 0.887 | -2.707 | 1.249 |
| RANSACRegressor | -0.115 | -0.234 | | | -0.084 | -1.240 | 12.303 |

The table above clearly shows that the CatBoostRegressor performed exceptionally well, achieving the best test-set scores on MSE, RMSLE, and R2. It outperformed the LGBMRegressor, ExtraTreesRegressor, GradientBoostingRegressor, and even the XGBRegressor.
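
As a final check, here is a minimal sketch of refitting the winning pipeline on the full training split and scoring it on the held-out test set; the baseline_model name is introduced for illustration, the default CatBoost settings from the comparison are assumed, and the exact scores will depend on the run.

Code
# Refit the best baseline on the full training split and score on the test set:
baseline_model = pipeline.Pipeline(
    steps=[
        ("data_preprocessing_pipeline", data_preprocessing_pipeline),
        ("model", catboost.CatBoostRegressor(silent=True)),
    ]
)
baseline_model.fit(X_train, y_train)
y_pred = baseline_model.predict(X_test)

print(f"Test R2:    {metrics.r2_score(y_test, y_pred):.3f}")
print(f"Test RMSLE: {metrics.mean_squared_log_error(y_test, y_pred, squared=False):.3f}")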

In the next section, we will dive deeper into optimizing our model. This will involve refining model settings, enhancing features, and employing techniques to improve our overall predictive accuracy. Looking forward to seeing you there!