In the preceding Part 3, our emphasis was on establishing a fundamental understanding of our data through characterizing the cleaned scraped dataset. We delved into feature cardinality, distributions, and potential correlations with our target variable—property price. Moving on to Part 4, our agenda includes examining essential sample pre-processing steps before modeling. We will craft the necessary pipeline, assess multiple algorithms, and ultimately select a suitable baseline model. Let’s get started!
After importing our preprocessed dataframe, we shuffle the rows, log10-transform the target (price), and drop six columns: “external_reference,” “ad_url,” “day_of_retrieval,” “website,” “reference_number_of_the_epc_report,” and “housenumber.” Judging by their names, these are identifiers and scraping metadata rather than property characteristics, so removing them trims features that are unlikely to contribute to the model.
Code
utils.seed_everything(utils.Configuration.seed)

df = (
    pd.read_parquet(
        utils.Configuration.INTERIM_DATA_PATH.joinpath(
            "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
        )
    )
    .sample(frac=1, random_state=utils.Configuration.seed)
    .reset_index(drop=True)
    .assign(price=lambda df: np.log10(df.price))
    .drop(
        columns=[
            "external_reference",
            "ad_url",
            "day_of_retrieval",
            "website",
            "reference_number_of_the_epc_report",
            "housenumber",
        ]
    )
)

print(f"Shape of dataframe after read-in a pre-processing: {df.shape}")

X = df.drop(columns=utils.Configuration.target_col)
y = df[utils.Configuration.target_col]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
Shape of dataframe after read-in a pre-processing: (3660, 50)
Shape of X: (3660, 49)
Shape of y: (3660,)
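The read-in step also applies np.log10 to price, so every model downstream is trained on log-prices and its predictions have to be converted back before they can be read as euros. A minimal sketch of the transform and its inverse, on made-up prices; the values and the `to_euros` helper are illustrative assumptions and not part of the project's `utils` module:

```python
import numpy as np
import pandas as pd

# Hypothetical listing prices in euros; not taken from the scraped dataset.
toy = pd.DataFrame({"price": [100_000.0, 250_000.0, 1_000_000.0]})

# Same transform as in the read-in step: model the base-10 logarithm of price.
toy = toy.assign(log_price=lambda df: np.log10(df.price))


def to_euros(log_price):
    """Invert the log10 transform to express values in euros again."""
    return np.power(10.0, log_price)


recovered = to_euros(toy["log_price"])
print(toy["log_price"].round(3).tolist())
print(recovered.round(0).tolist())
```

Working on the log scale compresses the long right tail that prices typically have, at the cost of one extra back-transformation step when reporting predictions.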
Train-test split
Next, we partition the dataset into training and testing subsets with the model_selection.train_test_split function, holding out 20% of the rows for testing. Keeping the two sets separate means the final evaluation is done on data the models never saw during training, a fundamental practice in machine learning.
Code
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=utils.Configuration.seed
)

print(f"Shape of X-train: {X_train.shape}")
print(f"Shape of X-test: {X_test.shape}")
Shape of X-train: (2928, 49)
Shape of X-test: (732, 49)
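To see where the 2928/732 row counts come from, here is a small self-contained sketch on synthetic data with the same number of rows. The `SEED` constant is an assumed stand-in for utils.Configuration.seed, and the single `feature` column is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # assumed stand-in for utils.Configuration.seed

# Synthetic frame with the same number of rows as the dataset above.
X_toy = pd.DataFrame({"feature": np.arange(3660)})
y_toy = pd.Series(np.arange(3660), name="price")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=SEED
)

# The two index sets are disjoint and together cover every row.
overlap = set(X_tr.index) & set(X_te.index)
print(X_tr.shape, X_te.shape, len(overlap))
```

With test_size=0.2, 20% of 3660 rows (732) go to the test set and the remaining 2928 to the training set.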
Implementing the data-processing pipeline
In order to compare various machine learning algorithms effectively, our initial approach will involve constructing a straightforward pipeline. This pipeline’s primary objective is to segregate columns based on their data types, recognizing the need for distinct preprocessing steps for continuous (numerical) and categorical variables. To facilitate this process within our scikit-learn pipeline, we will begin by implementing a custom class named FeatureSelector.
The rationale is to handle features in a structured way: the FeatureSelector class gives each sub-pipeline a simple means of picking out the columns it should process, based on their data types.
Code
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
Code
# Selecting columns by dtypes
numerical_columns = X_train.head().select_dtypes("number").columns.to_list()
categorical_columns = X_train.head().select_dtypes("object").columns.to_list()
Addressing missing values is a crucial preliminary step in our machine learning pipeline, as certain algorithms are sensitive to data gaps. To handle this, we’ll employ imputation techniques tailored to the data types of the columns.
For numerical columns, we’ll adopt the “median” strategy for imputation. This approach involves replacing missing values with the median of the available data in the respective numerical column. It’s a robust choice for handling missing values in numerical data as it’s less sensitive to outliers.
Conversely, for categorical columns, we’ll opt for imputation using the most frequent values in each column. By filling in missing categorical data with the mode (most common value) for that column, we ensure that the imputed values align with the existing categorical distribution, preserving the integrity of the categorical features.
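After imputation, the numerical pipeline rescales each feature to a common range with MinMaxScaler, and the categorical pipeline one-hot encodes the categories, ignoring any category it did not see during fitting. With these steps in place, every algorithm in the comparison receives complete, numeric input.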
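The difference between the two imputation strategies is easiest to see on a tiny example. The sketch below uses made-up values and hypothetical column names (`living_area`, `building_condition`); it only illustrates how SimpleImputer behaves and is not the project's pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one outlier (900) and missing entries; values are made up.
num = pd.DataFrame({"living_area": [80.0, 100.0, np.nan, 120.0, 900.0]})
cat = pd.DataFrame(
    {"building_condition": ["good", "good", np.nan, "to renovate", "good"]},
    dtype=object,
)

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

num_filled = median_imputer.fit_transform(num)
cat_filled = mode_imputer.fit_transform(cat)

# The median of the observed values is 110; their mean would be 300,
# pulled upwards by the single outlier.
print(num_filled.ravel().tolist())
print(cat_filled.ravel().tolist())
```

The missing numerical entry is filled with 110 rather than the outlier-inflated mean of 300, and the missing categorical entry takes the most common label, "good".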
Once the individual pipelines for numerical and categorical features are in place, the next step is to merge them into a single preprocessing step using scikit-learn's FeatureUnion class, which runs both pipelines on the input and concatenates their outputs column-wise.
Code
# Put all the pipelines inside a FeatureUnion:
data_preprocessing_pipeline = pipeline.FeatureUnion(
    n_jobs=-1,
    transformer_list=[
        ("numerical_pipeline", numerical_pipeline),
        ("categorical_pipeline", categorical_pipeline),
    ],
)
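scikit-learn also ships compose.ColumnTransformer, which routes columns to transformers by dtype without a custom selector class. The sketch below is an alternative formulation of the same idea, not the approach used in this post; the toy frame, its column names, and the step names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with hypothetical column names and values.
toy = pd.DataFrame(
    {
        "living_area": [80.0, np.nan, 120.0, 100.0],
        "bedrooms": [2.0, 3.0, np.nan, 4.0],
        "province": pd.Series(
            ["Antwerp", "Liège", np.nan, "Antwerp"], dtype=object
        ),
    }
)

numeric = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", MinMaxScaler())]
)
categorical = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numeric, make_column_selector(dtype_include="number")),
        ("categorical", categorical, make_column_selector(dtype_include=object)),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

Two scaled numerical columns plus one indicator column per observed province give four output columns. Either formulation can be placed in front of a model inside a Pipeline.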
Compare the performance of several algorithms
With the preprocessing pipeline in place, we can now compare several algorithms under identical conditions. The procedure is as follows:
1. Algorithm Selection: Choose the set of machine learning algorithms to evaluate.
2. Data Split: Use the ShuffleSplit class to generate randomized train and test indices. In the code below, each of the 10 splits trains on 60% of the training data and evaluates on 30%, leaving the remaining 10% unused. Repeating the evaluation over several random splits gives a more stable estimate than a single split.
3. Model Training and Evaluation: For each selected algorithm, follow these steps:
   - Fit the model on the training data.
   - Evaluate the model using negative mean squared error, root mean squared log error (RMSLE), and the coefficient of determination (R2) as scoring metrics.
   - Record the mean training and test scores across the splits to assess the model's performance.
   - Measure the time taken to fit each model, which provides insight into computational cost.
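For reference, the three metrics named above are conventionally defined as follows for $n$ samples with targets $y_i$, predictions $\hat{y}_i$, and target mean $\bar{y}$. The RMSLE is written the way scikit-learn's mean_squared_log_error computes it, with the natural logarithm of one plus the value. Note that in this post $y_i$ is already the log10-transformed price, and that the neg_mean_squared_error scorer reports $-\mathrm{MSE}$ so that higher is always better.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln(1 + y_i) - \ln(1 + \hat{y}_i)\right)^2}
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$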
Code
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)

    MLA = [
        linear_model.LinearRegression(),
        linear_model.SGDRegressor(),
        linear_model.PassiveAggressiveRegressor(),
        linear_model.RANSACRegressor(),
        linear_model.Lasso(),
        svm.SVR(),
        ensemble.GradientBoostingRegressor(),
        tree.DecisionTreeRegressor(),
        ensemble.RandomForestRegressor(),
        ensemble.ExtraTreesRegressor(),
        ensemble.AdaBoostRegressor(),
        catboost.CatBoostRegressor(silent=True),
        lgb.LGBMRegressor(verbose=-1),
        xgboost.XGBRegressor(verbosity=0),
        # naive baseline for a continuous target: always predicts the training mean
        dummy.DummyRegressor(),
    ]

    # note: this is an alternative to train_test_split
    cv_split = model_selection.ShuffleSplit(
        n_splits=10, test_size=0.3, train_size=0.6, random_state=0
    )  # run model 10x with 60/30 split intentionally leaving out 10%

    # create table to compare MLA metrics
    MLA_columns = [
        "MLA Name",
        "MLA Parameters",
        "MLA Train RMSE Mean",
        "MLA Test RMSE Mean",
        "MLA Train RMSLE Mean",
        "MLA Test RMSLE Mean",
        "MLA Train R2 Mean",
        "MLA Test R2 Mean",
        "MLA Time",
    ]
    MLA_compare = pd.DataFrame(columns=MLA_columns)

    RMSLE = {
        "RMSLE": metrics.make_scorer(metrics.mean_squared_log_error, squared=False)
    }

    # index through MLA and save performance to table
    row_index = 0
    for alg in tqdm(MLA):
        # set name and parameters
        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, "MLA Name"] = MLA_name
        MLA_compare.loc[row_index, "MLA Parameters"] = str(alg.get_params())

        model_pipeline = pipeline.Pipeline(
            steps=[
                ("data_preprocessing_pipeline", data_preprocessing_pipeline),
                ("model", alg),
            ]
        )

        cv_results = model_selection.cross_validate(
            model_pipeline,
            X_train,
            y_train,
            cv=cv_split,
            scoring={
                "RMSLE": RMSLE["RMSLE"],
                "r2": "r2",
                "neg_mean_squared_error": "neg_mean_squared_error",
            },
            return_train_score=True,
        )

        MLA_compare.loc[row_index, "MLA Time"] = cv_results["fit_time"].mean()
        # Note: the "RMSE" columns store the mean of the neg_mean_squared_error
        # scorer, i.e. the negated MSE, not its square root.
        MLA_compare.loc[row_index, "MLA Train RMSE Mean"] = cv_results[
            "train_neg_mean_squared_error"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSE Mean"] = cv_results[
            "test_neg_mean_squared_error"
        ].mean()
        MLA_compare.loc[row_index, "MLA Train RMSLE Mean"] = cv_results[
            "train_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSLE Mean"] = cv_results[
            "test_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Train R2 Mean"] = cv_results["train_r2"].mean()
        MLA_compare.loc[row_index, "MLA Test R2 Mean"] = cv_results["test_r2"].mean()

        row_index += 1
        clear_output(wait=True)
        # display(MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True))

# style the comparison table, best (lowest) test RMSLE first
(
    MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True)
    .drop(columns="MLA Parameters")
    .convert_dtypes()
    .set_index("MLA Name")
    .style.set_table_styles(
        [
            {
                "selector": "th.col_heading",
                "props": "text-align: center; font-size: 1.0em;",
            },
            {"selector": "td", "props": "text-align: center;"},
            {
                "selector": "td:hover",
                "props": "font-style: italic; color: black; font-weight:bold; background-color : #ffffb3;",
            },
        ],
        overwrite=False,
    )
    .format(precision=3, thousands=",", decimal=".")
    .background_gradient(cmap="coolwarm", axis=0)
)
In the resulting comparison table, the CatBoostRegressor achieves the best scores for RMSE, RMSLE, and R2 on the test splits. It outperforms the LGBMRegressor, ExtraTreesRegressor, GradientBoostingRegressor, and even the XGBRegressor, which makes it a natural choice for our baseline model.
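As a compact illustration of the same evaluation loop, the sketch below runs ShuffleSplit with cross_validate on synthetic data. It differs from the code above in a few labeled ways: it uses make_regression data instead of the housing dataset, compares only a naive baseline against a linear model, uses the built-in neg_root_mean_squared_error scorer and flips its sign to get a true RMSE, and also records the standard deviation across splits. Treat it as a sketch under those assumptions rather than the post's benchmark:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic regression data standing in for the real features and log-prices.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# Ten random 60/30 splits; the remaining 10% of rows is unused in each split.
cv = ShuffleSplit(n_splits=10, train_size=0.6, test_size=0.3, random_state=0)

summary = {}
for model in (DummyRegressor(strategy="mean"), LinearRegression()):
    scores = cross_validate(
        model,
        X_syn,
        y_syn,
        cv=cv,
        scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
        return_train_score=True,
    )
    # scikit-learn negates error metrics so that higher is better;
    # flip the sign back to obtain the RMSE itself.
    test_rmse = -scores["test_rmse"]
    summary[type(model).__name__] = {
        "test_rmse_mean": test_rmse.mean(),
        "test_rmse_std": test_rmse.std(),
        "test_r2_mean": scores["test_r2"].mean(),
        "fit_time_mean": scores["fit_time"].mean(),
    }

for name, row in summary.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

Any candidate model should beat the DummyRegressor row. A model that does not is a sign of a problem in the features or the pipeline, not merely a weak algorithm.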
In the next section, we will dive deeper into optimizing our model. This will involve refining model settings, enhancing features, and employing techniques to improve our overall predictive accuracy. Looking forward to seeing you there!
Source Code
---
title: 'Predicting Belgian Real Estate Prices: Part 4: Building a Baseline Model'
author: Adam Cseresznye
date: '2023-11-06'
categories:
  - Predicting Belgian Real Estate Prices
jupyter: python3
toc: true
format:
  html:
    code-fold: true
    code-tools: true
---

![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}

::: {.callout-note}
You can access the project's app through its [Streamlit website](https://belgian-house-price-predictor.streamlit.app/).
:::

# Import data

```{python}
#| editable: true
#| slideshow: {slide_type: ''}
#| tags: []
import time
import warnings
from pathlib import Path

import catboost
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost
from data import utils
from IPython.display import clear_output, display
from lets_plot import *
from lets_plot.mapping import as_discrete
from sklearn import (
    compose,
    dummy,
    ensemble,
    impute,
    linear_model,
    metrics,
    model_selection,
    pipeline,
    preprocessing,
    svm,
    tree,
)
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm import tqdm

LetsPlot.setup_html()
```

# Implementing the data-processing pipeline

```{python}
# Prepare pipelines for corresponding columns:
numerical_pipeline = pipeline.Pipeline(
    steps=[
        ("num_selector", FeatureSelector(numerical_columns)),
        ("imputer", impute.SimpleImputer(strategy="median")),
        ("std_scaler", preprocessing.MinMaxScaler()),
    ]
)

categorical_pipeline = pipeline.Pipeline(
    steps=[
        ("cat_selector", FeatureSelector(categorical_columns)),
        ("imputer", impute.SimpleImputer(strategy="most_frequent")),
        (
            "onehot",
            preprocessing.OneHotEncoder(handle_unknown="ignore", sparse_output=True),
        ),
    ]
)
```