In the preceding section, our emphasis was on establishing a fundamental understanding of our data through characterizing the cleaned scraped dataset. We dived into feature cardinality, distributions, and potential correlations with our target variable—property price. Moving on to Part 2, our agenda includes examining essential sample pre-processing steps before modeling. We will craft the necessary pipeline, assess multiple algorithms, and ultimately select a suitable baseline model. Let’s get started!
Note
You can explore the project’s app on its website. For more details, visit the GitHub repository.
utils.seed_everything(utils.Configuration.seed)

df = (
    pd.read_parquet(
        Path.cwd()
        .joinpath("data")
        .joinpath("2023-10-01_Processed_dataset_for_NB_use.gzip")
    )
    .sample(frac=1, random_state=utils.Configuration.seed)
    .reset_index(drop=True)
    .assign(price=lambda df: np.log10(df.price))
)

print(f"Shape of dataframe after read-in a pre-processing: {df.shape}")

X = df.drop(columns=utils.Configuration.target_col)
y = df[utils.Configuration.target_col]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
Shape of dataframe after read-in a pre-processing: (3660, 50)
Shape of X: (3660, 49)
Shape of y: (3660,)
Train-test split
Next, we partition the dataset into training and test subsets with scikit-learn's model_selection.train_test_split function, holding out 20% of the rows for testing. Keeping the two sets separate lets us evaluate the model on data it has not seen during training, a fundamental practice in machine learning.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=utils.Configuration.seed
)

print(f"Shape of X-train: {X_train.shape}")
print(f"Shape of X-test: {X_test.shape}")
Shape of X-train: (2928, 49)
Shape of X-test: (732, 49)
Implementing the data-processing pipeline
In order to compare various machine learning algorithms effectively, our initial approach will involve constructing a straightforward pipeline. This pipeline’s primary objective is to segregate columns based on their data types, recognizing the need for distinct preprocessing steps for continuous (numerical) and categorical variables. To facilitate this process within our scikit-learn pipeline, we will begin by implementing a custom class named FeatureSelector.
The rationale is to establish a structured approach to feature handling: the FeatureSelector class gives us a streamlined way to access and process columns based on their data types.
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
# Selecting columns by dtypes
numerical_columns = X_train.head().select_dtypes("number").columns.to_list()
categorical_columns = X_train.head().select_dtypes("object").columns.to_list()
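To make the dtype-based split concrete, here is a minimal sketch on a toy DataFrame. The column names and values are invented for illustration and are not taken from the project's dataset; the text column is given an explicit object dtype so that the behavior matches the selection shown above.

```python
import pandas as pd

# Toy frame with two numerical columns and one text column (hypothetical data).
toy = pd.DataFrame(
    {
        "area": [55.0, 120.5],
        "bedrooms": [1, 3],
        "city": pd.Series(["Gent", "Brugge"], dtype=object),
    }
)

# "number" matches integer and float columns; "object" matches the text column.
numeric_cols = toy.select_dtypes("number").columns.to_list()
text_cols = toy.select_dtypes("object").columns.to_list()

print(numeric_cols)
print(text_cols)
```

The two resulting lists can then be handed to separate preprocessing branches, one per data type.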
Addressing missing values is a crucial preliminary step in our machine learning pipeline, as certain algorithms are sensitive to data gaps. To handle this, we’ll employ imputation techniques tailored to the data types of the columns.
For numerical columns, we’ll adopt the “median” strategy for imputation. This approach involves replacing missing values with the median of the available data in the respective numerical column. It’s a robust choice for handling missing values in numerical data as it’s less sensitive to outliers.
Conversely, for categorical columns, we’ll opt for imputation using the most frequent values in each column. By filling in missing categorical data with the mode (most common value) for that column, we ensure that the imputed values align with the existing categorical distribution, preserving the integrity of the categorical features.
This systematic approach to imputation sets a solid foundation for subsequent machine learning algorithms, ensuring that our dataset is well-prepared for analysis and modeling.
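The two imputation strategies can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual pipeline definition: the `pick` helper is a hypothetical stand-in for the FeatureSelector class, the toy data is invented, and any further steps the real numerical_pipeline and categorical_pipeline may contain (such as scaling or encoding) are left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def pick(columns):
    """Hypothetical stand-in for FeatureSelector: keep only the given columns."""
    return FunctionTransformer(lambda frame: frame[columns])


# Invented toy data with one missing value per column.
toy = pd.DataFrame(
    {
        "area": [50.0, np.nan, 120.0, 80.0],
        "city": pd.Series(["Gent", "Gent", np.nan, "Brugge"], dtype=object),
    }
)

# Numerical branch: fill gaps with the column median.
numerical_pipeline = Pipeline(
    steps=[
        ("select", pick(["area"])),
        ("impute", SimpleImputer(strategy="median")),
    ]
)

# Categorical branch: fill gaps with the most frequent value.
categorical_pipeline = Pipeline(
    steps=[
        ("select", pick(["city"])),
        ("impute", SimpleImputer(strategy="most_frequent")),
    ]
)

num_out = numerical_pipeline.fit_transform(toy)
cat_out = categorical_pipeline.fit_transform(toy)

print(num_out.ravel())  # the missing area is replaced by the median of 50, 80, 120
print(cat_out.ravel())  # the missing city is replaced by the most common city
```

Because the imputers are fitted inside a pipeline, the median and the mode are learned from the training rows only and then reused on unseen data, which avoids leaking information from the test set.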
Once we are satisfied with the individual pipelines designed for numerical and categorical feature processing, the next step is to merge them into a unified pipeline using scikit-learn's FeatureUnion class, which applies each pipeline to the input and concatenates their outputs column-wise.
# Put all the pipelines inside a FeatureUnion:
data_preprocessing_pipeline = pipeline.FeatureUnion(
    n_jobs=-1,
    transformer_list=[
        ("numerical_pipeline", numerical_pipeline),
        ("categorical_pipeline", categorical_pipeline),
    ],
)
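To see how FeatureUnion combines the outputs of its transformers, here is a minimal sketch on a toy array. The two transformers are hypothetical placeholders built with FunctionTransformer; they only stand in for the numerical and categorical pipelines.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])


def first_column(X):
    """Keep the first column as a 2D array."""
    return X[:, [0]]


def second_column_doubled(X):
    """Keep the second column, multiplied by two, as a 2D array."""
    return 2 * X[:, [1]]


union = FeatureUnion(
    transformer_list=[
        ("first", FunctionTransformer(first_column)),
        ("second_doubled", FunctionTransformer(second_column_doubled)),
    ]
)

# Each transformer sees the full input; the results are stacked side by side.
combined = union.fit_transform(X_toy)
print(combined.shape)
print(combined)
```

Every transformer receives the same input, and the outputs are placed next to each other in the order given in transformer_list, so the row count stays the same while the column count is the sum of the branches.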
Compare the performance of several algorithms
With the preprocessing pipeline in place, we can now combine it with a range of estimators and compare their performance. The procedure consists of the following steps:
Algorithm Selection: We choose a set of machine learning algorithms that we want to evaluate.
Data Split: We use ShuffleSplit to generate randomized train and test indices. Each of the 10 splits trains on 60% of the training data and tests on 30%, intentionally leaving out the remaining 10%. Repeating the evaluation over several random splits gives a more reliable performance estimate than a single split.
Model Training and Evaluation: For each selected algorithm, we follow these steps:
Fit the model on the training data.
Evaluate the model with three scoring metrics: negative mean squared error (neg_mean_squared_error), root mean squared log error (mean_squared_log_error with squared=False), and the coefficient of determination (r2).
Record the mean training and test scores across the splits.
Measure the time taken to fit each model, which provides insight into computational performance.
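Before running the full comparison, the steps above can be tried out on a small synthetic dataset. The sketch below is a simplified illustration, not the project's evaluation code: the data is randomly generated, only two estimators are compared, and the `rmsle` helper is a hypothetical function that computes the root of mean_squared_log_error explicitly instead of relying on the `squared` argument.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic, strictly positive target (a stand-in for log10 prices).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.05, size=200)


def rmsle(y_true, y_pred):
    """Root mean squared log error (hypothetical helper)."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))


# 10 random splits, each training on 60% and testing on 30% of the rows.
cv_split = ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)
train_idx, test_idx = next(cv_split.split(X_demo))
split_sizes = (len(train_idx), len(test_idx))

summary = {}
for model in (LinearRegression(), DummyRegressor()):
    cv_results = cross_validate(
        model,
        X_demo,
        y_demo,
        cv=cv_split,
        scoring={
            "RMSLE": make_scorer(rmsle),
            "r2": "r2",
            "neg_mse": "neg_mean_squared_error",
        },
        return_train_score=True,
    )
    summary[type(model).__name__] = {
        # The scorer returns negative MSE: flip the sign, then take the root.
        "test_rmse": np.sqrt(-cv_results["test_neg_mse"]).mean(),
        "test_rmsle": cv_results["test_RMSLE"].mean(),
        "test_r2": cv_results["test_r2"].mean(),
        "fit_time": cv_results["fit_time"].mean(),
    }

print(split_sizes)
for name, scores in summary.items():
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

Including a DummyRegressor gives a floor for comparison: any useful model should beat a predictor that always returns the mean of the training target.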
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)

    MLA = [
        linear_model.LinearRegression(),
        linear_model.SGDRegressor(),
        linear_model.PassiveAggressiveRegressor(),
        linear_model.RANSACRegressor(),
        linear_model.Lasso(),
        svm.SVR(),
        ensemble.GradientBoostingRegressor(),
        tree.DecisionTreeRegressor(),
        ensemble.RandomForestRegressor(),
        ensemble.ExtraTreesRegressor(),
        ensemble.AdaBoostRegressor(),
        catboost.CatBoostRegressor(silent=True),
        lgb.LGBMRegressor(verbose=-1),
        xgboost.XGBRegressor(verbosity=0),
        dummy.DummyRegressor(),
    ]

    # note: this is an alternative to train_test_split
    cv_split = model_selection.ShuffleSplit(
        n_splits=10, test_size=0.3, train_size=0.6, random_state=0
    )  # run model 10x with 60/30 split intentionally leaving out 10%

    # create table to compare MLA metrics
    MLA_columns = [
        "MLA Name",
        "MLA Parameters",
        "MLA Train RMSE Mean",
        "MLA Test RMSE Mean",
        "MLA Train RMSLE Mean",
        "MLA Test RMSLE Mean",
        "MLA Train R2 Mean",
        "MLA Test R2 Mean",
        "MLA Time",
    ]
    MLA_compare = pd.DataFrame(columns=MLA_columns)

    # note: `squared=False` is deprecated in scikit-learn 1.4+ in favor of
    # metrics.root_mean_squared_log_error
    RMSLE = {
        "RMSLE": metrics.make_scorer(metrics.mean_squared_log_error, squared=False)
    }

    # index through MLA and save performance to table
    row_index = 0
    for alg in tqdm(MLA):
        # set name and parameters
        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, "MLA Name"] = MLA_name
        MLA_compare.loc[row_index, "MLA Parameters"] = str(alg.get_params())

        model_pipeline = pipeline.Pipeline(
            steps=[
                ("data_preprocessing_pipeline", data_preprocessing_pipeline),
                ("model", alg),
            ]
        )

        cv_results = model_selection.cross_validate(
            model_pipeline,
            X_train,
            y_train,
            cv=cv_split,
            scoring={
                "RMSLE": RMSLE["RMSLE"],
                "r2": "r2",
                "neg_mean_squared_error": "neg_mean_squared_error",
            },
            return_train_score=True,
        )

        MLA_compare.loc[row_index, "MLA Time"] = cv_results["fit_time"].mean()
        # the scorer returns negative MSE, so flip the sign and take the root to get RMSE
        MLA_compare.loc[row_index, "MLA Train RMSE Mean"] = np.sqrt(
            -cv_results["train_neg_mean_squared_error"]
        ).mean()
        MLA_compare.loc[row_index, "MLA Test RMSE Mean"] = np.sqrt(
            -cv_results["test_neg_mean_squared_error"]
        ).mean()
        MLA_compare.loc[row_index, "MLA Train RMSLE Mean"] = cv_results[
            "train_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Test RMSLE Mean"] = cv_results[
            "test_RMSLE"
        ].mean()
        MLA_compare.loc[row_index, "MLA Train R2 Mean"] = cv_results["train_r2"].mean()
        MLA_compare.loc[row_index, "MLA Test R2 Mean"] = cv_results["test_r2"].mean()

        row_index += 1
        clear_output(wait=True)

# display(MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True))
(
    MLA_compare.sort_values(by=["MLA Test RMSLE Mean"], ascending=True)
    .drop(columns="MLA Parameters")
    .convert_dtypes()
    .set_index("MLA Name")
    .style.set_table_styles(
        [
            {
                "selector": "th.col_heading",
                "props": "text-align: center; font-size: 1.0em;",
            },
            {"selector": "td", "props": "text-align: center;"},
            {
                "selector": "td:hover",
                "props": "font-style: italic; color: black; font-weight:bold; background-color : #ffffb3;",
            },
        ],
        overwrite=False,
    )
    .format(precision=3, thousands=",", decimal=".")
    .background_gradient(cmap="coolwarm", axis=0)
)
The table above clearly shows that the CatBoostRegressor has performed exceptionally well, achieving the best scores in RMSE, RMSLE, and R2 on the test set. It has outperformed the LGBMRegressor, ExtraTreesRegressor, GradientBoostingRegressor, and even the XGBRegressor.
In the next section, we will dive deeper into optimizing our model. This will involve refining model settings, enhancing features, and employing techniques to improve our overall predictive accuracy. Looking forward to seeing you there!