In Part 5, we looked at the significance of features in the initial scraped dataset using both the feature_importances_ attribute of CatBoostRegressor and SHAP values. We conducted feature elimination based on their importance and predictive capability.
In this upcoming phase, we’ll implement a robust cross-validation strategy to accurately and consistently evaluate our model’s performance across multiple folds of the data. We will also identify and address potential outliers within our dataset, which is crucial to prevent their undue influence on the model’s predictions.
Additionally, we’ll further refine and expand our feature engineering efforts by exploring new methodologies to create informative features that bolster our model’s predictive capabilities. Looking forward to these pivotal steps!
Prepare dataframe before modelling
Read in dataframe
Drawing from our findings in notebook 4, particularly with regards to our initial feature reduction efforts, we’ve developed a function named “prepare_data.” This function resides in the pre_process.py file, ensuring its reusability. The function performs essential data preprocessing steps, which include:
Randomly shuffling the rows in the DataFrame.
Transforming the ‘price’ column by taking the base 10 logarithm.
Handling missing values in categorical variables by replacing them with ‘missing value.’
Separating the dataset into features (X) and the target variable (y).
Let’s dive into the details of this function and prepare our X and y for the subsequent processing pipeline.
def prepare_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by
    performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Fills missing values in categorical variables with 'missing value'.
    4. Separates the features (X) and the target (y).

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
    - Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X)
      and the target (y).

    Example use case:
        # Load your dataset into a DataFrame (e.g., df)
        df = load_data()

        # Prepare the data for modeling
        X, y = prepare_data(df)

        # Now you can use X and y for machine learning tasks.
    """
    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(price=lambda df: np.log10(df.price))
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v1]
    y = processed_df[utils.Configuration.target_col]

    print(f"Shape of X and y: {X.shape}, {y.shape}")

    return X, y
X, y = prepare_data(df)
Shape of X and y: (3660, 16), (3660,)
Cross-validation strategy
Our next critical step is to establish a well-structured cross-validation strategy. This step is imperative as it enables us to assess the effectiveness of various feature engineering approaches without risking overfitting our model. A robust cross-validation strategy ensures that our model’s performance evaluations are reliable and that the insights gained are generalizable to new data. To accomplish this, we will employ RepeatedKFold validation, setting the parameters with n_splits as 10 and n_repeats as 1.
In essence, this configuration means that we will perform a 10-fold cross-validation and run the whole procedure a single time (n_repeats=1), giving ten train/validation splits in total. Importantly, due to the modular nature of this function, we retain the flexibility to easily adapt and alter the design of our cross-validation strategy as needed.
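To make the fold arithmetic concrete, here is a minimal pure-Python sketch of how repeated k-fold splitting can be thought of. The helper `repeated_kfold_indices` and its seed are hypothetical and only illustrate the idea; the function in this post relies on scikit-learn's RepeatedKFold, whose shuffling details differ.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold_indices(
    n_samples: int, n_splits: int = 10, n_repeats: int = 1, seed: int = 42
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # each repeat uses a fresh shuffle
        # Fold sizes differ by at most one when n_samples is not divisible by n_splits
        fold_sizes = [
            n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
            for i in range(n_splits)
        ]
        start = 0
        for size in fold_sizes:
            validation = indices[start:start + size]
            train = indices[:start] + indices[start + size:]
            yield train, validation
            start += size


# 3660 rows, 10 splits, 1 repeat: ten folds with 366 validation rows each
folds = list(repeated_kfold_indices(n_samples=3660, n_splits=10, n_repeats=1))
```

Every row lands in exactly one validation fold per repeat, so the total number of model fits is n_splits multiplied by n_repeats.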
def run_catboost_CV(
    X: pd.DataFrame,
    y: pd.Series,
    n_splits: int = 10,
    n_repeats: int = 1,
    pipeline: Optional[object] = None,
) -> Tuple[float, float]:
    """
    Perform Cross-Validation with CatBoost for regression.

    This function conducts Cross-Validation using CatBoost for regression tasks.
    It iterates through folds, trains CatBoost models, and computes the mean and
    standard deviation of the Root Mean Squared Error (RMSE) scores across folds.

    Parameters:
    - X (pd.DataFrame): The feature matrix.
    - y (pd.Series): The target variable.
    - n_splits (int, optional): The number of splits in K-Fold cross-validation.
      Defaults to 10.
    - n_repeats (int, optional): The number of times the K-Fold cross-validation
      is repeated. Defaults to 1.
    - pipeline (object, optional): Optional data preprocessing pipeline. If provided,
      it's applied to the data before training the model. Defaults to None.

    Returns:
    - Tuple[float, float]: A tuple containing the mean RMSE and standard deviation
      of RMSE scores across cross-validation folds.

    Example:
        # Load your feature matrix (X) and target variable (y)
        X, y = load_data()

        # Perform Cross-Validation with CatBoost
        mean_rmse, std_rmse = run_catboost_CV(
            X, y, n_splits=5, n_repeats=2, pipeline=data_pipeline
        )
        print(f"Mean RMSE: {mean_rmse:.4f}")
        print(f"Standard Deviation of RMSE: {std_rmse:.4f}")

    Notes:
    - Ensure that the input data `X` and `y` are properly preprocessed and do not
      contain any missing values.
    - The function uses CatBoost for regression with optional data preprocessing
      via the `pipeline`.
    - RMSE is a common metric for regression tasks, and lower values indicate
      better model performance.
    """
    results = []

    # Extract feature names and data types
    features = X.columns[~X.columns.str.contains("price")]
    numerical_features = X.select_dtypes("number").columns.to_list()
    categorical_features = X.select_dtypes("object").columns.to_list()

    # Create a K-Fold cross-validator
    CV = model_selection.RepeatedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=utils.Configuration.seed
    )

    for train_fold_index, val_fold_index in tqdm(CV.split(X)):
        X_train_fold, X_val_fold = X.loc[train_fold_index], X.loc[val_fold_index]
        y_train_fold, y_val_fold = y.loc[train_fold_index], y.loc[val_fold_index]

        # Apply optional data preprocessing pipeline
        if pipeline is not None:
            X_train_fold = pipeline.fit_transform(X_train_fold)
            X_val_fold = pipeline.transform(X_val_fold)

        # Create CatBoost datasets
        catboost_train = Pool(
            X_train_fold,
            y_train_fold,
            cat_features=categorical_features,
        )
        catboost_valid = Pool(
            X_val_fold,
            y_val_fold,
            cat_features=categorical_features,
        )

        # Initialize and train the CatBoost model
        model = catboost.CatBoostRegressor(**utils.Configuration.catboost_params)
        model.fit(
            catboost_train,
            eval_set=[catboost_valid],
            early_stopping_rounds=utils.Configuration.early_stopping_round,
            verbose=utils.Configuration.verbose,
            use_best_model=True,
        )

        # Calculate OOF validation predictions
        valid_pred = model.predict(X_val_fold)
        RMSE_score = metrics.mean_squared_error(y_val_fold, valid_pred, squared=False)
        results.append(RMSE_score)

    return np.mean(results), np.std(results)
Now, let’s proceed to train our model with the updated settings:
train_model.run_catboost_CV(X, y)
(0.11251233080551612, 0.004459362099695207)
Note
Note that we’ve reduced the number of iterations in Notebook 6 compared to Notebook 5 to minimize the training duration.
In Notebook 5: iterations = 1000, default learning rate = 0.03
In Notebook 6: iterations = 100, learning rate = 0.2
The performance of the baseline model: 0.1125
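Because the target is the base 10 logarithm of the price, this RMSE is expressed in log10 units. The short sketch below shows one rough way to read it as a multiplicative error on the original price scale. The 300,000 EUR listing price is a hypothetical value chosen only for illustration, and the resulting band is an informal reading of the score, not a confidence interval.

```python
# RMSE reported by the baseline cross-validation, in log10(price) units
rmse_log10 = 0.1125

# An error of 0.1125 in log10 space corresponds to multiplying or dividing
# the price by 10 ** 0.1125 on the original scale.
error_factor = 10 ** rmse_log10

# Hypothetical listing price, used only to illustrate the scale of the error
price = 300_000
low, high = price / error_factor, price * error_factor

print(f"Multiplicative error factor: {error_factor:.3f}")
print(f"Rough band around {price} EUR: {low:,.0f} to {high:,.0f} EUR")
```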
Outlier detection
An outlier is a data point that significantly differs from the rest of the data. One common way to define an outlier is a data point that falls more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. Detecting and removing outliers from the dataset is crucial for building a stable model that can effectively generalize to new data.
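The 1.5 × IQR rule described above can be sketched with the standard library alone. The values below are toy numbers, not taken from the dataset, and the quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may give slightly different fences.

```python
from statistics import quantiles
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Toy data: nine ordinary values and one extreme value
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = iqr_outliers(toy_values)
print(flagged)
```

This rule looks at one variable at a time, which is one reason the post moves on to a multivariate method for the actual filtering.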
When we create a scatter plot of our features (as shown in Figure 1), such as cadastral income against living area, and adjust the points’ color and size based on price, we can identify at least two data points that notably deviate from the expected range of values. One data point suggests a 300 m2 property is associated with a cadastral income exceeding 320,000 EUR, while the other point indicates a 2,500 EUR cadastral income for an 11,000 m2 property. Both observations seem implausible when compared to the majority of data points on the graph.
pd.concat([X, y], axis=1).pipe(
    lambda df: ggplot(
        df, aes("cadastral_income", "living_area", fill="price", size="price")
    )
    + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
    + scale_fill_continuous(low="#1a9641", high="#d7191c")
    + labs(
        title="Assessing Potential Outliers",
        subtitle=""" Outliers pose a challenge for gradient boosting methods since boosting constructs each tree based on the errors of the previous trees.
            Outliers, having significantly larger errors than non-outliers, can excessively divert the model's attention toward these data points.
            """,
        x="Cadastral income (EUR)",
        y="Living area (m2)",
        caption="https://www.immoweb.be/",
    )
    + theme(
        plot_subtitle=element_text(
            size=12, face="italic"
        ),  # Customize subtitle appearance
        plot_title=element_text(size=15, face="bold"),  # Customize title appearance
    )
    + ggsize(800, 600)
)
For identifying potential outliers within our data, we can employ Scikit-learn’s LocalOutlierFactor. This algorithm, known as the Local Outlier Factor (LOF), is an unsupervised technique for anomaly detection. It assesses the local density deviation of a data point relative to its neighboring points. LOF identifies outliers as those data points demonstrating notably lower density in comparison to their neighbors.
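To give a feel for what "lower density than its neighbors" means, here is a simplified pure-Python sketch of the LOF score. The helper `lof_scores` is hypothetical: it ignores ties in the neighbor distances and will fail on duplicate points, so it is an illustration of the idea rather than a replacement for scikit-learn's LocalOutlierFactor. The points are toy coordinates.

```python
import math
from typing import List, Sequence, Tuple


def lof_scores(points: Sequence[Tuple[float, float]], k: int = 2) -> List[float]:
    """Simplified Local Outlier Factor: scores near 1 are inliers, larger is more anomalous."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of every point (excluding the point itself)
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(i: int, j: int) -> float:
        # Reachability distance of i with respect to j
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]

    # LOF: average density of the neighbours divided by the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four points forming a tight square plus one point far away
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(toy_points, k=2)
print([round(s, 2) for s in scores])
```

The four clustered points score 1, while the isolated point gets a score far above 1 because its neighbors sit in a much denser region than it does.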
The identify_outliers function below generates a Boolean mask that we can use to keep the non-outliers and filter out data points flagged as potential outliers.
def identify_outliers(df: pd.DataFrame) -> pd.Series:
    """
    Identify outliers in a DataFrame.

    This function uses a Local Outlier Factor (LOF) algorithm to identify outliers
    in a given DataFrame. It operates on both numerical and categorical features,
    and it returns a Boolean Series where `True` represents a non-outlier (a row
    to keep) and `False` represents an outlier.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing features for outlier
      identification.

    Returns:
    - pd.Series: A Boolean Series indicating non-outliers (True) and outliers (False).

    Example:
        # Load your DataFrame with features (df)
        df = load_data()

        # Identify outliers using the function
        outlier_mask = identify_outliers(df)

        # Use the outlier mask to filter your DataFrame
        filtered_df = df[outlier_mask]  # Keep non-outliers

    Notes:
    - The function uses Local Outlier Factor (LOF) with default parameters for
      identifying outliers.
    - Numerical features are imputed using median values, and categorical features
      are one-hot encoded and imputed with median values.
    - The resulting Boolean Series is `True` for non-outliers and `False` for outliers.
    """
    # Extract numerical and categorical feature names
    NUMERICAL_FEATURES = df.select_dtypes("number").columns.tolist()
    CATEGORICAL_FEATURES = df.select_dtypes("object").columns.tolist()

    # Define transformers for preprocessing
    numeric_transformer = pipeline.Pipeline(
        steps=[("imputer", impute.SimpleImputer(strategy="median"))]
    )
    categorical_transformer = pipeline.Pipeline(
        steps=[
            ("encoder", preprocessing.OneHotEncoder(handle_unknown="ignore")),
            ("imputer", impute.SimpleImputer(strategy="median")),
        ]
    )

    # Create a ColumnTransformer to handle both numerical and categorical features
    preprocessor = compose.ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUMERICAL_FEATURES),
            ("cat", categorical_transformer, CATEGORICAL_FEATURES),
        ]
    )

    # Initialize the LocalOutlierFactor model
    clf = neighbors.LocalOutlierFactor()

    # Fit LOF to preprocessed data and make predictions (-1 marks an outlier)
    y_pred = clf.fit_predict(preprocessor.fit_transform(df))

    # Adjust LOF predictions: 1 for outliers, 0 for non-outliers
    y_pred_adjusted = [1 if x == -1 else 0 for x in y_pred]

    # The mask is True for the rows we keep (non-outliers)
    outlier_mask = pd.Series(y_pred_adjusted) == 0

    return outlier_mask
As a comparison, here’s the scatter plot after removing outliers. It appears that the LocalOutlierFactor method was effective in addressing the outlier data points.
outlier_mask = pre_process.identify_outliers(X)

X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)

(
    pd.concat([X_wo_outliers, y_wo_outliers], axis=1)
    # .loc[lambda df: pre_process.identify_outliers(df.loc[:, :"living_area"])]
    .pipe(
        lambda df: ggplot(
            df, aes("cadastral_income", "living_area", fill="price", size="price")
        )
        + geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)
        + scale_fill_continuous(low="#1a9641", high="#d7191c")
        + labs(
            title="Assessing Potential Outliers",
            subtitle=""" By employing the default parameters of LocalOutlierFactor, we've reduced our training set from 3660 instances to 3427.
            This is expected to enhance our model's performance and its ability to generalize well to new data.
            """,
            x="Cadastral income (EUR)",
            y="Living area (m2)",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 600)
    )
)
Now, let’s assess whether our efforts to improve the model by addressing outliers have enhanced its predictive capabilities:
By removing the outliers, our cross-validation RMSE score decreased from 0.1125 to 0.1105.
Feature Engineering
Feature engineering is vital in machine learning as it directly influences a model’s performance and predictive capabilities. By crafting and selecting pertinent features, it allows the model to capture meaningful patterns and relationships within the data. Effective feature engineering helps improve model accuracy, enhances its ability to generalize to new data, and enables the extraction of valuable insights, ultimately driving the success and efficacy of machine learning algorithms.
Feature Engineering ideas we will test in this section:
- Utilize categorical columns for grouping and transform each numerical variable based on the median.
- Generate bins from the continuous variables and apply the same process as described above.
- Introduce polynomial features, either individually with a single feature or in combinations of two features.
- Form clusters of instances using k-means clustering to capture data similarities and use these clusters as additional features.
- Implement other ideas derived from empirical observations or assumptions.
Utilize categorical columns for grouping and transform each numerical variable based on the median
The idea behind this feature engineering step is to leverage categorical columns as grouping criteria and then calculate the median value for each numerical variable within each group. By doing so, it aims to create new features that capture the central tendency of the numerical data for different categories, allowing the model to better understand and utilize the inherent patterns and variations within the data.
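As a minimal sketch of the idea, the pure-Python snippet below attaches the per-city median of cadastral_income to each row. The cities and numbers are toy values, not taken from the dataset. The function in this post achieves the same effect with a pandas groupby followed by transform.

```python
from collections import defaultdict
from statistics import median

# Toy rows standing in for the real DataFrame (values are made up)
rows = [
    {"city": "Ghent", "cadastral_income": 900},
    {"city": "Ghent", "cadastral_income": 1100},
    {"city": "Ghent", "cadastral_income": 1500},
    {"city": "Bruges", "cadastral_income": 700},
    {"city": "Bruges", "cadastral_income": 1300},
]

# Collect the numerical values belonging to each category
groups = defaultdict(list)
for row in rows:
    groups[row["city"]].append(row["cadastral_income"])

# One median per category
group_median = {city: median(values) for city, values in groups.items()}

# Broadcast the group median back onto every row as a new feature
for row in rows:
    row["new_column"] = group_median[row["city"]]

print(group_median)
```

Every row of the same city receives the same value, so the new feature tells the model how a property's city compares with the others on that numerical variable.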
# Number of unique categories per categorical variables:
X_wo_outliers.select_dtypes("object").nunique()
state 9
kitchen_type 9
street 456
building_condition 7
city 230
dtype: int64
def FE_categorical_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform categorical features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming categorical features
    using CatBoost Cross-Validation. It calculates the mean and standard deviation
    of Out-Of-Fold (OOF) Root Mean Squared Error (RMSE) scores for various
    combinations of categorical and numerical features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both categorical and
      numerical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The transformation type, such as "mean" or
      any other aggregation accepted by pandas groupby transform. Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "categorical,"
      and "numerical," sorted by "mean_OOFs" in ascending order.

    Example:
        # Load your DataFrame with features (X) and target variable (y)
        X, y = load_data()

        # Perform feature engineering
        result_df = FE_categorical_transform(X, y, transform_type="mean")

        # View the DataFrame with sorted results
        print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of
      transformations for various combinations of categorical and numerical features.
    - The resulting DataFrame provides insights into the effectiveness of different
      transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of categorical and numerical columns
    categorical_columns = X.select_dtypes("object").columns
    numerical_columns = X.select_dtypes("number").columns

    # Loop over every categorical/numerical pair
    for categorical in tqdm(categorical_columns, desc="Progress"):
        for numerical in tqdm(numerical_columns):
            # Create a deep copy of the input data
            temp = X.copy(deep=True)

            # Calculate the transformation for each group within the categorical column
            temp["new_column"] = temp.groupby(categorical)[numerical].transform(
                transform_type
            )

            # Run CatBoost Cross-Validation with the transformed data
            mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

            # Store the results as a tuple
            result = (mean_OOF, std_OOF, categorical, numerical)
            results.append(result)

            del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results, columns=["mean_OOFs", "std_OOFs", "categorical", "numerical"]
    )
    result_df = result_df.sort_values(by="mean_OOFs")

    return result_df
Note
Please, bear in mind that these feature engineering steps were precomputed due to the considerable computational time required. The outcomes were saved rather than executed during the notebook rendering to save time. However, it’s important to note that the results should remain unchanged.
As evident, the best result was obtained by treating the city feature as a categorical variable and calculating the median of cadastral_income based on this categorization. This result aligns logically with the feature importances seen in Part 5.
Generate bins from the continuous variables and apply the same process as described above
The idea behind this feature engineering step is to discretize continuous variables by creating bins or categories from their values. These bins then serve as categorical columns. By using these new categorical columns for grouping, we can transform each numerical variable by replacing its values with the median of the respective category it belongs to, just like the feature engineering method we demonstrated above.
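The number of bins in this experiment is chosen with Sturges' rule, floor(log2(n)) + 1. The sketch below works through that arithmetic for the 3427 rows left after outlier removal and shows a simple rank-based way to assign ordinal bins of roughly equal size. The helpers `sturges_bins` and `quantile_bin` are hypothetical illustrations, and the rank-based binning is only a rough stand-in for scikit-learn's KBinsDiscretizer, which the function in this post actually uses and which handles ties and bin edges differently.

```python
import math
from typing import List, Sequence


def sturges_bins(n_samples: int) -> int:
    """Number of bins suggested by Sturges' rule: floor(log2(n)) + 1."""
    return int(math.floor(math.log2(n_samples)) + 1)


def quantile_bin(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value an ordinal bin so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# 2**11 = 2048 <= 3427 < 4096 = 2**12, so floor(log2(3427)) = 11 and the rule gives 12
n_bins = sturges_bins(3427)

# Toy living areas (made-up values) split into 4 ordinal bins
toy_bins = quantile_bin([50, 300, 120, 90, 210, 75, 160, 400], n_bins=4)
print(n_bins, toy_bins)
```

Once a continuous column has been replaced by its bin label, it can be used as a grouping key exactly like a categorical column.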
def FE_continuous_transform(
    X: pd.DataFrame, y: pd.Series, transform_type: str = "mean"
) -> pd.DataFrame:
    """
    Feature Engineering: Transform continuous features using CatBoost Cross-Validation.

    This function performs feature engineering by transforming continuous features
    using CatBoost Cross-Validation. It calculates the mean and standard deviation
    of Out-Of-Fold (OOF) Root Mean Squared Error (RMSE) scores for various
    combinations of discretized and transformed continuous features.

    Parameters:
    - X (pd.DataFrame): The input DataFrame containing both continuous and
      categorical features.
    - y (pd.Series): The target variable for prediction.
    - transform_type (str, optional): The transformation type, such as "mean" or
      any other aggregation accepted by pandas groupby transform. Defaults to "mean".

    Returns:
    - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs,"
      "discretized_continuous," and "transformed_continuous," sorted by "mean_OOFs"
      in ascending order.

    Example:
        # Load your DataFrame with features (X) and target variable (y)
        X, y = load_data()

        # Perform feature engineering
        result_df = FE_continuous_transform(X, y, transform_type="mean")

        # View the DataFrame with sorted results
        print(result_df.head())

    Notes:
    - This function uses CatBoost Cross-Validation to assess the quality of
      transformations for various combinations of discretized and transformed
      continuous features.
    - The number of bins for discretization is determined using Sturges' rule.
    - The resulting DataFrame provides insights into the effectiveness of different
      transformations.
    - Feature engineering can help improve the performance of machine learning models.
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    continuous_columns = X.select_dtypes("number").columns

    # Sturges' rule for the number of bins
    optimal_bins = int(np.floor(np.log2(X.shape[0])) + 1)

    # Loop over every ordered pair of distinct continuous columns
    for discretized_continuous in tqdm(continuous_columns, desc="Progress:"):
        for transformed_continuous in tqdm(continuous_columns):
            if discretized_continuous != transformed_continuous:
                # Create a deep copy of the input data
                temp = X.copy(deep=True)

                discretizer = pipeline.Pipeline(
                    steps=[
                        ("imputer", impute.SimpleImputer(strategy="median")),
                        (
                            "add_bins",
                            preprocessing.KBinsDiscretizer(
                                encode="ordinal", n_bins=optimal_bins
                            ),
                        ),
                    ]
                )

                temp[discretized_continuous] = discretizer.fit_transform(
                    X[[discretized_continuous]]
                )

                # Calculate the transformation for each group within the binned column
                temp["new_column"] = temp.groupby(discretized_continuous)[
                    transformed_continuous
                ].transform(transform_type)

                # Run CatBoost Cross-Validation with the transformed data
                mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)

                # Store the results as a tuple
                result = (
                    mean_OOF,
                    std_OOF,
                    discretized_continuous,
                    transformed_continuous,
                )
                results.append(result)

                del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "discretized_continuous",
            "transformed_continuous",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")

    return result_df
This approach was not as effective as our prior method. However, combining bathrooms with yearly_theoretical_total_energy_consumption yielded the best outcome.
Introduce polynomial features
The idea behind introducing polynomial features is to capture non-linear relationships within the data. By raising individual features to higher powers or considering interactions between pairs of features, this step allows the model to better represent complex patterns that cannot be adequately expressed with linear relationships alone. It enhances the model’s ability to learn and predict outcomes that exhibit curvilinear or interactive behavior.
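For intuition, a degree-2 expansion of two features a and b produces the terms 1, a, b, a², ab and b². The sketch below enumerates these terms in pure Python; the helper `polynomial_terms` is hypothetical and the input numbers are toy values. The function in this post delegates the expansion to scikit-learn's PolynomialFeatures.

```python
from itertools import combinations_with_replacement
from math import prod
from typing import List, Sequence


def polynomial_terms(values: Sequence[float], degree: int = 2) -> List[float]:
    """Return all monomials of the inputs up to the given degree, bias term first."""
    terms = []
    for d in range(degree + 1):
        # Each combination picks d inputs (with repetition) to multiply together;
        # the empty combination (d = 0) yields the constant term 1.
        for combo in combinations_with_replacement(values, d):
            terms.append(prod(combo))
    return terms


# Two features: [1, a, b, a^2, a*b, b^2] with a = 2 and b = 3
pair_terms = polynomial_terms([2, 3])

# A single feature: [1, a, a^2] with a = 3
single_terms = polynomial_terms([3])

print(pair_terms, single_terms)
```

With one input column the expansion only adds the squared term, while with two columns it also adds their product, which is the interaction the model could not otherwise see as a single feature.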
def FE_polynomial_features(
    X: pd.DataFrame, y: pd.Series, combinations: int = 1
) -> pd.DataFrame:
    """
    Generate polynomial features for combinations of numerical columns and train a
    CatBoost model.

    Parameters:
        X (pd.DataFrame): The input DataFrame with features.
        y (pd.Series): The target variable.
        combinations (int, optional): The number of combinations of numerical
            columns. Default is 1.

    Returns:
        pd.DataFrame: A DataFrame containing results sorted by mean OOF scores.

    Example:
        X_wo_outliers = pd.DataFrame(...)  # Your input data
        y_wo_outliers = pd.Series(...)  # Your target variable
        result = FE_polynomial_features(X_wo_outliers, y_wo_outliers)

    Transformations:
    - Imputes missing values in numerical columns using the median.
    - Generates polynomial features, including interaction terms, for selected
      numerical columns.
    - Trains a CatBoost model and calculates mean and standard deviation of
      out-of-fold (OOF) scores.
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Loop over every combination of numerical columns of the requested size
    for numerical_col in tqdm(
        list(itertools.combinations(numerical_columns, r=combinations))
    ):
        polyfeatures = compose.make_column_transformer(
            (
                pipeline.make_pipeline(
                    impute.SimpleImputer(strategy="median"),
                    preprocessing.PolynomialFeatures(interaction_only=False),
                ),
                list(numerical_col),
            ),
            remainder="passthrough",
        ).set_output(transform="pandas")

        temp = polyfeatures.fit_transform(X)

        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            numerical_col,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "numerical_col",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")

    return result_df
n=1
Let’s see the impact of applying polynomial feature engineering to a single feature.
Form clusters of instances using k-means clustering
The idea behind using k-means clustering in feature engineering is to group data points into clusters based on their similarity. By doing so, we create a new set of features that represent these clusters, which can capture patterns or relationships within the data that might be less apparent in the original features. These cluster features can be valuable for machine learning models, as they provide a more compact and informative representation of the data, potentially improving predictive performance.
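The sketch below shows the core loop of k-means (Lloyd's algorithm) on a single toy variable and how the resulting cluster label becomes a new feature. The helper `kmeans_1d`, the starting centers and the values are all hypothetical and chosen for illustration; the function in this post uses scikit-learn's KMeans on the full preprocessed feature matrix.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(
    values: Sequence[float], centers: Sequence[float], n_iter: int = 10
) -> Tuple[List[int], List[float]]:
    """Run a fixed number of Lloyd iterations on one-dimensional data."""
    centers = list(centers)

    def assign(current_centers: List[float]) -> List[int]:
        # Each value is assigned to the index of its closest center
        return [
            min(range(len(current_centers)), key=lambda c: abs(v - current_centers[c]))
            for v in values
        ]

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = []
        for c in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == c]
            # Keep the old center if a cluster ends up empty
            new_centers.append(sum(members) / len(members) if members else centers[c])
        centers = new_centers

    return assign(centers), centers


# Toy values forming two obvious groups
toy_values = [1, 2, 3, 10, 11, 12]
cluster_labels, cluster_centers = kmeans_1d(toy_values, centers=[0, 5])

# The cluster label can be appended to each row as an extra feature
rows_with_cluster = list(zip(toy_values, cluster_labels))
print(cluster_labels, cluster_centers)
```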
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from
    sklearn.base. It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with
            selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
def FE_KMeans(
    X: pd.DataFrame,
    y: pd.Series,
    n_clusters_min: int = 1,
    n_clusters_max: int = 8,
) -> pd.DataFrame:
    """Performs K-Means clustering-based feature engineering followed by model training.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.
        n_clusters_min (int, optional): The minimum number of clusters to consider.
            Defaults to 1.
        n_clusters_max (int, optional): The maximum number of clusters to consider.
            Defaults to 8.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering
            with K-Means clustering.

    Example:
        >>> results_df = FE_KMeans(X_wo_outliers, y_wo_outliers)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.head().select_dtypes("number").columns.to_list()
    categorical_columns = X.head().select_dtypes("object").columns.to_list()

    # Note: range() excludes n_clusters_max itself
    for n_cluster in tqdm(range(n_clusters_min, n_clusters_max)):
        # Prepare pipelines for corresponding columns:
        numerical_pipeline = pipeline.Pipeline(
            steps=[
                ("num_selector", FeatureSelector(numerical_columns)),
                ("imputer", impute.SimpleImputer(strategy="median")),
            ]
        )

        categorical_pipeline = pipeline.Pipeline(
            steps=[
                ("cat_selector", FeatureSelector(categorical_columns)),
                ("imputer", impute.SimpleImputer(strategy="most_frequent")),
                (
                    "onehot",
                    preprocessing.OneHotEncoder(
                        handle_unknown="ignore", sparse_output=False
                    ),
                ),
            ]
        )

        # Put all the pipelines inside a FeatureUnion:
        data_preprocessing_pipeline = pipeline.FeatureUnion(
            n_jobs=-1,
            transformer_list=[
                ("numerical_pipeline", numerical_pipeline),
                ("categorical_pipeline", categorical_pipeline),
            ],
        )

        temp = pd.DataFrame(data_preprocessing_pipeline.fit_transform(X))

        KMeans = cluster.KMeans(n_init=10, n_clusters=n_cluster)
        KMeans.fit_transform(temp)

        groups = pd.Series(KMeans.labels_, name="groups")

        concatanated_df = pd.concat([temp, groups], axis="columns")

        mean_OOF, std_OOF = train_model.run_catboost_CV(X=concatanated_df, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            n_cluster,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF, KMeans, groups, concatanated_df, result

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "n_cluster",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")

    return result_df
As k-means clustering is an unsupervised algorithm, determining the appropriate k-values requires testing various values to assess their impact on our validation scores. As observed, this approach didn’t yield significant results in our case, as the best validation score was obtained when n=1.
Implement other ideas derived from empirical observations or assumptions
Though new features can be generated through systematic methods, domain knowledge can also inspire their creation. The idea behind this is to allow for the incorporation of unconventional or domain-specific insights that may not fit standard feature engineering techniques. It encourages the exploration of novel features or transformations based on practical experiences or theoretical assumptions to potentially uncover hidden patterns or relationships within the data. This open-ended approach can lead to creative and tailored feature engineering solutions.
Here are some ideas to consider:
Geospatial Features:
Create clusters or neighborhoods based on features to capture similarities.
Area-related Features:
Calculate the ratio of “living_area” to “surface_of_the_plot” to get an idea of the density or spaciousness of the property.
Energy Efficiency Features:
Compute the energy efficiency ratio by dividing “yearly_theoretical_total_energy_consumption” by “primary_energy_consumption.”
Compute energy efficiency by dividing primary_energy_consumption by living_area.
Toilet and Bathroom Features:
Combine “toilets” and “bathrooms” into a single “total_bathrooms” feature to simplify the model.
Calculate total number of rooms by adding up bedrooms + toilets + bathrooms
Taxation Features:
Incorporate “cadastral_income” as a measure of property value for taxation. You can create bins or categories for this variable.
Value for Money:
Divide cadastral_income by bedrooms to see if the property is a good bargain.
Similarly, divide cadastral_income by living_area.
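Several of these ideas are plain ratios, so the denominator deserves a moment of thought: a listing with zero bedrooms or a missing living area would otherwise produce an infinite or undefined value. The sketch below is one cautious way to compute such ratios on a single record. The helper `safe_ratio` and the listing values are hypothetical and only illustrate the arithmetic; the function in this post computes the ratios directly with pandas column operations.

```python
import math
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> float:
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return math.nan
    return numerator / denominator


# Hypothetical listing used only to illustrate the derived features
listing = {
    "cadastral_income": 1200.0,
    "bedrooms": 3,
    "living_area": 150.0,
    "toilets": 2,
    "bathrooms": 1,
}

features = {
    "total_bathrooms": listing["toilets"] + listing["bathrooms"],
    "bargain_1": safe_ratio(listing["cadastral_income"], listing["bedrooms"]),
    "bargain_2": safe_ratio(listing["cadastral_income"], listing["living_area"]),
}

# A studio with zero bedrooms yields NaN instead of raising or returning infinity
studio_bargain = safe_ratio(900.0, 0)
print(features, studio_bargain)
```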
def FE_ideas(X):
    """Performs additional feature engineering on the input DataFrame.

    Args:
        X (pd.DataFrame): The input DataFrame containing the original features.

    Returns:
        pd.DataFrame: A DataFrame with additional engineered features.

    Example:
        >>> engineered_data = FE_ideas(original_data)
    """
    temp = X.assign(
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )

    # Return only the newly engineered columns
    return temp.loc[:, "energy_efficiency_1":]
def FE_try_ideas(
    X: pd.DataFrame,
    y: pd.Series,
) -> pd.DataFrame:
    """Performs feature engineering experiments by adding new features and
    evaluating their impact on model performance.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering
            experiments.

    Example:
        >>> results_df = FE_try_ideas(X, y)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Apply additional feature engineering ideas
    feature_df = FE_ideas(X)

    for feature in tqdm(feature_df.columns):
        # Concatenate the original features with the newly engineered feature
        temp = pd.concat([X, feature_df[feature]], axis="columns")

        # Train the model with the augmented features and get the mean and
        # standard deviation of OOF scores
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            feature,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "feature",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")

    return result_df
The initial model achieved a best mean out-of-fold (OOF) score of 0.1107. To speed up training, we then reduced the number of iterations to 100 and raised the learning rate to 0.2; after outlier removal, this new baseline scores 0.1105. It serves as our reference point for assessing the impact of each feature engineering technique.
Subsequent feature engineering approaches, including using categorical columns for groupby/transform, creating bins from continuous data, and implementing other ideas, led to marginal improvements, with the best (lowest) score at 0.1089. Polynomial features with n=2 and n=1 scored slightly worse, at 0.1094 and 0.1099, respectively.
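Since the target is log10(price), these RMSE values live on a log scale, which makes them hard to read at a glance. As a rough aid, an error of `s` in log10 units corresponds to a multiplicative factor of `10**s` on the price itself. The sketch below does that conversion for two of the scores quoted above. Treating the RMSE as a "typical" multiplicative error is a simplification, and `rmse_log10_to_factor` is a hypothetical helper name.

```python
def rmse_log10_to_factor(rmse_log10: float) -> float:
    """Convert an RMSE measured on log10(price) into a multiplicative factor."""
    return 10**rmse_log10


# Scores taken from the text above
scores = {
    "baseline after outlier removal": 0.1105,
    "best groupby/transform feature": 0.1089,
}

factors = {label: rmse_log10_to_factor(s) for label, s in scores.items()}
for label, factor in factors.items():
    print(f"{label}: x{factor:.3f} (about {100 * (factor - 1):.1f}% on the price scale)")
```

Both scores translate to an error of a little under 30% on the price scale, and the gap between them is well below one percentage point. This is consistent with describing the gains as marginal.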
Now, we will proceed to assess the efficacy of two of the best approaches, namely: utilizing categorical columns for groupby and transformation and implementing additional ideas. We will conduct this evaluation using CatBoost’s built-in select_features as outlined in part 5. Let’s dive in…
| Condition | Best mean OOFs | std OOFs |
|---|---|---|
| Original | 0.1107 | NA |
| Use categorical columns for groupby/transform | 0.1089 | 0.0062 |
| Create bins from continuous data and use groupby/transform | | |
Based on our evaluation, it is recommended to retain the following features:
‘city_group’
‘building_condition_group’
‘energy_efficiency_1’
‘energy_efficiency_2’
‘bargain_1’
‘bargain_2’
However, we should remove the following two features since our analysis indicates that better features have been incorporated:
‘kitchen_type’
‘toilets’
These feature selections should help optimize our model even further.
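To make the two group features concrete, here is a minimal sketch of the groupby/transform pattern behind `city_group`: every row receives the median `cadastral_income` of its own city. The cities and values below are invented for illustration.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame(
    {
        "city": ["Gent", "Gent", "Gent", "Leuven", "Leuven"],
        "cadastral_income": [900.0, 1100.0, 1500.0, 700.0, 1300.0],
    }
)

# transform() returns a result aligned with the original rows,
# so it can be assigned directly as a new column
toy = toy.assign(
    city_group=lambda df: df.groupby("city")["cadastral_income"].transform("median")
)
print(toy)
```

Unlike `agg`, which would return one row per city, `transform` keeps the original shape, so no merge back onto the DataFrame is needed.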
Based on these insights, we crafted the prepare_data_for_modelling function, which is stored in the pre_process.py file. This function includes the feature engineering steps we discussed, setting the stage for effective modeling performance.
def prepare_data_for_modelling(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by
    performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Adds the engineered features (city_group, building_condition_group,
       energy_efficiency_1, energy_efficiency_2, bargain_1, bargain_2).
    4. Fills missing values in categorical variables with 'missing value'.
    5. Separates the features (X) and the target (y).
    6. Identifies and filters out outlier values based on LocalOutlierFactor.

    Args:
        df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
        Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared
        features (X) and the target (y).

    Example use case:
        # Load your dataset into a DataFrame (e.g., df)
        df = load_data()

        # Prepare the data for modeling
        X, y = prepare_data_for_modelling(df)

        # Now you can use X and y for machine learning tasks.
    """
    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(
            price=lambda df: np.log10(df.price),
            city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
                "median"
            ),
            building_condition_group=lambda df: df.groupby("building_condition")[
                "yearly_theoretical_total_energy_consumption"
            ].transform("median"),
            energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
            / df.primary_energy_consumption,
            energy_efficiency_2=lambda df: df.primary_energy_consumption
            / df.living_area,
            bargain_1=lambda df: df.cadastral_income / df.bedrooms,
            bargain_2=lambda df: df.cadastral_income / df.living_area,
        )
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v2]
    y = processed_df[utils.Configuration.target_col]

    # Keep only the rows that LocalOutlierFactor does not flag as outliers
    outlier_mask = pre_process.identify_outliers(X)
    X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
    y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)

    print(f"Shape of X and y with outliers: {X.shape}, {y.shape}")
    print(
        f"Shape of X and y without outliers: {X_wo_outliers.shape}, {y_wo_outliers.shape}"
    )

    return X_wo_outliers, y_wo_outliers
X, y = pre_process.prepare_data_for_modelling(df)
Shape of X and y with outliers: (3660, 14), (3660,)
Shape of X and y without outliers: (3427, 14), (3427,)
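The two shapes show that 233 of 3,660 rows (about 6.4%) were dropped by the outlier filter. The sketch below illustrates the underlying mechanism on synthetic data, assuming scikit-learn's `LocalOutlierFactor` with default parameters as mentioned in the docstring: `fit_predict` labels inliers as 1 and outliers as -1, and the boolean mask keeps the inliers. The toy values are invented and this is not the project's `identify_outliers` function.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic, made-up data: 200 plausible rows plus one implausible row
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "cadastral_income": rng.normal(1000, 100, size=200),
        "living_area": rng.normal(150, 20, size=200),
    }
)
toy.loc[len(toy)] = [320_000.0, 300.0]  # extreme cadastral income

# fit_predict returns 1 for inliers and -1 for outliers
labels = LocalOutlierFactor().fit_predict(toy)
keep_mask = pd.Series(labels == 1, index=toy.index)

toy_clean = toy.loc[keep_mask].reset_index(drop=True)
removed_share = 1 - len(toy_clean) / len(toy)
print(f"Rows before: {len(toy)}, after: {len(toy_clean)}, removed: {removed_share:.1%}")
```

LOF is distance-based, so features on very different scales can dominate the neighbourhood computation. Whether to scale the inputs first is a design choice this sketch leaves out.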
In this notebook, we began with an initial selection of 16 features based on the work in Part 5. As we conclude this article, we've found that we can streamline the feature set further: thanks to the newly added features, removing kitchen_type and toilets improves performance. While there's potential for further optimization, such as dimensionality reduction, we are currently happy with our progress. In the next, and final, part, we will focus on fine-tuning our model for optimal predictive performance. See you there!
Source Code
---title: 'Predicting Belgian Real Estate Prices: Part 6: Feature Engineering'author: Adam Cseresznyedate: '2023-11-08'categories: - Predicting Belgian Real Estate Pricesjupyter: python3toc: trueformat: html: code-fold: true code-tools: true---![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}In Part 5, we looked at the significance of features in the initial scraped dataset using both the `feature_importances_` method of CatBoostRegressor and SHAP values. We conducted feature elimination based on their importance and predictive capability. In this upcoming phase, we'll implement a robust cross-validation strategy to accurately and consistently evaluate our model's performance across multiple folds of the dataWe will also i Identing and addreng potential outliers within our datas, whichet is crucial to prevent their undue influence on the model's predictions. Additionally, we'll further refine and expand our feature engineering efforts by exploring new methodologies to create informative features that bolster our model's predictive capabilities. 
Looking forward to these pivotal steps!::: {.callout-note}You can access the project's app through its [Streamlit website](https://belgian-house-price-predictor.streamlit.app/).:::# Import data```{python}#| editable: true#| slideshow: {slide_type: ''}#| tags: []import gcimport itertoolsfrom pathlib import Pathfrom typing import List, Optional, Tupleimport catboostimport matplotlib.pyplot as pltimport numpy as npimport pandas as pdimport shapfrom data import pre_process, utilsfrom features import feature_engineeringfrom IPython.display import clear_outputfrom lets_plot import*from lets_plot.mapping import as_discretefrom models import train_modelfrom sklearn import ( cluster, compose, ensemble, impute, metrics, model_selection, neighbors, pipeline, preprocessing,)from sklearn.base import BaseEstimator, TransformerMixinfrom tqdm.notebook import tqdmLetsPlot.setup_html()```# Prepare dataframe before modelling## Read in dataframeDrawing from our findings in notebook 4, particularly with regards to our initial feature reduction efforts, we've developed a function named "prepare_data_for_modelling." This function resides in the pre_process.py file, ensuring its reusability. The function performs essential data preprocessing steps, which include:1. Randomly shuffling the rows in the DataFrame.2. Transforming the 'price' column by taking the base 10 logarithm.3. Handling missing values in categorical variables by replacing them with 'missing value.'4. Separating the dataset into features (X) and the target variable (y).Let's dive into the details of this function and prepare our X and y for the subsequent processing pipeline.```{python}df = pd.read_parquet( utils.Configuration.INTERIM_DATA_PATH.joinpath("2023-10-01_Processed_dataset_for_NB_use.parquet.gzip" ))``````{python}def prepare_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:""" Prepare data for machine learning modeling. 
This function takes a DataFrame and prepares it for machine learning by performing the following steps: 1. Randomly shuffles the rows of the DataFrame. 2. Converts the 'price' column to the base 10 logarithm. 3. Fills missing values in categorical variables with 'missing value'. 4. Separates the features (X) and the target (y). Parameters: - df (pd.DataFrame): The input DataFrame containing the dataset. Returns: - Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y). Example use case: # Load your dataset into a DataFrame (e.g., df) df = load_data() # Prepare the data for modeling X, y = prepare_data_for_modelling(df) # Now you can use X and y for machine learning tasks. """ processed_df = ( df.sample(frac=1, random_state=utils.Configuration.seed) .reset_index(drop=True) .assign(price=lambda df: np.log10(df.price)) )# Fill missing categorical variables with "missing value"for col in processed_df.columns:if processed_df[col].dtype.name in ("bool", "object", "category"): processed_df[col] = processed_df[col].fillna("missing value")# Separate features (X) and target (y) X = processed_df.loc[:, utils.Configuration.features_to_keep_v1] y = processed_df[utils.Configuration.target_col]print(f"Shape of X and y: {X.shape}, {y.shape}")return X, y``````{python}X, y = prepare_data(df)```# Cross-validation strategyOur next critical step is to establish a well-structured cross-validation strategy. This step is imperative as it enables us to assess the effectiveness of various feature engineering approaches without risking overfitting our model. A robust cross-validation strategy ensures that our model's performance evaluations are reliable and that the insights gained are generalizable to new data. 
To accomplish this, we will employ RepeatedKFold validation, setting the parameters with n_splits as 10 and n_repeats as 1.In essence, this configuration signifies that we will perform a 10-fold cross-validation, and this entire process will be repeated once. Importantly, due to the modular nature of this function, we retain the flexibility to easily adapt and alter the design of our cross-validation strategy as needed.```{python}def run_catboost_CV( X: pd.DataFrame, y: pd.Series, n_splits: int=10, n_repeats: int=1, pipeline: Optional[object] =None,) -> Tuple[float, float]:""" Perform Cross-Validation with CatBoost for regression. This function conducts Cross-Validation using CatBoost for regression tasks. It iterates through folds, trains CatBoost models, and computes the mean and standard deviation of the Root Mean Squared Error (RMSE) scores across folds. Parameters: - X (pd.DataFrame): The feature matrix. - y (pd.Series): The target variable. - n_splits (int, optional): The number of splits in K-Fold cross-validation. Defaults to 2. - n_repeats (int, optional): The number of times the K-Fold cross-validation is repeated. Defaults to 1. - pipeline (object, optional): Optional data preprocessing pipeline. If provided, it's applied to the data before training the model. Defaults to None. Returns: - Tuple[float, float]: A tuple containing the mean RMSE and standard deviation of RMSE scores across cross-validation folds. Example: # Load your feature matrix (X) and target variable (y) X, y = load_data() # Perform Cross-Validation with CatBoost mean_rmse, std_rmse = run_catboost_CV(X, y, n_splits=5, n_repeats=2, pipeline=data_pipeline) print(f"Mean RMSE: {mean_rmse:.4f}") print(f"Standard Deviation of RMSE: {std_rmse:.4f}") Notes: - Ensure that the input data `X` and `y` are properly preprocessed and do not contain any missing values. - The function uses CatBoost for regression with optional data preprocessing via the `pipeline`. 
- RMSE is a common metric for regression tasks, and lower values indicate better model performance. """ results = []# Extract feature names and data types features = X.columns[~X.columns.str.contains("price")] numerical_features = X.select_dtypes("number").columns.to_list() categorical_features = X.select_dtypes("object").columns.to_list()# Create a K-Fold cross-validator CV = model_selection.RepeatedKFold( n_splits=n_splits, n_repeats=n_repeats, random_state=utils.Configuration.seed )for train_fold_index, val_fold_index in tqdm(CV.split(X)): X_train_fold, X_val_fold = X.loc[train_fold_index], X.loc[val_fold_index] y_train_fold, y_val_fold = y.loc[train_fold_index], y.loc[val_fold_index]# Apply optional data preprocessing pipelineif pipeline isnotNone: X_train_fold = pipeline.fit_transform(X_train_fold) X_val_fold = pipeline.transform(X_val_fold)# Create CatBoost datasets catboost_train = Pool( X_train_fold, y_train_fold, cat_features=categorical_features, ) catboost_valid = Pool( X_val_fold, y_val_fold, cat_features=categorical_features, )# Initialize and train the CatBoost model model = catboost.CatBoostRegressor(**utils.Configuration.catboost_params) model.fit( catboost_train, eval_set=[catboost_valid], early_stopping_rounds=utils.Configuration.early_stopping_round, verbose=utils.Configuration.verbose, use_best_model=True, )# Calculate OOF validation predictions valid_pred = model.predict(X_val_fold) RMSE_score = metrics.mean_squared_error(y_val_fold, valid_pred, squared=False) results.append(RMSE_score)return np.mean(results), np.std(results)```Now, let's proceed to train our model with the updated settings:```{python}#| scrolled: truetrain_model.run_catboost_CV(X, y)```::: {.callout-note}Note that we've reduced the number of iterations in Notebook 6 compared to Notebook 5 to minimize the training duration.In Notebook 5: iterations = 1000, default learning rate = 0.03In Notebook 6: iterations = 100, learning rate = 0.2The performance of the baseline model: 
0.1125:::# Outlier detectionAn outlier is a data point that significantly differs from the rest of the data. One common way to define an outlier is a data point that falls more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. Detecting and removing outliers from the dataset is crucial for building a stable model that can effectively generalize to new data.When we create a scatter plot of our features (as shown in @fig-fig1), such as cadastral income against living area, and adjust the points' color and size based on price, we can identify at least two data points that notably deviate from the expected range of values. One data point suggests a 300 m2 property is associated with a cadastral income exceeding 320,000 EURO, while the other point indicates a 2,500 EUR cadastral income for an 11,000 m2 property. Both observations seem implausible when compared to the majority of data points on the graph.```{python}#| fig-cap: 'Assessing Feature Cardinality: Percentage of Unique Values per Feature'#| label: fig-fig1pd.concat([X, y], axis=1).pipe(lambda df: ggplot( df, aes("cadastral_income", "living_area", fill="price", size="price") )+ geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)+ scale_fill_continuous(low="#1a9641", high="#d7191c")+ labs( title="Assessing Potential Outliers", subtitle=""" Outliers pose a challenge for gradient boosting methods since boosting constructs each tree based on the errors of the previous trees. Outliers, having significantly larger errors than non-outliers, can excessively divert the model's attention toward these data points. 
""", x="Cadastral income (EUR)", y="Living area (m2)", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 600))```For identifying potential outliers within our data, we can employ `Scikit-learn`'s `LocalOutlierFactor`. This algorithm, known as the Local Outlier Factor (LOF), is an unsupervised technique for anomaly detection. It assesses the local density deviation of a data point relative to its neighboring points. LOF identifies outliers as those data points demonstrating notably lower density in comparison to their neighbors.In the provided code, we've created a function called `identify_outliers`. This function generates a mask that we can use to filter out data points potentially flagged as outliers.```{python}def identify_outliers(df: pd.DataFrame) -> pd.Series:""" Identify outliers in a DataFrame. This function uses a Local Outlier Factor (LOF) algorithm to identify outliers in a given DataFrame. It operates on both numerical and categorical features, and it returns a binary Series where `True` represents an outlier and `False` represents a non-outlier. Parameters: - df (pd.DataFrame): The input DataFrame containing features for outlier identification. Returns: - pd.Series: A Boolean Series indicating outliers (True) and non-outliers (False). Example: # Load your DataFrame with features (df) df = load_data() # Identify outliers using the function outlier_mask = identify_outliers(df) # Use the outlier mask to filter your DataFrame filtered_df = df[~outlier_mask] # Keep non-outliers Notes: - The function uses Local Outlier Factor (LOF) with default parameters for identifying outliers. - Numerical features are imputed using median values, and categorical features are one-hot encoded and imputed with median values. 
- The resulting Boolean Series is `True` for outliers and `False` for non-outliers. """# Extract numerical and categorical feature names NUMERICAL_FEATURES = df.select_dtypes("number").columns.tolist() CATEGORICAL_FEATURES = df.select_dtypes("object").columns.tolist()# Define transformers for preprocessing numeric_transformer = pipeline.Pipeline( steps=[("imputer", impute.SimpleImputer(strategy="median"))] ) categorical_transformer = pipeline.Pipeline( steps=[ ("encoder", preprocessing.OneHotEncoder(handle_unknown="ignore")), ("imputer", impute.SimpleImputer(strategy="median")), ] )# Create a ColumnTransformer to handle both numerical and categorical features preprocessor = compose.ColumnTransformer( transformers=[ ("num", numeric_transformer, NUMERICAL_FEATURES), ("cat", categorical_transformer, CATEGORICAL_FEATURES), ] )# Initialize the LocalOutlierFactor model clf = neighbors.LocalOutlierFactor()# Fit LOF to preprocessed data and make predictions y_pred = clf.fit_predict(preprocessor.fit_transform(df))# Adjust LOF predictions to create a binary outlier mask y_pred_adjusted = [1if x ==-1else0for x in y_pred] outlier_mask = pd.Series(y_pred_adjusted) ==0return outlier_mask```As a comparison, here's the scatter plot after removing outliers. 
It appears that the LocalOutlierFactor method was effective in addressing the outlier data points.```{python}outlier_mask = pre_process.identify_outliers(X)X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)( pd.concat([X_wo_outliers, y_wo_outliers], axis=1)# .loc[lambda df: pre_process.identify_outliers(df.loc[:, :"living_area"])] .pipe(lambda df: ggplot( df, aes("cadastral_income", "living_area", fill="price", size="price") )+ geom_point(alpha=0.5, shape=21, stroke=0.5, show_legend=False)+ scale_fill_continuous(low="#1a9641", high="#d7191c")+ labs( title="Assessing Potential Outliers", subtitle=""" By employing the default parameters of LocalOutlierFactor, we've reduced our training set from 3660 instances to 3427. This is expected to enhance our model's performance and its ability to generalize well to new data. """, x="Cadastral income (EUR)", y="Living area (m2)", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 600) ))```Now, let's assess whether our efforts to improve the model by addressing outliers have enhanced its predictive capabilities:```{python}train_model.run_catboost_CV(X_wo_outliers, y_wo_outliers)```::: {.callout-note}By removing the outliers, our cross-validation RMSE score decreased from 0.1125 to 0.1105.:::# Feature EngineeringFeature engineering is vital in machine learning as it directly influences a model's performance and predictive capabilities. By crafting and selecting pertinent features, it allows the model to capture meaningful patterns and relationships within the data. 
Effective feature engineering helps improve model accuracy, enhances its ability to generalize to new data, and enables the extraction of valuable insights, ultimately driving the success and efficacy of machine learning algorithms.Feature Engineering ideas we will test in this section:- Utilize categorical columns for grouping and transform each numerical variable based on the median.- Generate bins from the continuous variables and apply the same process as described above.- Introduce polynomial features, either individually with a single feature or in combinations of two features.- Form clusters of instances using k-means clustering to capture data similarities and use these clusters as additional features.- Implement other ideas derived from empirical observations or assumptions## Utilize categorical columns for grouping and transform each numerical variable based on the medianThe idea behind this feature engineering step is to leverage categorical columns as grouping criteria and then calculate the median value for each numerical variable within each group. By doing so, it aims to create new features that capture the central tendency of the numerical data for different categories, allowing the model to better understand and utilize the inherent patterns and variations within the data.```{python}# Number of unique categories per categorical variables:X_wo_outliers.select_dtypes("object").nunique()``````{python}def FE_categorical_transform( X: pd.DataFrame, y: pd.Series, transform_type: str="mean") -> pd.DataFrame:""" Feature Engineering: Transform categorical features using CatBoost Cross-Validation. This function performs feature engineering by transforming categorical features using CatBoost Cross-Validation. It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean Squared Error (RMSE) scores for various combinations of categorical and numerical features. 
Parameters: - X (pd.DataFrame): The input DataFrame containing both categorical and numerical features. - transform_type (str, optional): The transformation type, such as "mean" or other valid CatBoost transformations. Defaults to "mean". Returns: - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "categorical," and "numerical," sorted by "mean_OOFs" in ascending order. Example: # Load your DataFrame with features (X) X = load_data() # Perform feature engineering result_df = FE_categorical_transform(X, transform_type="mean") # View the DataFrame with sorted results print(result_df.head()) Notes: - This function uses CatBoost Cross-Validation to assess the quality of transformations for various combinations of categorical and numerical features. - The resulting DataFrame provides insights into the effectiveness of different transformations. - Feature engineering can help improve the performance of machine learning models. """# Initialize a list to store results results = []# Get a list of categorical and numerical columns categorical_columns = X.select_dtypes("object").columns numerical_columns = X.select_dtypes("number").columns# Combine the loops to have a single progress barfor categorical in tqdm(categorical_columns, desc="Progress"):for numerical in tqdm(numerical_columns):# Create a deep copy of the input data temp = X.copy(deep=True)# Calculate the transformation for each group within the categorical column temp["new_column"] = temp.groupby(categorical)[numerical].transform( transform_type )# Run CatBoost Cross-Validation with the transformed data mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)# Store the results as a tuple result = (mean_OOF, std_OOF, categorical, numerical) results.append(result)del temp, mean_OOF, std_OOF# Create a DataFrame from the results and sort it by mean OOF scores result_df = pd.DataFrame( results, columns=["mean_OOFs", "std_OOFs", "categorical", "numerical"] ) result_df = 
result_df.sort_values(by="mean_OOFs")return result_df```::: {.callout-note}Please, bear in mind that these feature engineering steps were precomputed due to the considerable computational time required. The outcomes were saved rather than executed during the notebook rendering to save time. However, it's important to note that the results should remain unchanged.:::```{python}%%script echo skippingFE_categorical_transform_mean = feature_engineering.FE_categorical_transform( X_wo_outliers, y_wo_outliers)``````{python}%%script echo skippingFE_categorical_transform_mean.to_parquet(f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_categorical_transform_mean")}.parquet.gzip', compression="gzip",)FE_categorical_transform_mean.head(15)```As evident, the best result was obtained by treating the city feature as a categorical variable and calculating the median of cadastral_income based on this categorization. This result aligns logically with the feature importances seen in Part 5.```{python}pd.read_parquet( utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_categorical_transform_mean.parquet.gzip" )).head()```## Generate bins from the continuous variablesThe idea behind this feature engineering step is to discretize continuous variables by creating bins or categories from their values. These bins then serve as categorical columns. By using these new categorical columns for grouping, we can transform each numerical variable by replacing its values with the median of the respective category it belongs to, just like the feature engineering method we demonstrated above.```{python}#| scrolled: truedef FE_continuous_transform( X: pd.DataFrame, y: pd.Series, transform_type: str="mean") -> pd.DataFrame:""" Feature Engineering: Transform continuous features using CatBoost Cross-Validation. This function performs feature engineering by transforming continuous features using CatBoost Cross-Validation. 
It calculates the mean and standard deviation of Out-Of-Fold (OOF) Root Mean Squared Error (RMSE) scores for various combinations of discretized and transformed continuous features. Parameters: - X (pd.DataFrame): The input DataFrame containing both continuous and categorical features. - y (pd.Series): The target variable for prediction. - transform_type (str, optional): The transformation type, such as "mean" or other valid CatBoost transformations. Defaults to "mean". Returns: - pd.DataFrame: A DataFrame with columns "mean_OOFs," "std_OOFs," "discretized_continuous," and "transformed_continuous," sorted by "mean_OOFs" in ascending order. Example: # Load your DataFrame with features (X) and target variable (y) X, y = load_data() # Perform feature engineering result_df = FE_continuous_transform(X, y, transform_type="mean") # View the DataFrame with sorted results print(result_df.head()) Notes: - This function uses CatBoost Cross-Validation to assess the quality of transformations for various combinations of discretized and transformed continuous features. - The number of bins for discretization is determined using Sturges' rule. - The resulting DataFrame provides insights into the effectiveness of different transformations. - Feature engineering can help improve the performance of machine learning models. 
"""# Initialize a list to store results results = []# Get a list of continuous and numerical columns continuous_columns = X.select_dtypes("number").columns optimal_bins =int(np.floor(np.log2(X.shape[0])) +1)# Combine the loops to have a single progress barfor discretized_continuous in tqdm(continuous_columns, desc="Progress:"):for transformed_continuous in tqdm(continuous_columns):if discretized_continuous != transformed_continuous:# Create a deep copy of the input data temp = X.copy(deep=True) discretizer = pipeline.Pipeline( steps=[ ("imputer", impute.SimpleImputer(strategy="median")), ("add_bins", preprocessing.KBinsDiscretizer( encode="ordinal", n_bins=optimal_bins ), ), ] ) temp[discretized_continuous] = discretizer.fit_transform( X[[discretized_continuous]] )# Calculate the transformation for each group within the categorical column temp["new_column"] = temp.groupby(discretized_continuous)[ transformed_continuous ].transform(transform_type)# Run CatBoost Cross-Validation with the transformed data mean_OOF, std_OOF = train_model.run_catboost_CV(temp, y)# Store the results as a tuple result = ( mean_OOF, std_OOF, discretized_continuous, transformed_continuous, ) results.append(result)del temp, mean_OOF, std_OOF# Create a DataFrame from the results and sort it by mean OOF scores result_df = pd.DataFrame( results, columns=["mean_OOFs","std_OOFs","discretized_continuous","transformed_continuous", ], ) result_df = result_df.sort_values(by="mean_OOFs")return result_df``````{python}#| scrolled: true%%script echo skippingFE_continuous_transform_mean = feature_engineering.FE_continuous_transform( X_wo_outliers, y_wo_outliers)FE_continuous_transform_mean.to_parquet(f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_continuous_transform_mean")}.parquet.gzip', compression="gzip",)FE_categorical_transform_mean.head(15)```This approach was not as effective as our prior method. 
However, combining bathrooms with yearly_theoretical_total_energy_consumption yielded the best outcome.```{python}pd.read_parquet( utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_continuous_transform_mean.parquet.gzip" )).head(10)```## Introduce polynomial featuresThe idea behind introducing polynomial features is to capture non-linear relationships within the data. By raising individual features to higher powers or considering interactions between pairs of features, this step allows the model to better represent complex patterns that cannot be adequately expressed with linear relationships alone. It enhances the model's ability to learn and predict outcomes that exhibit curvilinear or interactive behavior.```{python}def FE_polynomial_features( X: pd.DataFrame, y: pd.Series, combinations: int=1) -> pd.DataFrame:""" Generate polynomial features for combinations of numerical columns and train a CatBoost model. Parameters: X (pd.DataFrame): The input DataFrame with features. y (pd.Series): The target variable. combinations (int, optional): The number of combinations of numerical columns. Default is 1. Returns: pd.DataFrame: A DataFrame containing results sorted by mean OOF scores. Example: X_wo_outliers = pd.DataFrame(...) # Your input data y_wo_outliers = pd Series(...) # Your target variable result = FE_polynomial_features(X_wo_outliers, y_wo_outliers) Transformations: - Imputes missing values in numerical columns using the median. - Generates polynomial features, including interaction terms, for selected numerical columns. - Trains a CatBoost model and calculates mean and standard deviation of out-of-fold (OOF) scores. 
"""# Initialize a list to store results results = []# Get a list of continuous and numerical columns numerical_columns = X.select_dtypes("number").columns# Combine the loops to have a single progress barfor numerical_col in tqdm(list(itertools.combinations(numerical_columns, r=combinations)) ): polyfeatures = compose.make_column_transformer( ( pipeline.make_pipeline( impute.SimpleImputer(strategy="median"), preprocessing.PolynomialFeatures(interaction_only=False), ),list(numerical_col), ), remainder="passthrough", ).set_output(transform="pandas") temp = polyfeatures.fit_transform(X) mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)# Store the results as a tuple result = ( mean_OOF, std_OOF, numerical_col, ) results.append(result)del temp, mean_OOF, std_OOF# Create a DataFrame from the results and sort it by mean OOF scores result_df = pd.DataFrame( results, columns=["mean_OOFs","std_OOFs","numerical_col", ], ) result_df = result_df.sort_values(by="mean_OOFs")return result_df```### n=1Let's see the impact of applying polynomial feature engineering to a single feature.```{python}%%script echo skippingFE_polynomial_features_combinations_1 = FE_polynomial_features( X_wo_outliers, y_wo_outliers)FE_polynomial_features_combinations_1.to_parquet(f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_1")}.parquet.gzip', compression="gzip",)FE_polynomial_features_combinations_1.head(15)``````{python}pd.read_parquet( utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_1.parquet.gzip" )).head(10)```### n=2How about two features combined...```{python}%%script echo skippingFE_polynomial_features_combinations_2 = FE_polynomial_features( X_wo_outliers, y_wo_outliers, combinations=2)FE_polynomial_features_combinations_2.to_parquet(f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_polynomial_features_combinations_2")}.parquet.gzip', 
    compression="gzip",
)
FE_polynomial_features_combinations_2.head(15)
```

```{python}
pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "FE_polynomial_features_combinations_2.parquet.gzip"
    )
).head(10)
```

## Form clusters of instances using k-means clustering

The idea behind using k-means clustering in feature engineering is to group data points into clusters based on their similarity. By doing so, we create a new set of features that represent these clusters, which can capture patterns or relationships within the data that might be less apparent in the original features. These cluster features can be valuable for machine learning models, as they provide a more compact and informative representation of the data, potentially improving predictive performance.

```{python}
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    A transformer for selecting specific columns from a DataFrame.

    This class inherits from the BaseEstimator and TransformerMixin classes from sklearn.base.
    It overrides the fit and transform methods from the parent classes.

    Attributes:
        feature_names_in_ (list): The names of the features to select.
        n_features_in_ (int): The number of features to select.

    Methods:
        fit(X, y=None): Fit the transformer. Returns self.
        transform(X, y=None): Apply the transformation. Returns a DataFrame with selected features.
    """

    def __init__(self, feature_names_in_):
        """
        Constructs all the necessary attributes for the FeatureSelector object.

        Args:
            feature_names_in_ (list): The names of the features to select.
        """
        self.feature_names_in_ = feature_names_in_
        self.n_features_in_ = len(feature_names_in_)

    def fit(self, X, y=None):
        """
        Fit the transformer. This method doesn't do anything as no fitting is necessary.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            self: The instance itself.
        """
        return self

    def transform(self, X, y=None):
        """
        Apply the transformation. Selects the features from the input data.

        Args:
            X (DataFrame): The input data.
            y (array-like, optional): The target variable. Defaults to None.

        Returns:
            DataFrame: A DataFrame with only the selected features.
        """
        return X.loc[:, self.feature_names_in_].copy(deep=True)
```

```{python}
def FE_KMeans(
    X: pd.DataFrame,
    y: pd.Series,
    n_clusters_min: int = 1,
    n_clusters_max: int = 8,
) -> pd.DataFrame:
    """Performs K-Means clustering-based feature engineering followed by model training.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.
        n_clusters_min (int, optional): The minimum number of clusters to consider. Defaults to 1.
        n_clusters_max (int, optional): The upper bound (exclusive) on the number of clusters to consider. Defaults to 8.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering with K-Means clustering.

    Example:
        >>> results_df = FE_KMeans(X_wo_outliers, y_wo_outliers)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.head().select_dtypes("number").columns.to_list()
    categorical_columns = X.head().select_dtypes("object").columns.to_list()

    for n_cluster in tqdm(range(n_clusters_min, n_clusters_max)):
        # Prepare pipelines for corresponding columns:
        numerical_pipeline = pipeline.Pipeline(
            steps=[
                ("num_selector", FeatureSelector(numerical_columns)),
                ("imputer", impute.SimpleImputer(strategy="median")),
            ]
        )

        categorical_pipeline = pipeline.Pipeline(
            steps=[
                ("cat_selector", FeatureSelector(categorical_columns)),
                ("imputer", impute.SimpleImputer(strategy="most_frequent")),
                (
                    "onehot",
                    preprocessing.OneHotEncoder(
                        handle_unknown="ignore", sparse_output=False
                    ),
                ),
            ]
        )

        # Put all the pipelines inside a FeatureUnion:
        data_preprocessing_pipeline = pipeline.FeatureUnion(
            n_jobs=-1,
            transformer_list=[
                ("numerical_pipeline", numerical_pipeline),
                ("categorical_pipeline", categorical_pipeline),
            ],
        )

        temp = pd.DataFrame(data_preprocessing_pipeline.fit_transform(X))

        KMeans = cluster.KMeans(n_init=10, n_clusters=n_cluster)
        KMeans.fit_transform(temp)

        groups = pd.Series(KMeans.labels_, name="groups")

        concatanated_df = pd.concat([temp, groups], axis="columns")

        mean_OOF, std_OOF = train_model.run_catboost_CV(X=concatanated_df, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            n_cluster,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF, KMeans, groups, concatanated_df, result

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "n_cluster",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
```

```{python}
%%script echo skipping

FE_KMeans_df = FE_KMeans(
    X_wo_outliers,
    y_wo_outliers,
    n_clusters_min=1,
    n_clusters_max=101,
)
FE_KMeans_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KNN_df")}.parquet.gzip',
    compression="gzip",
)
FE_KMeans_df.head(15)
```

Because k-means is an unsupervised algorithm, the appropriate number of clusters has to be found by trying several values and checking their impact on our validation scores. As the results below show, this approach didn't pay off in our case: the best validation score was obtained with n=1, that is, with a single cluster, where the new `groups` column is constant and carries no information.

```{python}
pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_KNN_df.parquet.gzip")
).head(10)
```

## Implement other ideas derived from empirical observations or assumptions

Though new features can be generated through systematic methods, domain knowledge can also inspire their creation. The idea is to incorporate unconventional or domain-specific insights that may not fit standard feature engineering techniques. It encourages exploring novel features or transformations, based on practical experience or theoretical assumptions, that may uncover hidden patterns or relationships within the data. This open-ended approach can lead to creative and tailored feature engineering solutions.

**Here are some ideas to consider:**

1. 
   **Geospatial Features:**
   - Create clusters or neighborhoods based on features to capture similarities.
4. **Area-related Features:**
   - Calculate the ratio of `living_area` to `surface_of_the_plot` to get an idea of the density or spaciousness of the property.
5. **Energy Efficiency Features:**
   - Compute the energy efficiency ratio by dividing `yearly_theoretical_total_energy_consumption` by `primary_energy_consumption`.
   - Compute energy efficiency by dividing `primary_energy_consumption` by `living_area`.
6. **Toilet and Bathroom Features:**
   - Combine `toilets` and `bathrooms` into a single `total_bathrooms` feature to simplify the model.
   - Calculate the total number of rooms by adding up `bedrooms`, `toilets`, and `bathrooms`.
8. **Taxation Features:**
   - Incorporate `cadastral_income` as a measure of property value for taxation. You can create bins or categories for this variable.
9. **Value for Money:**
   - Divide `cadastral_income` by `bedrooms` to see if the property is a good bargain.
   - Similarly, divide `cadastral_income` by `living_area`.

```{python}
def FE_ideas(X):
    """Performs additional feature engineering on the input DataFrame.

    Args:
        X (pd.DataFrame): The input DataFrame containing the original features.

    Returns:
        pd.DataFrame: A DataFrame with additional engineered features.

    Example:
        >>> engineered_data = FE_ideas(original_data)
    """
    temp = X.assign(
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets + df.bathrooms + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )
    return temp.loc[:, "energy_efficiency_1":]
```

```{python}
def FE_try_ideas(
    X: pd.DataFrame,
    y: pd.Series,
) -> pd.DataFrame:
    """Performs feature engineering experiments by adding new features and evaluating their impact on model performance.

    Args:
        X (pd.DataFrame): The input feature matrix.
        y (pd.Series): The target variable.

    Returns:
        pd.DataFrame: A DataFrame containing the results of feature engineering experiments.

    Example:
        >>> results_df = FE_try_ideas(X, y)
    """
    # Initialize a list to store results
    results = []

    # Get a list of continuous and numerical columns
    numerical_columns = X.select_dtypes("number").columns

    # Apply additional feature engineering ideas
    feature_df = FE_ideas(X)

    for feature in tqdm(feature_df.columns):
        # Concatenate the original features with the newly engineered feature
        temp = pd.concat([X, feature_df[feature]], axis="columns")

        # Train the model with the augmented features and get the mean and standard deviation of OOF scores
        mean_OOF, std_OOF = train_model.run_catboost_CV(X=temp, y=y)

        # Store the results as a tuple
        result = (
            mean_OOF,
            std_OOF,
            feature,
        )
        results.append(result)

        del temp, mean_OOF, std_OOF

    # Create a DataFrame from the results and sort it by mean OOF scores
    result_df = pd.DataFrame(
        results,
        columns=[
            "mean_OOFs",
            "std_OOFs",
            "feature",
        ],
    )
    result_df = result_df.sort_values(by="mean_OOFs")
    return result_df
```

```{python}
%%script echo skipping

# Store the result under a new name so the FE_try_ideas function is not overwritten
FE_try_ideas_df = FE_try_ideas(X_wo_outliers, y_wo_outliers)
FE_try_ideas_df.to_parquet(
    f'{utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas")}.parquet.gzip',
    compression="gzip",
)
FE_try_ideas_df
```

As can be seen below, the best feature this time was `spaciousness_1`, representing `df.living_area` divided by `df.surface_of_the_plot`.

```{python}
pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath("FE_try_ideas.parquet.gzip")
)
```

# Summary table of the tested conditions

The initial model achieved a best mean out-of-fold (OOF) score of 0.1107. However, we made modifications to expedite training by reducing iterations to 100 and increasing the learning rate to 0.2, resulting in a new baseline model with a score of 0.1105 after outlier removal.
This serves as our reference point to assess the impact of the various feature engineering techniques.

Subsequent feature engineering approaches, including utilizing categorical columns for groupby and transformation, creating bins from continuous data, and implementing other ideas, led to marginal score improvements, with the lowest at 0.1089. Polynomial features, with n=2 and n=1, showed slightly higher scores of 0.1094 and 0.1099, respectively.

Now, we will assess the efficacy of the two best approaches: utilizing categorical columns for groupby and transformation, and implementing the additional ideas. We will conduct this evaluation using CatBoost's built-in `select_features`, as outlined in Part 5. Let's dive in...

| **Condition**                                              | **Best mean OOFs** | **std OOFs** |
|------------------------------------------------------------|:------------------:|:------------:|
| **_Original_**                                             |    **_0.1107_**    |   **_NA_**   |
| Use categorical columns for groupby/transform              |       0.1089       |    0.0062    |
| Create bins from continuous data and use groupby/transform |       0.1093       |    0.0050    |
| Implementing the rest of the ideas                         |       0.1093       |    0.0046    |
| Polynomial features (n=2)                                  |       0.1094       |    0.0046    |
| Polynomial features (n=1)                                  |       0.1099       |    0.0050    |
| After Outlier filter                                       |       0.1105       |    0.0045    |
| k-means clustering                                         |       0.1118       |    0.0051    |
| Sped up version                                            |       0.1125       |    0.0044    |

# Final feature selection

```{python}
def prepare_df_for_final_feature_selection(X):
    return X.assign(
        city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
            "median"
        ),
        building_condition_group=lambda df: df.groupby("building_condition")[
            "yearly_theoretical_total_energy_consumption"
        ].transform("median"),
        energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
        / df.primary_energy_consumption,
        energy_efficiency_2=lambda df: df.primary_energy_consumption / df.living_area,
        total_bathrooms=lambda df: df.toilets + df.bathrooms,
        total_number_rooms=lambda df: df.toilets
        + df.bathrooms
        + df.bedrooms,
        spaciousness_1=lambda df: df.living_area / df.surface_of_the_plot,
        spaciousness_2=lambda df: df.living_area / df.total_number_rooms,
        bargain_1=lambda df: df.cadastral_income / df.bedrooms,
        bargain_2=lambda df: df.cadastral_income / df.living_area,
    )


X_final_feature_selection = prepare_df_for_final_feature_selection(X_wo_outliers)
```

```{python}
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_final_feature_selection,
    y_wo_outliers,
    test_size=0.2,
    random_state=utils.Configuration.seed,
)
```

```{python}
regressor = catboost.CatBoostRegressor(
    iterations=1000,
    cat_features=X_final_feature_selection.select_dtypes("object").columns.to_list(),
    random_seed=utils.Configuration.seed,
    loss_function="RMSE",
)

rfe_dict = regressor.select_features(
    algorithm="RecursiveByShapValues",
    shap_calc_type="Exact",
    X=X_train,
    y=y_train,
    eval_set=(X_val, y_val),
    features_for_select="0-25",
    num_features_to_select=1,
    steps=20,
    verbose=250,
    train_final_model=False,
    plot=True,
)
```

Based on our evaluation, we recommend retaining the following features:

- `city_group`
- `building_condition_group`
- `energy_efficiency_1`
- `energy_efficiency_2`
- `bargain_1`
- `bargain_2`

However, we should remove the following two features, since our analysis indicates that better features have been incorporated:

- `kitchen_type`
- `toilets`

These feature selections should help optimize our model even further. Based on these insights, we crafted the `prepare_data_for_modelling` function, which is stored in the `pre_process.py` file. This function includes the feature engineering steps we discussed, setting the stage for effective modeling performance.

```{python}
def prepare_data_for_modelling(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for machine learning modeling.

    This function takes a DataFrame and prepares it for machine learning by
    performing the following steps:
    1. Randomly shuffles the rows of the DataFrame.
    2. Converts the 'price' column to the base 10 logarithm.
    3. Adds the engineered features retained after the final feature selection.
    4. Fills missing values in categorical variables with 'missing value'.
    5. Separates the features (X) and the target (y).
    6. Identifies and filters out outlier values based on LocalOutlierFactor.

    Example use case:
        # Load your dataset into a DataFrame (e.g., df)
        df = load_data()

        # Prepare the data for modeling
        X, y = prepare_data_for_modelling(df)

        # Now you can use X and y for machine learning tasks.

    Args:
        df (pd.DataFrame): The input DataFrame containing the dataset.

    Returns:
        Tuple[pd.DataFrame, pd.Series]: A tuple containing the prepared features (X) and the target (y).
    """
    processed_df = (
        df.sample(frac=1, random_state=utils.Configuration.seed)
        .reset_index(drop=True)
        .assign(
            price=lambda df: np.log10(df.price),
            city_group=lambda df: df.groupby("city")["cadastral_income"].transform(
                "median"
            ),
            building_condition_group=lambda df: df.groupby("building_condition")[
                "yearly_theoretical_total_energy_consumption"
            ].transform("median"),
            energy_efficiency_1=lambda df: df.yearly_theoretical_total_energy_consumption
            / df.primary_energy_consumption,
            energy_efficiency_2=lambda df: df.primary_energy_consumption
            / df.living_area,
            bargain_1=lambda df: df.cadastral_income / df.bedrooms,
            bargain_2=lambda df: df.cadastral_income / df.living_area,
        )
    )

    # Fill missing categorical variables with "missing value"
    for col in processed_df.columns:
        if processed_df[col].dtype.name in ("bool", "object", "category"):
            processed_df[col] = processed_df[col].fillna("missing value")

    # Separate features (X) and target (y)
    X = processed_df.loc[:, utils.Configuration.features_to_keep_v2]
    y = processed_df[utils.Configuration.target_col]

    # Keep only the rows selected by the outlier mask
    outlier_mask = pre_process.identify_outliers(X)
    X_wo_outliers = X.loc[outlier_mask, :].reset_index(drop=True)
    y_wo_outliers = y.loc[outlier_mask].reset_index(drop=True)

    print(f"Shape of X and y with outliers: {X.shape}, {y.shape}")
    print(
        f"Shape of X and y without outliers: {X_wo_outliers.shape}, {y_wo_outliers.shape}"
    )
    return X_wo_outliers, y_wo_outliers
```

```{python}
X, y = pre_process.prepare_data_for_modelling(df)
```

In this notebook, we began with an initial selection of 16 features based on the work in Part 5. As we conclude this article, we've found that we can streamline our feature set even further by removing `kitchen_type` and `toilets`, which improved performance thanks to the addition of the new features. While there's potential for further optimization, such as dimensionality reduction, we are happy with our progress for now. In the next, and final, part, we will focus on fine-tuning our model for optimal predictive performance. See you there!
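A closing note on the ratio features kept above: `bargain_1` divides by `bedrooms`, which may be zero for some listings, and pandas then returns `inf` instead of raising an error. The sketch below uses a made-up three-row DataFrame to show the groupby/median transform behind `city_group` and one possible way to guard the division. The helper name `safe_ratio` and the toy values are assumptions for illustration only; they are not part of `pre_process.py`.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Hypothetical helper: divide two Series, giving NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy data (invented values, not taken from the scraped dataset)
toy = pd.DataFrame(
    {
        "city": ["Ghent", "Ghent", "Leuven"],
        "cadastral_income": [1000.0, 2000.0, 1500.0],
        "bedrooms": [2, 0, 3],
    }
)

toy = toy.assign(
    # Median cadastral income per city, broadcast back to every row
    city_group=lambda df: df.groupby("city")["cadastral_income"].transform("median"),
    # Plain division yields inf for the zero-bedroom row
    bargain_1_raw=lambda df: df.cadastral_income / df.bedrooms,
    # Guarded division yields NaN instead, which can be treated as a missing value
    bargain_1=lambda df: safe_ratio(df.cadastral_income, df.bedrooms),
)
print(toy)
```

Whether such a guard is needed depends on the data; if `bedrooms` is never zero after preprocessing, the plain division used in the article behaves identically.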