import sys
from pathlib import Path
sys.path.append(str(Path.cwd()))
Adam Cseresznye
November 20, 2024
Welcome to Peek My Home Price, a project that dives into the key factors that influence real estate property prices in Belgium. Our ultimate goal is to leverage up-to-date data from leading real estate platforms in the country to accurately predict house prices. We aim to create a platform that allows users to gain insights into the dynamic Belgian real estate market, province by province.
You can explore the project’s app on its website. For more details, visit the GitHub repository.
Check out the series for a deeper dive:
- Part 1: Characterizing the Data
- Part 2: Building a Baseline Model
- Part 3: Feature Selection
- Part 4: Feature Engineering
- Part 5: Fine-Tuning
The app itself is divided into three main sections, which you can explore on the website linked above.
In this blog series, we’ll take you behind the scenes of Peek My Home Price
and guide you through the thought process that led to the creation of the application. Feel free to explore the topics that pique your interest or that you’d like to learn more about. We hope you’ll find this information valuable for your own projects. Let’s get started!
The data used to train our model is sourced from prominent Belgian real estate platforms. Employing the Scrapy
framework, we systematically extract relevant features. The collected data then undergoes a preprocessing pipeline, including duplicate removal and data type conversion. The cleaned dataset is subsequently stored in a PostgreSQL
database, ready for model training. If you would like to learn more about this step, please visit the src/scraper
module.
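The actual pipeline lives in the src/scraper module; as a rough, hypothetical sketch of the cleaning-and-storage step (the table name, column names and connection string below are illustrative placeholders, not the project's real ones), it could look something like this:

import pandas as pd
from sqlalchemy import create_engine


def clean_and_store(raw: pd.DataFrame, connection_string: str) -> None:
    """Illustrative sketch only: drop duplicates, coerce dtypes and write listings to PostgreSQL."""
    cleaned = (
        raw.drop_duplicates()
        # in this sketch, price arrives as text from the scraper; coerce it to numeric
        .assign(price=lambda df: pd.to_numeric(df["price"], errors="coerce"))
        .dropna(subset=["price"])
    )
    engine = create_engine(connection_string)  # e.g. a postgresql+psycopg2 URL
    cleaned.to_sql("listings", engine, if_exists="append", index=False)


# Hypothetical usage:
# clean_and_store(scraped_df, "postgresql+psycopg2://user:password@localhost:5432/real_estate")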
Before diving into analysis, it’s crucial to take a closer look at our dataset’s preliminary version. This step is essential because, often, we collect more data than we actually need. By examining the initial dataset, we can gain a deeper understanding of the relationships between variables and our target variable of interest – in this case, price. For this, we will use a sample dataset that contains more features than we ended up using. It’s often not so much about deciding what data to collect but rather what data to retain. First, we dive into the initial data collected to examine the features that are commonly shared among most ads. After identifying these common attributes, we can optimize our data collection process by keeping these key characteristics and removing the less common ones.
As depicted in Figure 1, the features energy_class, lat, and lng demonstrate the highest completeness, with more than 90% of instances present. In contrast, subdivision_permit, yearly_theoretical_energy_consumption, and width_of_the_total_lot are among the least populated features, with roughly 10-15% of instances non-missing.
This information allows us to devise a strategy: for example, we could retain only the features with more than 50% completeness (a small sketch of such a filter follows the plot below).
import numpy as np
import pandas as pd
from lets_plot import *
from lets_plot.bistro.corr import corr_plot

LetsPlot.setup_html()

# Percentage of non-missing values per column, keeping the 50 most complete columns
lowest_missing_value_columns = (
    df.notna()
    .sum()
    .div(df.shape[0])
    .mul(100)
    .sort_values(ascending=False)
    .head(50)
    .round(1)
)
indexes_to_keep = lowest_missing_value_columns.index
(
lowest_missing_value_columns.reset_index()
.rename(columns={"index": "column", 0: "perc_values_present"})
.assign(
Has_non_missing_values_above_50_pct=lambda df: df.perc_values_present.gt(50),
perc_values_present=lambda df: df.perc_values_present - 50,
)
.pipe(
lambda df: ggplot(
df,
aes(
"perc_values_present",
"column",
fill="Has_non_missing_values_above_50_pct",
),
)
+ geom_bar(stat="identity", orientation="y", show_legend=False)
+ labs(
title="Top 50 Features with Non-Missing Values Above 50%",
            subtitle="""The plot illustrates that features such as 'energy_class', 'lng' and 'lat' exhibit the
            highest completeness, with over 90% of instances present. Conversely, 'subdivision_permit' was among
the least populated features, with approximately 10% of non-missing instances.
""",
x="Percentage of Instances Present with Reference Point at 50%",
y="",
)
+ theme(
plot_subtitle=element_text(size=12, face="italic"),
plot_title=element_text(size=15, face="bold"),
)
+ ggsize(1000, 600)
)
)
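To make that strategy concrete, here is a minimal sketch of how such a 50% completeness filter could be applied to the sample dataset (assuming df is the same DataFrame used above; completeness and df_filtered are illustrative names):

# Share of non-missing values per column, in percent
completeness = df.notna().mean().mul(100)

# Keep only the columns that are at least 50% populated
df_filtered = df.loc[:, completeness[completeness.gt(50)].index]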
Now, let’s assess the feature cardinality of our dataset to differentiate between categorical and numerical variables. To do this, we will analyze the percentage of unique values per feature.
# Assuming df is your DataFrame; raw unique-value counts are collected here
# and converted to percentages of the row count further below
number_unique_entries = {
    "column_name": df.columns.tolist(),
    "column_dtype": [df[col].dtype for col in df.columns],
    "unique_values_pct": [df[col].nunique() for col in df.columns],
}
(
pd.DataFrame(number_unique_entries)
.sort_values("unique_values_pct")
.assign(
unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0])
.mul(100)
.round(1)
)
.pipe(
lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
+ geom_bar(stat="identity", orientation="y")
+ labs(
title="Assessing Feature Cardinality",
            subtitle="""Features with low cardinality (fewer than 10 distinct values) can be treated as categorical variables,
            while those with higher cardinality, typically represented as floats or ints, can be used as they are
""",
x="Percentage of Unique Values per Feature",
y="",
)
+ theme(
plot_subtitle=element_text(
size=12, face="italic"
), # Customize subtitle appearance
plot_title=element_text(size=15, face="bold"), # Customize title appearance
)
+ ggsize(800, 1000)
)
)
Upon examining the distribution of our target variable, price, it becomes evident that there is a notable skew. The median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To increase the accuracy of our predictions, it is worth transforming the target variable before modeling. This transformation serves several beneficial purposes:
Normalization: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.
Equalizing Variance: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.
Mitigating Outliers: It is effective at diminishing the impact of extreme outliers, bolstering the model’s robustness against data anomalies.
Interpretability: When interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale using base 10 exponentiation (as sketched below), ensuring that predictions remain easily interpretable in their original units.
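To illustrate the last point, here is a small sketch of the forward transformation and the corresponding back-transformation (the prediction value is a made-up example):

import numpy as np

# Forward transformation applied before modeling
log_price = np.log10(df["price"])

# Back-transformation of a hypothetical prediction on the log10 scale:
# 10 ** 5.6 is roughly 398,000 EUR
predicted_log_price = 5.6
predicted_price = np.power(10, predicted_log_price)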
before_transformation = df.pipe(
lambda df: ggplot(df, aes("price")) + geom_histogram()
) + labs(
title="Before Transformation",
)
after_transformation = df.assign(price=lambda df: np.log10(df.price)).pipe(
lambda df: ggplot(df, aes("price"))
+ geom_histogram()
+ labs(
title="After log10 Transformation",
)
)
gggrid([before_transformation, after_transformation], ncol=2) + ggsize(800, 300)
Next, we will investigate how house prices vary when grouped according to our independent variables. Please take into account that the price values have undergone log transformation to address skewness.
# Note: number_unique_entries still holds raw unique-value counts,
# so the thresholds below refer to numbers of distinct values, not percentages
low_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("unique_values_pct <= 5")
    .column_name.to_list()
)

high_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("unique_values_pct >= 5")
    .loc[lambda df: (df.column_dtype == "float32") | (df.column_dtype == "float64"), :]
    .column_name.to_list()
)
plots = []
for feature in low_cardinality_features:
    plot = (
        df.melt(id_vars=["price"])
        .loc[lambda df: df.variable == feature, :]
        .assign(price=lambda df: np.log10(df.price))
        .pipe(
            lambda df: ggplot(
                df,
                aes(as_discrete("value"), "price"),
            )
            + facet_wrap("variable")
            + geom_boxplot(show_legend=False)
        )
    )
    plots.append(plot)
gggrid(plots, ncol=4) + ggsize(900, 1600)
Finally, we will look into the correlations among the high-cardinality variables using Spearman correlation analysis. As evident from the heatmap, price exhibits a strong correlation with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and the number of bathrooms (correlation coefficient = 0.59). For reference, cadastral income is a notional annual rental income assigned to each immovable property in Belgium, whether it is rented out or not; it serves as the basis for the annual property tax levied by the regions, including Flanders.
(
df.loc[:, lambda df: df.columns.isin(high_cardinality_features)]
.corr(method="spearman")
.pipe(
lambda df: corr_plot(df)
.tiles(
"lower",
)
.labels(type="lower", map_size=False)
.palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027")
.build()
+ ggsize(900, 900)
+ labs(
title="Spearman Correlations Among High Cardinality Features",
subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77),
living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59)
""",
            x="",
            y="",
)
+ theme(
plot_subtitle=element_text(
size=12, face="italic"
), # Customize subtitle appearance
plot_title=element_text(size=15, face="bold"), # Customize title appearance
)
)
)
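As a quick cross-check of the figures quoted above, the same correlations with price can be pulled out as a ranked list (assuming price is among the high-cardinality columns, as in the heatmap):

(
    df.loc[:, lambda df: df.columns.isin(high_cardinality_features)]
    .corr(method="spearman")
    .loc[:, "price"]
    .drop("price")  # drop the trivial self-correlation
    .sort_values(ascending=False)
    .head(5)
)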
Having laid the groundwork with initial data exploration in Part 1, we’re now ready to take the next step: building a foundational machine learning model. In Part 2, we’ll put various algorithms to the test, establishing a benchmark that will serve as a reference point for our future model-building and feature engineering efforts. This baseline model will provide a crucial starting point, guiding us as we work to refine and enhance our predictive capabilities. See you there!