Peek My Home Price Part 1: Characterizing the Data

Belgian Housing Market Insights
Author

Adam Cseresznye

Published

November 20, 2024

Peek My Home Price Home Page

Welcome to Peek My Home Price, a project that dives into the key factors that influence real estate property prices in Belgium. Our ultimate goal is to leverage up-to-date data from the country's leading real estate platforms to accurately predict house prices. We aim to create a platform that allows users to gain insights into the dynamic Belgian real estate market, province by province.

Note

You can explore the project’s app on its website. For more details, visit the GitHub repository.

Check out the series for a deeper dive:

  - Part 1: Characterizing the Data
  - Part 2: Building a Baseline Model
  - Part 3: Feature Selection
  - Part 4: Feature Engineering
  - Part 5: Fine-Tuning

The app is divided into three main sections:

  1. Dashboard: Get a quick snapshot of the latest real estate trends, including average prices, the most active regions, and interesting facts about the dataset.
  2. Trends: Dive deeper into historical price trends. Explore median price changes over time for each Belgian province.
  3. Prediction: Input specific variables and generate your own price predictions based on our latest trained model.

In this blog series, we’ll take you behind the scenes of Peek My Home Price and guide you through the thought process that led to the creation of the application. Feel free to explore the topics that pique your interest or that you’d like to learn more about. We hope you’ll find this information valuable for your own projects. Let’s get started!

Code
import sys
from pathlib import Path

sys.path.append(str(Path.cwd()))
Code
import time
from pathlib import Path

import numpy as np
import pandas as pd
from lets_plot import *
from lets_plot.bistro.corr import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

Web scraping

The data used to train our model is sourced from prominent Belgian real estate platforms. Employing the Scrapy framework, we systematically extract relevant features. The collected data then undergoes a preprocessing pipeline, including duplicate removal and data type conversion. The cleaned dataset is subsequently stored in a PostgreSQL database, ready for model training. If you would like to learn more about this step, please visit the src/scraper module.
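To give a flavor of what that scraping step looks like, here is a minimal, illustrative Scrapy spider. The class name, start URL, and CSS selectors below are placeholders for illustration only, not the project's actual src/scraper code:

Code
import scrapy


class ListingSpider(scrapy.Spider):
    """Sketch of a spider that yields one raw record per listing card."""

    name = "listings"
    # Placeholder start URL; the real spider targets the platforms mentioned above
    start_urls = ["https://www.example-realestate.be/for-sale"]

    def parse(self, response):
        # The CSS selectors are assumptions about the page structure
        for card in response.css("article.listing"):
            yield {
                "price": card.css("span.price::text").get(),
                "living_area": card.css("span.area::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }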

Describing the data

Before diving into the analysis, it's worth taking a closer look at a preliminary version of our dataset. We often collect more data than we actually need, and examining the initial dataset gives us a deeper understanding of how the variables relate to our target of interest – in this case, price. For this, we will use a sample dataset that contains more features than we ultimately kept; in practice, the question is often not so much what data to collect but what data to retain. First, we dive into the initial data to examine the features that are commonly shared among most ads. After identifying these common attributes, we can streamline the data collection process by keeping these key characteristics and removing the less common ones.

Code
df = pd.read_parquet(
    Path.cwd().joinpath("data").joinpath("2023-10-01_Processed_dataset_for_NB_use.gzip")
)

As depicted in Figure 1, the features energy_class, lat, and lng show the highest completeness, with more than 90% of instances present. In contrast, subdivision_permit, yearly_theoretical_energy_consumption, and width_of_the_total_lot are among the least populated features, with only roughly 10-15% of instances non-missing.

This information lets us devise a simple retention strategy: for example, we could keep only features with more than 50% completeness, as sketched below.
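
A minimal sketch of that rule, assuming the df loaded above (df_retained is a hypothetical name for the result):

Code
# Share of non-missing values per column, expressed as a percentage
completeness_pct = df.notna().mean().mul(100)

# Keep only the columns that are more than 50% complete
df_retained = df.loc[:, completeness_pct[completeness_pct > 50].index]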

Code
# Getting the column names with lowest missing values
lowest_missing_value_columns = (
    df.notna()
    .sum()
    .div(df.shape[0])
    .mul(100)
    .sort_values(ascending=False)
    .head(50)
    .round(1)
)
indexes_to_keep = lowest_missing_value_columns.index

(
    lowest_missing_value_columns.reset_index()
    .rename(columns={"index": "column", 0: "perc_values_present"})
    .assign(
        Has_non_missing_values_above_50_pct=lambda df: df.perc_values_present.gt(50),
        perc_values_present=lambda df: df.perc_values_present - 50,
    )
    .pipe(
        lambda df: ggplot(
            df,
            aes(
                "perc_values_present",
                "column",
                fill="Has_non_missing_values_above_50_pct",
            ),
        )
        + geom_bar(stat="identity", orientation="y", show_legend=False)
        + labs(
            title="Top 50 Features with Non-Missing Values Above 50%",
            subtitle="""The plot illustrates that the features such as'energy class,' 'lng' and 'lat' exhibited the 
            highest completeness, with over 90% of instances present. Conversely, 'subdivision_permit', was among 
            the least populated features, with approximately 10% of non-missing instances.
            """,
            x="Percentage of Instances Present with Reference Point at 50%",
            y="",
        )
        + theme(
            plot_subtitle=element_text(size=12, face="italic"),
            plot_title=element_text(size=15, face="bold"),
        )
        + ggsize(1000, 600)
    )
)
Figure 1: Top 50 Features with Non-Missing Values Above 50%

Assessing Feature Cardinality

Now, let’s assess the feature cardinality of our dataset to differentiate between categorical and numerical variables. To do this, we will analyze the percentage of unique values per feature.

Code
# Summarize each column: name, dtype, and number of unique values
# (the "unique_values_pct" entries start out as raw counts and are
# converted to percentages in the pipeline below)
number_unique_entries = {
    "column_name": df.columns.tolist(),
    "column_dtype": [df[col].dtype for col in df.columns],
    "unique_values_pct": [df[col].nunique() for col in df.columns],
}

(
    pd.DataFrame(number_unique_entries)
    .sort_values("unique_values_pct")
    .assign(
        unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0])
        .mul(100)
        .round(1)
    )
    .pipe(
        lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
        + geom_bar(stat="identity", orientation="y")
        + labs(
            title="Assessing Feature Cardinality",
            subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be Used as Categorical Variables, 
            while Those with Higher Cardinality, typically represented as floats or ints, May Be Used as They Are
            """,
            x="Percentage of Unique Values per Feature",
            y="",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 1000)
    )
)
Figure 2: Assessing Feature Cardinality: Percentage of Unique Values per Feature

Distribution of the target variable

Examining the distribution of our target variable, price, reveals a notable skew. The median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To improve the accuracy of our predictions, it is worth transforming the target variable before modeling. This transformation serves several purposes:

  1. Normalization: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.

  2. Equalizing Variance: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.

  3. Mitigating Outliers: It is effective at diminishing the impact of extreme outliers, bolstering the model’s robustness against data anomalies.

  4. Interpretability: When interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale via base-10 exponentiation, so predictions remain easily interpretable in their original units (see the short sketch after this list).
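
As a small illustration of point 4, the round trip between the modeling scale and EUR is just a log10 followed by a base-10 exponentiation (a sketch, assuming the df loaded earlier):

Code
# Forward transform used for modelling ...
log_price = np.log10(df["price"])

# ... and the base-10 exponentiation that maps predictions back to EUR
price_eur = np.power(10, log_price)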

Code
before_transformation = df.pipe(
    lambda df: ggplot(df, aes("price")) + geom_histogram()
) + labs(
    title="Before Transformation",
)
after_transformation = df.assign(price=lambda df: np.log10(df.price)).pipe(
    lambda df: ggplot(df, aes("price"))
    + geom_histogram()
    + labs(
        title="After log10 Transformation",
    )
)
gggrid([before_transformation, after_transformation], ncol=2) + ggsize(800, 300)
Figure 3: Target distribution before and after log10 transformation

Relationship between independent and dependent variables

Next, we will investigate how house prices vary when grouped according to our independent variables. Please take into account that the price values have undergone log transformation to address skewness.

Code
# Low-cardinality features: at most five distinct values (note that the
# "unique_values_pct" entries still hold raw unique-value counts here)
low_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("unique_values_pct <= 5")
    .column_name.to_list()
)

# High-cardinality numeric features: at least five distinct values and a float dtype
high_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("(unique_values_pct >= 5)")
    .loc[lambda df: (df.column_dtype == "float32") | (df.column_dtype == "float64"), :]
    .column_name.to_list()
)
Code
plots = []

# One log10(price) boxplot per low-cardinality feature
for feature in low_cardinality_features:
    plot = (
        df.melt(id_vars=["price"])
        .loc[lambda df: df.variable == feature, :]
        .assign(price=lambda df: np.log10(df.price))
        .pipe(
            lambda df: ggplot(
                df,
                aes(as_discrete("value"), "price"),
            )
            + facet_wrap("variable")
            + geom_boxplot(
                show_legend=False,
            )
        )
    )
    plots.append(plot)
gggrid(plots, ncol=4) + ggsize(900, 1600)
Figure 4: Exploring Price Variations Across Different Variables

Correlations

Finally, we will look into the correlations among the high-cardinality variables using Spearman correlation analysis. As the heatmap shows, price correlates strongly with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59). For reference, cadastral income is a notional annual rental income assigned to each immovable property in the Flemish Region, whether it is rented out or not; it serves as the basis for an annual Flemish property tax.
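
As a quick numeric cross-check before the heatmap, the Spearman coefficients of each high-cardinality feature against price can be ranked directly (a sketch, assuming price itself appears in high_cardinality_features, as the heatmap implies):

Code
# Rank the Spearman correlations of the high-cardinality features against price
(
    df.loc[:, lambda d: d.columns.isin(high_cardinality_features)]
    .corr(method="spearman")["price"]
    .drop("price")
    .sort_values(ascending=False)
)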

Code
(
    df.loc[:, lambda df: df.columns.isin(high_cardinality_features)]
    .corr(method="spearman")
    .pipe(
        lambda df: corr_plot(df)
        .tiles(
            "lower",
        )
        .labels(type="lower", map_size=False)
        .palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027")
        .build()
        + ggsize(900, 900)
        + labs(
            title="Spearman Correlations Among High Cardinality Features",
            subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77), 
            living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59)
            """,
            x="Number of Unique Values per Feature",
            y="",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
    )
)
Figure 5: Spearman Correlations Among High Cardinality Features

Having laid the groundwork with initial data exploration in Part 1, we’re now ready to take the next step: building a foundational machine learning model. In Part 2, we’ll put various algorithms to the test, establishing a benchmark that will serve as a reference point for our future model-building and feature engineering efforts. This baseline model will provide a crucial starting point, guiding us as we work to refine and enhance our predictive capabilities. See you there!