Predicting Belgian Real Estate Prices: Part 1: Feature Selection for Web Scraping

Predicting Belgian Real Estate Prices
Author

Adam Cseresznye

Published

November 3, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Welcome to our project focusing on understanding the key factors that impact real estate property prices in Belgium. Our ultimate goal is to leverage data collected from immoweb.be, a prominent real estate platform in the country, to predict house prices.

Note

You can access the project’s app through its Streamlit website.

The app is divided into three sections:

  1. Intro: This section provides a basic overview of how the project is structured, how data is handled, and how the models are trained.

  2. Explore Data: In this section, you can explore the data interactively using boxplots and scatter plots. It allows you to visualize how specific variables impact property prices.

  3. Make Predictions: On the “Make Predictions” page, you can input certain variables and generate your own predictions based on the latest trained model.

It’s worth noting that we maintain the app’s accuracy by regularly updating the data through GitHub Actions, which scrapes fresh data and retrains the model every month. To test your skills against my baseline test RMSE score, you can download the dataset I uploaded to my Kaggle account through Kaggle Datasets.

I’m excited to see what you can come up with using this tool. Feel free to explore and experiment with the app, and don’t hesitate to ask if you have any questions or need assistance with anything related to it.

In a series of blog posts, we will guide you through the thought process that led to the creation of the Streamlit application. Feel free to explore the topics that pique your interest or that you’d like to learn more about. We hope you’ll find this information valuable for your own projects. Let’s get started!

Note: Although the data collection process is not described in detail here, you can find the complete workflow in the src/main.py file, specifically focusing on the relevant functions and methods in src/data/make_dataset.py. Feel free to explore it further. In summary, we utilized the requests-html library to scrape all available data, which we will show you how to process in subsequent notebooks.
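To give you a feel for the library, here is a minimal, hypothetical sketch of fetching a single listing page with requests-html. This is not the project’s actual scraper, and both the URL and the CSS selector below are placeholders used purely for illustration:

Code
from requests_html import HTMLSession

session = HTMLSession()

# Hypothetical listing URL, used only for illustration
url = "https://www.immoweb.be/en/classified/house/for-sale/example/12345"
response = session.get(url)

# Placeholder CSS selector; the real page structure may differ
for element in response.html.find("p.classified__price"):
    print(element.text)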

How to import your own module using a .pth file

In case you encounter difficulties importing your own modules, I found this Stack Overflow question to be quite helpful. To resolve this issue, you can follow these steps:

  1. Create a .pth file that contains the path to the folder where your module is located. For example, prepare a .pth file with the content: C:\Users\myname\house_price_prediction\src.

  2. Place this .pth file into the following folder: C:\Users\myname\AppData\Roaming\Python\Python310\site-packages. This folder is already included in your PYTHONPATH, allowing Python to recognize your package directory.

  3. To verify which folders are included in your PYTHONPATH, you can run `import sys` and inspect `sys.path`, as shown in the snippet below.

Once you’ve completed these steps, you’ll be able to import the utils module with the following statement: `from data import utils`.
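For example, a quick check that the new path is picked up and that the import works could look like this:

Code
import sys

# Print every directory Python searches for modules;
# the folder referenced in the .pth file should appear here
for path in sys.path:
    print(path)

# If the path is present, this import should now succeed
from data import utils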

Import data

Code
import time
from pathlib import Path

import pandas as pd
from data import utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

Select Columns to Retain Based on the Quantity of Missing Values

In the realm of web scraping, managing the sheer volume of data is often the initial hurdle to conquer. It’s not so much about deciding what data to collect but rather what data to retain. As we delve into the data collected from the Immoweb website, we are met with a plethora of listings, each offering a unique set of information.

For many of these listings, there are commonalities – details like location and price tend to be constants. However, interspersed among them are one-of-a-kind nuggets of information, such as the number of swimming pools, which are unique to only a handful of listings. While these specific details can certainly be valuable in assessing the worth of certain properties, the downside is that they lead to a sparse dataset.

Now, let’s import our initial dataset to examine the features that are commonly shared among most ads, i.e., those that are filled in most frequently. After identifying these common attributes, we can optimize our data collection process by keeping these key characteristics and removing the less common ones.

Code
df = pd.read_parquet(
    utils.Configuration.RAW_DATA_PATH.joinpath(
        "complete_dataset_2023-09-27_for_NB2.parquet.gzip"
    )
)

As depicted in Figure 1, the features ‘day of retrieval,’ ‘url,’ and ‘Energy Class’ demonstrate the highest completeness, with more than 90% of instances being present. In contrast, ‘dining room,’ ‘office,’ and ‘TV cable’ are among the least populated features, with roughly 10-20% of non-missing instances.

This information allows us to devise a strategy where we, for example, could retain features with a completeness of over 50%. We will delve deeper into this question in our subsequent notebooks.

Code
# Rank columns by the percentage of non-missing values and keep the 50 most complete
lowest_missing_value_columns = (
    df.notna()
    .sum()
    .div(df.shape[0])
    .mul(100)
    .sort_values(ascending=False)
    .head(50)
    .round(1)
)
indexes_to_keep = lowest_missing_value_columns.index

(
    lowest_missing_value_columns.reset_index()
    .rename(columns={"index": "column", 0: "perc_values_present"})
    .assign(
        Has_non_missing_values_above_50_pct=lambda df: df.perc_values_present.gt(50),
        perc_values_present=lambda df: df.perc_values_present - 50,
    )
    .pipe(
        lambda df: ggplot(
            df,
            aes(
                "perc_values_present",
                "column",
                fill="Has_non_missing_values_above_50_pct",
            ),
        )
        + geom_bar(stat="identity", orientation="y", show_legend=False)
        + ggsize(800, 1000)
        + labs(
            title="Top 50 Features with Non-Missing Values Above 50%",
            subtitle="""The plot illustrates that the features 'day of retrieval,' 'url,' and 'Energy Class' exhibited the 
            highest completeness, with over 90% of instances present. Conversely, 'dining room,' 'office,' and 'TV cable' 
            were among the least populated features, with approximately 10-20% of non-missing instances.
            """,
            x="Percentage of Instances Present with Reference Point at 50%",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
    )
)
Figure 1: Top 50 Features with Non-Missing Values Above 50%
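As a quick illustration of the strategy mentioned above, a minimal sketch of a 50% completeness filter could look like the snippet below. The exact cutoff is only a working assumption at this point and will be revisited in later notebooks.

Code
# Keep only the columns where more than 50% of the rows are non-missing
completeness = df.notna().mean()
columns_to_keep = completeness[completeness > 0.5].index

df_reduced = df.loc[:, columns_to_keep]
print(f"Retained {df_reduced.shape[1]} of {df.shape[1]} columns")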

That’s all for now. In part 2, we will examine the downloaded raw data and investigate the error messages we encountered during the web scraping process with the goal of understanding how to overcome these challenges. See you in the next installment!