Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash

In the second part, we delved into the intricacies of data processing necessary after scraping. Our discussion covered the treatment of numerical data, handling of categorical variables, and management of boolean values. Furthermore, we evaluated the data quality by scrutinizing the error log produced by the Immowebscraper class. In the upcoming Part 3, our focus will shift to getting a fundamental overview of our data by characterizing the cleaned scraped data. Additionally, we aim to assess feature cardinality, scrutinize distributions, and explore potential correlations between features and our target variable—property price.

Note

You can access the project’s app through its Streamlit website.

Import data

Code

import time
from pathlib import Path

import creds
import numpy as np
import pandas as pd
from data import pre_process, utils
from lets_plot import *
from lets_plot.bistro.corr import *
from lets_plot.geo_data import *
from lets_plot.mapping import as_discrete
from tqdm import tqdm

LetsPlot.setup_html(offline=True)

The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).

Data Inspection

Our initial step involves reading the cleaned-up dataset that was initially scraped. This dataset is stored in the interim folder following the cleaning process, and we will access it from there.

Code

df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)
df.head().style.set_sticky(axis=0)

	as_built_plan	available_as_of	basement	bathrooms	bedroom_1_surface	bedroom_2_surface	bedroom_3_surface	bedrooms	building_condition	co2_emission	cadastral_income	connection_to_sewer_network	construction_year	covered_parking_spaces	dining_room	double_glazing	energy_class	external_reference	flood_zone_type	garden_surface	gas_water__electricity	heating_type	kitchen_surface	kitchen_type	latest_land_use_designation	living_area	living_room_surface	number_of_frontages	office	outdoor_parking_spaces	planning_permission_obtained	possible_priority_purchase_right	price	primary_energy_consumption	proceedings_for_breach_of_planning_regulations	reference_number_of_the_epc_report	street_frontage_width	subdivision_permit	surface_of_the_plot	surroundings_type	tv_cable	toilets	website	width_of_the_lot_on_the_street	yearly_theoretical_total_energy_consumption	ad_url	day_of_retrieval	housenumber	street	city	postal	state	lat	lng
0	nan	After signing the deed	1.000000	4.000000	28.000000	24.000000	21.000000	5.000000	Good	74.000000	7615.000000	0.000000	1975.000000	2.000000	1.000000	1.000000	D	5530019	nan	4300.000000	1.000000	Fuel oil	21.000000	USA hyper equipped	None	420.000000	63.000000	4.000000	1.000000	2.000000	nan	nan	1225000.000000	296.000000	nan	20210910012683	23.000000	nan	4677.000000	Living area (residential, urban or rural)	1.000000	5.000000	http://www.bytheway.be	nan	133062.000000	https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811	2023-09-27 14:02:10.370650	96	Drève Richelle	Waterloo	1410	Région Wallonne	50.708475	4.407577
1	0.000000	After signing the deed	nan	1.000000	14.000000	10.000000	nan	2.000000	To renovate	76.000000	541.000000	1.000000	1850.000000	1.000000	1.000000	1.000000	D	5246367	nan	nan	nan	Fuel oil	nan	Semi equipped	Living area (residential, urban or rural)	113.000000	nan	2.000000	nan	nan	nan	0.000000	89000.000000	307.000000	0.000000	20230306014621	9.000000	0.000000	73.000000	Urban	1.000000	1.000000	http://www.weinvest.be	10.000000	43587.000000	https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929	2023-09-27 14:02:10.641758	65	Chau. de l'Ourthe	Marche-en-Famenne	6900	Région Wallonne	50.231363	5.350163
2	nan	After signing the deed	1.000000	1.000000	17.000000	11.000000	nan	2.000000	To renovate	173.000000	689.000000	1.000000	1949.000000	1.000000	1.000000	1.000000	G	5534704	0.000000	315.000000	nan	Fuel oil	13.000000	Semi equipped	Living area (residential, urban or rural)	139.000000	10.000000	3.000000	1.000000	nan	nan	0.000000	150000.000000	699.000000	0.000000	20230303013078	18.000000	0.000000	413.000000	Isolated	1.000000	1.000000	http://www.nigel-immo.be	18.000000	96913.000000	https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582	2023-09-27 14:02:10.905808	2a	Rue de la Wallonie	Oupeye	4680	Région Wallonne	50.708887	5.622615
3	nan	After signing the deed	nan	2.000000	14.000000	14.000000	10.000000	3.000000	Good	4167.000000	898.000000	1.000000	1899.000000	nan	nan	1.000000	C	5535455	0.000000	nan	1.000000	Gas	17.000000	USA installed	Living area (residential, urban or rural)	119.000000	14.000000	2.000000	nan	nan	1.000000	1.000000	272000.000000	246.000000	0.000000	20230622-0002923649-RES-1	4.000000	0.000000	55.000000	Urban	1.000000	2.000000	http://www.immodavinci.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392	2023-09-27 14:02:11.811691	1	Sint-Denijslaan	Gent	9000	Vlaams Gewest	51.033681	3.715034
4	nan	At delivery	nan	1.000000	16.000000	12.000000	12.000000	3.000000	As new	nan	nan	1.000000	2021.000000	nan	nan	1.000000	A	5527171	0.000000	250.000000	1.000000	Gas	nan	USA hyper equipped	Living area (residential, urban or rural)	215.000000	46.000000	3.000000	nan	1.000000	1.000000	0.000000	413150.000000	27.000000	nan	12345	7.000000	1.000000	330.000000	Living area (residential, urban or rural)	1.000000	2.000000	http://www.living-stone.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333	2023-09-27 14:02:11.968969	1	Astridlaan	Dilbeek	1700	Vlaams Gewest	50.850220	4.265908

To examine the data types of the columns, we will utilize the info() method.

Code

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3660 entries, 0 to 3659
Data columns (total 56 columns):
 #   Column                                          Non-Null Count  Dtype         
---  ------                                          --------------  -----         
 0   as_built_plan                                   1722 non-null   float64       
 1   available_as_of                                 2134 non-null   object        
 2   basement                                        1707 non-null   float64       
 3   bathrooms                                       3472 non-null   float32       
 4   bedroom_1_surface                               2495 non-null   float32       
 5   bedroom_2_surface                               2439 non-null   float32       
 6   bedroom_3_surface                               2005 non-null   float32       
 7   bedrooms                                        3641 non-null   float32       
 8   building_condition                              3515 non-null   object        
 9   co2_emission                                    1575 non-null   float32       
 10  cadastral_income                                2865 non-null   float32       
 11  connection_to_sewer_network                     1635 non-null   float64       
 12  construction_year                               2598 non-null   float32       
 13  covered_parking_spaces                          2009 non-null   float32       
 14  dining_room                                     1073 non-null   float64       
 15  double_glazing                                  3096 non-null   float64       
 16  energy_class                                    3658 non-null   object        
 17  external_reference                              3305 non-null   object        
 18  flood_zone_type                                 1991 non-null   float64       
 19  furnished                                       1773 non-null   float64       
 20  garden_surface                                  1736 non-null   float32       
 21  gas_water__electricity                          2454 non-null   float64       
 22  heating_type                                    3191 non-null   object        
 23  kitchen_surface                                 2129 non-null   float32       
 24  kitchen_type                                    3054 non-null   object        
 25  latest_land_use_designation                     1725 non-null   object        
 26  living_area                                     3595 non-null   float32       
 27  living_room_surface                             2367 non-null   float32       
 28  number_of_frontages                             3463 non-null   float32       
 29  office                                          1080 non-null   float64       
 30  outdoor_parking_spaces                          1757 non-null   float32       
 31  planning_permission_obtained                    1176 non-null   float64       
 32  possible_priority_purchase_right                1667 non-null   float64       
 33  price                                           3660 non-null   float64       
 34  primary_energy_consumption                      3512 non-null   float32       
 35  proceedings_for_breach_of_planning_regulations  1328 non-null   float64       
 36  reference_number_of_the_epc_report              3658 non-null   object        
 37  street_frontage_width                           1701 non-null   float32       
 38  subdivision_permit                              1398 non-null   float64       
 39  surface_of_the_plot                             3536 non-null   float32       
 40  surroundings_type                               1861 non-null   object        
 41  tv_cable                                        1244 non-null   float64       
 42  tenement_building                               3628 non-null   float64       
 43  toilets                                         3325 non-null   float32       
 44  website                                         3096 non-null   object        
 45  width_of_the_lot_on_the_street                  1553 non-null   float32       
 46  yearly_theoretical_total_energy_consumption     1447 non-null   float32       
 47  ad_url                                          3660 non-null   object        
 48  day_of_retrieval                                3660 non-null   datetime64[us]
 49  housenumber                                     3577 non-null   object        
 50  street                                          3612 non-null   object        
 51  city                                            3634 non-null   object        
 52  postal                                          3641 non-null   object        
 53  state                                           3641 non-null   object        
 54  lat                                             3641 non-null   float64       
 55  lng                                             3641 non-null   float64       
dtypes: datetime64[us](1), float32(21), float64(18), object(16)
memory usage: 1.3+ MB

Assessing Feature Cardinality

Now, let’s assess the feature cardinality of our dataset to differentiate between categorical and numerical data. Initially, we will analyze the percentage of unique values per feature, followed by displaying the absolute number of unique values per feature.

Code

# Assuming df is your DataFrame
number_unique_entries = {
    "column_name": df.columns.tolist(),
    "column_dtype": [df[col].dtype for col in df.columns],
    "unique_values_pct": [df[col].nunique() for col in df.columns],
}

(
    pd.DataFrame(number_unique_entries)
    .sort_values("unique_values_pct")
    .assign(
        unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0])
        .mul(100)
        .round(1)
    )
    .pipe(
        lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
        + geom_bar(stat="identity", orientation="y")
        + labs(
            title="Assessing Feature Cardinality",
            subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be  Utilized as Categorical Variables, 
            while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are
            """,
            x="Percentage of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 1000)
    )
)

Figure 1: Assessing Feature Cardinality: Percentage of Unique Values per Feature

Code

(
    pd.DataFrame(number_unique_entries)
    .sort_values("unique_values_pct")
    .pipe(
        lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
        + geom_bar(stat="identity", orientation="y")
        + labs(
            title="Assessing Feature Cardinality",
            subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be  Utilized as Categorical Variables, 
            while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are
            """,
            x="Number of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 1000)
    )
)

Figure 2: Assessing Feature Cardinality: Number of Unique Values per Feature

Looking at distributions

Distribution of features

Next, our focus will be on identifying low and high cardinality features. Subsequently, we will investigate how house prices vary when grouped according to these variables, using boxplots. Please take into account that the price values have undergone log transformation to address skewness.

Code

low_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("unique_values_pct <= 5")
    .column_name.to_list()
)

Code

high_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("(unique_values_pct >= 5)")
    .loc[lambda df: (df.column_dtype == "float32") | (df.column_dtype == "float64"), :]
    .column_name.to_list()
)

Code

plots = []

for idx, feature in enumerate(low_cardinality_features):
    plot = (
        df.melt(id_vars=["ad_url", "price"])
        .loc[lambda df: df.variable == feature, :]
        .assign(price=lambda df: np.log10(df.price))
        .pipe(
            lambda df: ggplot(
                df,
                aes(as_discrete("value"), "price"),
            )
            + facet_wrap("variable")
            + geom_boxplot(
                show_legend=False,
            )
        )
    )
    plots.append(plot)
gggrid(plots, ncol=4) + ggsize(900, 1600)

Figure 3: Exploring Price Variations Across Different Variables

Distribution of target

Upon examining the distribution of our target variable, which is the price, it becomes evident that there is a notable skew. Our median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To increase the accuracy of our predictions, it is worth considering a transformation of our target variable before proceeding with modeling. This transformation serves several beneficial purposes:

Normalization: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.
Equalizing Variance: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.
Mitigating Outliers: It is effective at diminishing the impact of extreme outliers, bolstering the model’s robustness against data anomalies.
Interpretability: Notably, when interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale. This can be achieved using a base 10 exponentiation, ensuring that predictions are easily interpretable in their origination task.

Code

before_transformation = df.pipe(
    lambda df: ggplot(df, aes("price")) + geom_histogram()
) + labs(
    title="Before Transformation",
)
after_transformation = df.assign(price=lambda df: np.log10(df.price)).pipe(
    lambda df: ggplot(df, aes("price"))
    + geom_histogram()
    + labs(
        title="After log10 Transformation",
    )
)
gggrid([before_transformation, after_transformation], ncol=2) + ggsize(800, 500)

Figure 4: Target distribution before and after log10 transformation

Geomapping

Next, let’s explore how median house prices vary across different locations. As indicated by the map below, Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort. This aligns with expectations, as per Wikipedia: “Watermael-Boitsfort is one of Brussels’ most affluent municipalities, with an average per capita income of €30,100 in 2002, exceeding the regional average for the Brussels-Capital Region by over €600.” On the other hand, Knokke-Heist is renowned as “It is Belgium’s best-known and most affluent seaside resort.” There you have it.

Code

grouped_city = (
    df.groupby("city")
    .price.median()
    .to_frame()
    .reset_index()
    .query("~city.isin(['Blégny', 'Woluwe-St.-Lambert', 'Wolvertem'])")
)


boundaries = (
    geocode_cities(grouped_city.city)
    .scope("Belgium")
    .inc_res(4)
    .get_boundaries(resolution=6)
)

(
    ggplot()
    + geom_livemap()
    + geom_polygon(
        aes(color="price", fill="price"),
        data=grouped_city,
        map=boundaries,
        color="#dbd6d6",
        alpha=0.9,
        size=0.5,
        map_join=[["city"], ["city"]],
        tooltips=layer_tooltips(["city", "price"]),
    show_legend=False
    )
    + labs(
        "Median House Prices in Belgium Cities",
        subtitle="Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort.",
        caption="https://www.immoweb.be/",
    )
    + theme(
        axis_title="blank",
        axis_text="blank",
        axis_ticks="blank",
        axis_line="blank",
        plot_subtitle=element_text(size=12, face="italic"),
        plot_title=element_text(size=15, face="bold"),
    )
    + scale_fill_gradient(low="green", high="red")
    + labs(fill="Median Price")
    + ggsize(1000, 800)
)

Correlations

Finally, we will delve into the correlations among variables with high cardinality through Spearman correlation analysis, with a particular focus on the price as the target variable. As evident from the heatmap, the price exhibits a strong correlation with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59). For your reference, cadastral income is an annual Flemish property tax based on the assessed rental income of immovable properties in the Flemish Region. This income is a notional rental income assigned to each property, whether it is rented out or not.

Code

(
    df.loc[:, lambda df: df.columns.isin(high_cardinality_features)]
    .corr(method="spearman")
    .pipe(
        lambda df: corr_plot(df)
        .tiles(
            "lower",
        )
        .labels(type="lower", map_size=False)
        .palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027")
        .build()
        + ggsize(900, 900)
        + labs(
            title="Spearman Correlations Among High Cardinality Features",
            subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77), 
            living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59)
            """,
            x="Number of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
    )
)

Figure 5: Spearman Correlations Among High Cardinality Features

In part 3, we’ve performed some initial data exploration, and our next stride entails constructing a foundational machine learning model. In part 4, we’ll compare various algorithms to establish a benchmark for our subsequent endeavors in model building and feature engineering. This baseline model will serve as a reference point, guiding us as we strive to enhance our predictive capabilities. Looking forward to meeting you again in Part 4!