Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash

In part 1, we provided a brief introduction to the project’s purpose. Now, in part 2, we will dive deeper into the data processing steps required after scraping. We will discuss the handling of numerical data, categorical variables, and boolean values. Additionally, we’ll assess the data quality by examining the error log generated by the Immowebscraper class. Let’s get to it!

Note

You can access the project’s app through its Streamlit website.

Import data

Code

import time
from pathlib import Path

import numpy as np
import pandas as pd
from data import pre_process, utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

Data Pre-cleaning steps

In the upcoming section, we will delve into fundamental post-web scraping procedures. While scraping, we acquired 50 features, making our dataset information-rich. However, there’s significant work ahead due to the data’s lack of cleanliness. This entails tasks such as dtype conversion, column parsing to extract numerical values, and the transformation of Boolean values into binary variables.
Here is the dataset we have gathered through web scraping:

Code

for filename in utils.Configuration.RAW_DATA_PATH.glob("*.gzip"):
    if "for_NB2" in filename.stem:
        df = pd.read_parquet(filename)
print(df.shape)
df.head().style.set_sticky(axis=0)

(3906, 50)

	Available as of	Construction year	Building condition	Street frontage width	Number of frontages	Covered parking spaces	Outdoor parking spaces	Surroundings type	Living area	Living room surface	Dining room	Kitchen type	Kitchen surface	Bedrooms	Bedroom 1 surface	Bedroom 2 surface	Bedroom 3 surface	Bathrooms	Toilets	Office	Basement	Furnished	Surface of the plot	Connection to sewer network	Gas, water & electricity	Garden surface	TV cable	Primary energy consumption	Energy class	Reference number of the EPC report	CO₂ emission	Yearly theoretical total energy consumption	Heating type	Double glazing	Price	Cadastral income	Tenement building	Address	Website	External reference	day_of_retrieval	ad_url	As built plan	Latest land use designation	Proceedings for breach of planning regulations	Possible priority purchase right	Subdivision permit	Flood zone type	Planning permission obtained	Width of the lot on the street
0	After signing the deed	1975	Good	23 m	4	2	2	Living area (residential, urban or rural)	420 m² square meters	63 m² square meters	Yes	USA hyper equipped	21 m² square meters	5	28 m² square meters	24 m² square meters	21 m² square meters	4	5	Yes	Yes	No	4677 m² square meters	Not connected	Yes	4300 m² square meters	Yes	296 kWh/m² kilowatt hour per square meters	D	20210910012683	74 kg CO₂/m²	133062 kWh/year	Fuel oil	Yes	€ 1,225,000 1225000 €	€ 7,615 7615 €	No	drève richelle 96 1410 - Waterloo	http://www.bytheway.be	5530019	2023-09-27 14:02:10.370650	https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811	None	None	None	None	None	None	None	None
1	After signing the deed	1850	To renovate	9 m	2	1	None	Urban	113 m² square meters	None	Yes	Semi equipped	None	2	14 m² square meters	10 m² square meters	None	1	1	None	None	No	73 m² square meters	Connected	None	None	Yes	307 kWh/m² kilowatt hour per square meters	D	20230306014621	76 kg CO₂/m²	43587 kWh/year	Fuel oil	Yes	Make an offer starting from € 89,000 Make an offer starting from 89000 €	€ 541 541 €	No	Chaussée de l'Ourthe 65 6900 - Marche-en-Famenne	http://www.weinvest.be	5246367	2023-09-27 14:02:10.641758	https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929	No	Living area (residential, urban or rural)	No	No	No	None	None	10 m meters
2	After signing the deed	1949	To renovate	18 m	3	1	None	Isolated	139 m² square meters	10 m² square meters	Yes	Semi equipped	13 m² square meters	2	17 m² square meters	11 m² square meters	None	1	1	Yes	Yes	No	413 m² square meters	Connected	None	315 m² square meters	Yes	699 kWh/m² kilowatt hour per square meters	G	20230303013078	173 kg CO₂/m²	96913 kWh/year	Fuel oil	Yes	Make an offer starting from € 150,000 Make an offer starting from 150000 €	€ 689 689 €	No	Rue de la Wallonie 2A 4680 - Oupeye	http://www.nigel-immo.be	5534704	2023-09-27 14:02:10.905808	https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582	None	Living area (residential, urban or rural)	No	No	No	Non flood zone	None	18 m meters
3	After signing the deed	1899	Good	4.2 m	2	None	None	Urban	119 m² square meters	14 m² square meters	None	USA installed	17 m² square meters	3	14 m² square meters	14 m² square meters	10 m² square meters	2	2	None	None	No	55 m² square meters	Connected	Yes	None	Yes	246 kWh/m² kilowatt hour per square meters	C	20230622-0002923649-RES-1	4167 kg CO₂/m²	Not specified	Gas	Yes	€ 272,000 272000 €	€ 898 898 €	No	Sint-Denijslaan 1 9000 - Gent	http://www.immodavinci.be	5535455	2023-09-27 14:02:11.811691	https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392	None	Living area (residential, urban or rural)	No	Yes	No	Non flood zone	Yes	None
4	At delivery	2021	As new	7.3 m	3	None	1	Living area (residential, urban or rural)	215 m² square meters	46 m² square meters	None	USA hyper equipped	None	3	16 m² square meters	12 m² square meters	12 m² square meters	1	2	None	None	No	330 m² square meters	Connected	Yes	250 m² square meters	Yes	27 kWh/m² kilowatt hour per square meters	A	12345	Not specified	Not specified	Gas	Yes	€ 413,150 413150 €	None	No	Astridlaan 1 1700 - Dilbeek	http://www.living-stone.be	5527171	2023-09-27 14:02:11.968969	https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333	None	Living area (residential, urban or rural)	None	No	Yes	Non flood zone	Yes	None

Based on the dataset, we’ve identified two primary tasks that need to be performed across multiple columns:

Handling Numerical Columns: This involves extracting numerical data using regex and converting it to float format.
Dealing with Binary Columns: Many columns contain binary values, such as “Yes” and “No.” We can easily convert these columns to boolean data types instead of string representations.
Special Handling for Certain Columns: Some columns, like “flood_zone_type” and “connection_to_sewer_network,” also have low cardinality and should be converted to boolean values. However, their values do not align with the typical “True” and “False” boolean mapping. Instead, they require a unique dictionary mapping compared to the other boolean columns.

The pre_process_dataframe function below serves as a solid starting point. Some may suggest that breaking it into multiple subfunctions could improve maintainability and enable unit testing. However, for the time being, we’ll maintain it in its current form.

Code

def pre_process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocesses a DataFrame by performing various data cleaning and transformation tasks.

    Args:
        df (pandas.DataFrame): The input DataFrame to be preprocessed.

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """

    def extract_numbers(df: pd.DataFrame, columns: list):
        """
        Extracts numeric values from specified columns in the DataFrame.

        Args:
            df (pandas.DataFrame): The DataFrame to extract values from.
            columns (list): List of column names to extract numeric values from.

        Returns:
            pandas.DataFrame: The DataFrame with extracted numeric values.
        """
        for column in columns:
            try:
                df[column] = df[column].str.extract(r"(\d+)").astype("float32")
            except Exception as e:
                print(f"Error processing column {column}: {e}")
        return df

    def map_values(df: pd.DataFrame, columns: list):
        """
        Maps boolean values in specified columns to True, False, or None.

        Args:
            df (pandas.DataFrame): The DataFrame to map values in.
            columns (list): List of column names with boolean values to be mapped.

        Returns:
            pandas.DataFrame: The DataFrame with mapped boolean values.
        """
        for column in columns:
            try:
                df[column] = df[column].map({"Yes": True, None: False, "No": False})
            except Exception as e:
                print(f"Error processing column {column}: {e}")
        return df

    number_columns = [
        "construction_year",
        "street_frontage_width",
        "number_of_frontages",
        "covered_parking_spaces",
        "outdoor_parking_spaces",
        "living_area",
        "living_room_surface",
        "kitchen_surface",
        "bedrooms",
        "bedroom_1_surface",
        "bedroom_2_surface",
        "bedroom_3_surface",
        "bathrooms",
        "toilets",
        "surface_of_the_plot",
        "width_of_the_lot_on_the_street",
        "garden_surface",
        "primary_energy_consumption",
        "co2_emission",
        "yearly_theoretical_total_energy_consumption",
    ]

    boolean_columns = [
        "basement",
        "furnished",
        "gas_water__electricity",
        "double_glazing",
        "planning_permission_obtained",
        "tv_cable",
        "dining_room",
        "proceedings_for_breach_of_planning_regulations",
        "subdivision_permit",
        "tenement_building",
        "possible_priority_purchase_right",
    ]

    return (
        df.sort_index(axis=1)
        .fillna(np.nan)
        .rename(
            columns=lambda column: column.lower()
            .replace(" ", "_")
            .replace("&", "")
            .replace(",", "")
        )
        .rename(columns={"co₂_emission": "co2_emission"})
        .pipe(lambda df: extract_numbers(df, number_columns))
        .pipe(lambda df: map_values(df, boolean_columns))
        .assign(
            flood_zone_type=lambda df: df.flood_zone_type.map(
                {
                    "Non flood zone": False,
                    "No": False,
                    "Possible flood zone": True,
                }
            ),
            connection_to_sewer_network=lambda df: df.connection_to_sewer_network.map(
                {
                    "Connected": True,
                    "Not connected": False,
                }
            ),
            as_built_plan=lambda df: df.as_built_plan.map(
                {
                    "Yes, conform": True,
                    "No": False,
                }
            ),
            cadastral_income=lambda df: df.cadastral_income.str.split(" ", expand=True)[
                3
            ].astype("float32"),
            price=lambda df: df.price.str.rsplit(" ", expand=True, n=2)[1].astype(
                float
            ),
        )
    )


df_pre_processed = pre_process_dataframe(df)
df_pre_processed.head().style.set_sticky(axis=0)

	address	as_built_plan	available_as_of	basement	bathrooms	bedroom_1_surface	bedroom_2_surface	bedroom_3_surface	bedrooms	building_condition	co2_emission	cadastral_income	connection_to_sewer_network	construction_year	covered_parking_spaces	dining_room	double_glazing	energy_class	external_reference	flood_zone_type	furnished	garden_surface	gas_water__electricity	heating_type	kitchen_surface	kitchen_type	latest_land_use_designation	living_area	living_room_surface	number_of_frontages	office	outdoor_parking_spaces	planning_permission_obtained	possible_priority_purchase_right	price	primary_energy_consumption	proceedings_for_breach_of_planning_regulations	reference_number_of_the_epc_report	street_frontage_width	subdivision_permit	surface_of_the_plot	surroundings_type	tv_cable	tenement_building	toilets	website	width_of_the_lot_on_the_street	yearly_theoretical_total_energy_consumption	ad_url	day_of_retrieval
0	drève richelle 96 1410 - Waterloo	nan	After signing the deed	True	4.000000	28.000000	24.000000	21.000000	5.000000	Good	74.000000	7615.000000	False	1975.000000	2.000000	True	True	D	5530019	nan	False	4300.000000	True	Fuel oil	21.000000	USA hyper equipped	nan	420.000000	63.000000	4.000000	Yes	2.000000	nan	nan	1225000.000000	296.000000	nan	20210910012683	23.000000	nan	4677.000000	Living area (residential, urban or rural)	True	False	5.000000	http://www.bytheway.be	nan	133062.000000	https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811	2023-09-27 14:02:10.370650
1	Chaussée de l'Ourthe 65 6900 - Marche-en-Famenne	False	After signing the deed	nan	1.000000	14.000000	10.000000	nan	2.000000	To renovate	76.000000	541.000000	True	1850.000000	1.000000	True	True	D	5246367	nan	False	nan	nan	Fuel oil	nan	Semi equipped	Living area (residential, urban or rural)	113.000000	nan	2.000000	nan	nan	nan	False	89000.000000	307.000000	False	20230306014621	9.000000	False	73.000000	Urban	True	False	1.000000	http://www.weinvest.be	10.000000	43587.000000	https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929	2023-09-27 14:02:10.641758
2	Rue de la Wallonie 2A 4680 - Oupeye	nan	After signing the deed	True	1.000000	17.000000	11.000000	nan	2.000000	To renovate	173.000000	689.000000	True	1949.000000	1.000000	True	True	G	5534704	False	False	315.000000	nan	Fuel oil	13.000000	Semi equipped	Living area (residential, urban or rural)	139.000000	10.000000	3.000000	Yes	nan	nan	False	150000.000000	699.000000	False	20230303013078	18.000000	False	413.000000	Isolated	True	False	1.000000	http://www.nigel-immo.be	18.000000	96913.000000	https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582	2023-09-27 14:02:10.905808
3	Sint-Denijslaan 1 9000 - Gent	nan	After signing the deed	nan	2.000000	14.000000	14.000000	10.000000	3.000000	Good	4167.000000	898.000000	True	1899.000000	nan	nan	True	C	5535455	False	False	nan	True	Gas	17.000000	USA installed	Living area (residential, urban or rural)	119.000000	14.000000	2.000000	nan	nan	True	True	272000.000000	246.000000	False	20230622-0002923649-RES-1	4.000000	False	55.000000	Urban	True	False	2.000000	http://www.immodavinci.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392	2023-09-27 14:02:11.811691
4	Astridlaan 1 1700 - Dilbeek	nan	At delivery	nan	1.000000	16.000000	12.000000	12.000000	3.000000	As new	nan	nan	True	2021.000000	nan	nan	True	A	5527171	False	False	250.000000	True	Gas	nan	USA hyper equipped	Living area (residential, urban or rural)	215.000000	46.000000	3.000000	nan	1.000000	True	False	413150.000000	27.000000	nan	12345	7.000000	True	330.000000	Living area (residential, urban or rural)	True	False	2.000000	http://www.living-stone.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333	2023-09-27 14:02:11.968969

A crucial task on our agenda is the thorough parsing of address information. This is where the separate_address function comes into play. While this function provides a solid foundation for parsing location details, we’ve found that achieving high accuracy using regex alone can be challenging. For this reason, we’ve chosen to leverage the Google Maps API to extract location details effectively.

The function that’s integrated into our final pipeline is get_location_details_from_google, which can be located in the pre_process.py module. This step is essential for extracting key details such as the city, ZIP code, house number, and stre reliably with high accuracyet. As we all know, location plays a vital role in real estate price estimation.

It’s worth mentioning that we’ve removed the original address field to eliminate redundant data and ensure our dataset is more streamlined and efficions.

Code

def separate_address(df: pd.DataFrame) -> pd.DataFrame:
    """Separates the address into city, street name, house number, and zip code.

    Args:
        df (pd.DataFrame): The DataFrame containing the address column.

    Returns:
        pd.DataFrame: The DataFrame with the address separated into different columns.
    """
    # Define a regular expression pattern to extract street, house number, and zip code
    pattern = r"(?P<street_name>.*?)\s*(?P<house_number>\d+\w*)?\s*(?P<zip>\d{4})"

    try:
        return df.assign(
            city=lambda df: df.address.str.rsplit("-", expand=True, n=1)[1].str.title(),
            **(lambda dfx: dfx.rename(columns={"address": "original_address"}))(
                df["address"].str.extract(pattern)
            ),
            street=lambda df: df.street_name.str.replace(
                r"[^a-zA-Z\s]", "", regex=True
            ),
        ).drop(columns=["street_name", "address"])
    except Exception as e:
        print(f"Error separating address: {e}")
        return df


finer_pre_cleaned = separate_address(df_pre_processed)
finer_pre_cleaned.head().style.set_sticky(axis=0)

	as_built_plan	available_as_of	basement	bathrooms	bedroom_1_surface	bedroom_2_surface	bedroom_3_surface	bedrooms	building_condition	co2_emission	cadastral_income	connection_to_sewer_network	construction_year	covered_parking_spaces	dining_room	double_glazing	energy_class	external_reference	flood_zone_type	furnished	garden_surface	gas_water__electricity	heating_type	kitchen_surface	kitchen_type	latest_land_use_designation	living_area	living_room_surface	number_of_frontages	office	outdoor_parking_spaces	planning_permission_obtained	possible_priority_purchase_right	price	primary_energy_consumption	proceedings_for_breach_of_planning_regulations	reference_number_of_the_epc_report	street_frontage_width	subdivision_permit	surface_of_the_plot	surroundings_type	tv_cable	tenement_building	toilets	website	width_of_the_lot_on_the_street	yearly_theoretical_total_energy_consumption	ad_url	day_of_retrieval	city	house_number	zip	street
0	nan	After signing the deed	True	4.000000	28.000000	24.000000	21.000000	5.000000	Good	74.000000	7615.000000	False	1975.000000	2.000000	True	True	D	5530019	nan	False	4300.000000	True	Fuel oil	21.000000	USA hyper equipped	nan	420.000000	63.000000	4.000000	Yes	2.000000	nan	nan	1225000.000000	296.000000	nan	20210910012683	23.000000	nan	4677.000000	Living area (residential, urban or rural)	True	False	5.000000	http://www.bytheway.be	nan	133062.000000	https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811	2023-09-27 14:02:10.370650	Waterloo	96	1410	drve richelle
1	False	After signing the deed	nan	1.000000	14.000000	10.000000	nan	2.000000	To renovate	76.000000	541.000000	True	1850.000000	1.000000	True	True	D	5246367	nan	False	nan	nan	Fuel oil	nan	Semi equipped	Living area (residential, urban or rural)	113.000000	nan	2.000000	nan	nan	nan	False	89000.000000	307.000000	False	20230306014621	9.000000	False	73.000000	Urban	True	False	1.000000	http://www.weinvest.be	10.000000	43587.000000	https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929	2023-09-27 14:02:10.641758	Famenne	65	6900	Chausse de lOurthe
2	nan	After signing the deed	True	1.000000	17.000000	11.000000	nan	2.000000	To renovate	173.000000	689.000000	True	1949.000000	1.000000	True	True	G	5534704	False	False	315.000000	nan	Fuel oil	13.000000	Semi equipped	Living area (residential, urban or rural)	139.000000	10.000000	3.000000	Yes	nan	nan	False	150000.000000	699.000000	False	20230303013078	18.000000	False	413.000000	Isolated	True	False	1.000000	http://www.nigel-immo.be	18.000000	96913.000000	https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582	2023-09-27 14:02:10.905808	Oupeye	2A	4680	Rue de la Wallonie
3	nan	After signing the deed	nan	2.000000	14.000000	14.000000	10.000000	3.000000	Good	4167.000000	898.000000	True	1899.000000	nan	nan	True	C	5535455	False	False	nan	True	Gas	17.000000	USA installed	Living area (residential, urban or rural)	119.000000	14.000000	2.000000	nan	nan	True	True	272000.000000	246.000000	False	20230622-0002923649-RES-1	4.000000	False	55.000000	Urban	True	False	2.000000	http://www.immodavinci.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392	2023-09-27 14:02:11.811691	Gent	1	9000	SintDenijslaan
4	nan	At delivery	nan	1.000000	16.000000	12.000000	12.000000	3.000000	As new	nan	nan	True	2021.000000	nan	nan	True	A	5527171	False	False	250.000000	True	Gas	nan	USA hyper equipped	Living area (residential, urban or rural)	215.000000	46.000000	3.000000	nan	1.000000	True	False	413150.000000	27.000000	nan	12345	7.000000	True	330.000000	Living area (residential, urban or rural)	True	False	2.000000	http://www.living-stone.be	nan	nan	https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333	2023-09-27 14:02:11.968969	Dilbeek	1	1700	Astridlaan

Inspecting data quality

Reading in and inspecting the log file

Analyzing the error log file, we’ve identified a total of 3,515 errors encountered during the web scraping process on the Immoweb website. Let’s delve into these errors to pinpoint the most common issues and address them accordingly.

Code

error_log = pd.read_table(
    utils.Configuration.RAW_DATA_PATH.joinpath("make_dataset_error_for_NB2.log"),
    header=None,
).rename(columns={0: "error"})

error_log

	error
0	2023-09-27 14:02:11 - ERROR - No tables found ...
1	2023-09-27 14:02:17 - ERROR - No tables found ...
2	2023-09-27 14:02:22 - ERROR - No tables found ...
3	2023-09-27 14:02:25 - ERROR - No tables found ...
4	2023-09-27 14:02:34 - ERROR - No tables found ...
...	...
3510	2023-09-27 16:43:45 - ERROR - Duplicate labels...
3511	2023-09-27 16:43:57 - ERROR - No tables found ...
3512	2023-09-27 16:43:57 - ERROR - Duplicate labels...
3513	2023-09-27 16:43:57 - ERROR - Duplicate labels...
3514	2023-09-27 16:44:03 - ERROR - No tables found ...

3515 rows × 1 columns

Most common errors from log file

It’s clear that a significant majority of the errors, accounting for 1,848 cases, result from the absence of tables on the pages. These errors are primarily found on listing ads and index pages. To address this issue, we’ve introduced an if clause into our method extract_ads_from_given_page, which can be found in the make_dataset.py module. The clause, if "immoweb.be" in item and "https://www.immoweb.be/en/search" not in item, enables us to filter out undesired pages that don’t contain relevant table information for our ads. This not only helps mitigate errors but also speeds up the dataset collection process by reducing the number of pages we scrape.

Another category of errors, totaling 1,460 cases, is related to the presence of duplicate labels during processing. We may need to investigate this issue further at a later stage to ensure data quality and accuracy.

A smaller proportion of errors is linked to the “Empty data” message, primarily related to ads. Finally, the remaining errors encompass errorrs related to data type conversion. We can consider either leaving these columns as is, since the error is not that frequent, or removing these features altogether.

Code

(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[0]
    .value_counts()
)

0
 No tables found while processing                                                                                                        1848
 Duplicate labels found while processing                                                                                                 1460
 Empty data while processing                                                                                                               18
 An error occurred on page 327: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 255: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
                                                                                                                                         ... 
 An error occurred on page 173: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 174: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 176: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 177: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 331: ("Expected bytes, got a 'int' object", 'Conversion failed for column Outdoor parking spaces with type       1
Name: count, Length: 192, dtype: int64

We’ve identified five columns responsible for the last set of data type conversion problems (See below):

Construction year
Number of frontages
Outdoor parking spaces
Covered parking spaces
Bedrooms

Code

(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[0]
    .value_counts()[3:]
    .to_frame()
    .reset_index()
    .rename(columns={0: "error_type"})
    .assign(
        error_type=lambda df: df.error_type.str.split(",", expand=True)[2].str.split(
            " ", n=5, expand=True
        )[5]
    )
    .error_type.value_counts()
)

error_type
Construction year with type         83
Number of frontages with type       65
Outdoor parking spaces with type     7
Covered parking spaces with type     7
Bedrooms with type                   1
Name: count, dtype: int64

Unique URLs from the error logs

Upon conducting a more comprehensive analysis of the URLs extracted from error messages, a noteworthy observation comes to light: we’ve encountered only 433 unique URLs. This suggests that the 3,515 errors are stemming from a relatively restricted set of web addresses.

Now, here’s a question that arises: the website implies the presence of 10,000 ads on the page. However, given our successful extraction of only 3,906 ads, along with the 433 URLs associated with errors, there is a substantial disparity evident.

If you have any insights or hypotheses regarding this difference, I’d be eager to hear your thoughts and discuss potential reasons for this variation.

Code

(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[1]
    .unique()
    .shape
)

(433,)

Inspecting the data itself

After eliminating rows where all values were missing and filtering for rows with non-missing prices, we’ve successfully refined our dataset to include 3,660 ads.

In our subsequent analysis, we focus on the features with the lowest percentage of missing da, just like we did in Part 1ta. Notably, “day of retrieval” and “price” are complete, with all values present. However, it’s important to recognize that roughly one-third of the data related to “dining_room” and “office” is missing, highlighting the need for improving data completeness in these specific attributet.s.

Code

(
    finer_pre_cleaned.dropna(axis=0, how="all")
    .query("price.notna()")
    .notna()
    .sum()
    .sort_values()
    .div(3660)
    .mul(100)
    .round(1)
)

dining_room                                        29.3
office                                             29.5
planning_permission_obtained                       32.1
tv_cable                                           34.0
proceedings_for_breach_of_planning_regulations     36.3
subdivision_permit                                 38.2
yearly_theoretical_total_energy_consumption        39.5
width_of_the_lot_on_the_street                     42.4
co2_emission                                       43.0
connection_to_sewer_network                        44.7
possible_priority_purchase_right                   45.5
street_frontage_width                              46.5
basement                                           46.6
as_built_plan                                      47.0
latest_land_use_designation                        47.1
garden_surface                                     47.4
outdoor_parking_spaces                             48.0
furnished                                          48.4
surroundings_type                                  50.8
flood_zone_type                                    54.4
bedroom_3_surface                                  54.8
covered_parking_spaces                             54.9
kitchen_surface                                    58.2
available_as_of                                    58.3
living_room_surface                                64.7
bedroom_2_surface                                  66.6
gas_water__electricity                             67.0
bedroom_1_surface                                  68.2
construction_year                                  71.0
cadastral_income                                   78.3
kitchen_type                                       83.4
double_glazing                                     84.6
website                                            84.6
heating_type                                       87.2
external_reference                                 90.3
toilets                                            90.8
number_of_frontages                                94.6
bathrooms                                          94.9
house_number                                       94.9
primary_energy_consumption                         96.0
building_condition                                 96.0
surface_of_the_plot                                96.6
living_area                                        98.2
tenement_building                                  99.1
zip                                                99.5
city                                               99.5
bedrooms                                           99.5
street                                             99.5
reference_number_of_the_epc_report                 99.9
energy_class                                       99.9
price                                             100.0
ad_url                                            100.0
day_of_retrieval                                  100.0
dtype: float64

Our filter_out_missing_indexes function proves to be quite valuable in the post-processing of our scraped data. This function is located at the end of our pre-process chain saving our data to the INTERIM_DATA_PATH folder after we’ve completed pre-processing and removed missing values.

Code

def filter_out_missing_indexes(
    df: pd.DataFrame,
    filepath: Path = utils.Configuration.INTERIM_DATA_PATH.joinpath(
        f"{str(pd.Timestamp.now())[:10]}_Processed_dataset.parquet.gzip"
    ),
) -> pd.DataFrame:
    """
    Filter out rows with missing values in a DataFrame and save the processed dataset.

    This function filters out rows with all missing values (NaN) and retains only rows
    with non-missing values in the 'price' column. The resulting DataFrame is then saved
    in Parquet format with gzip compression.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filepath (Path, optional): The path to save the processed dataset in Parquet format.
            Defaults to a timestamp-based filepath in the interim data directory.

    Returns:
        pd.DataFrame: The filtered DataFrame with missing rows removed.

    Example:
        To filter out missing rows and save the processed dataset:
        >>> data = pd.read_csv("raw_data.csv")
        >>> filtered_data = filter_out_missing_indexes(data)
        >>> print(filtered_data.head())

    Notes:
        - Rows with missing values in any column other than 'price' are removed.
        - The processed dataset is saved with gzip compression to conserve disk space.
    """
    processed_df = df.dropna(axis=0, how="all").query("price.notna()")
    processed_df.to_parquet(filepath, compression="gzip", index=False)
    return processed_df

It appears that we’ve successfully completed the data transformation phase of our scraped dataset. With the implementation of functions like filter_out_missing_indexes, alongside pre_process_dataframe and separate_address, we’ve assembled the essential tools required for preparing our dataset for the machine learning pipeline.

In Part 3, we’ll provide a fundamental overview and characterization of the cleaned scraped data. We’ll assess feature cardinality, examine distributions, and explore correlations among variables. I look forward to delving into these insights with you in the next installment. See you there!