---
title: 'Predicting Belgian Real Estate Prices: Part 2: Data Cleanup After Web Scraping'
author: Adam Cseresznye
date: '2023-11-04'
categories:
- Predicting Belgian Real Estate Prices
jupyter: python3
toc: true
format:
html:
code-fold: true
code-tools: true
---
![Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}
In part 1, we provided a brief introduction to the project's purpose. Now, in part 2, we will dive deeper into the data processing steps required after scraping. We will discuss the handling of numerical data, categorical variables, and boolean values. Additionally, we'll assess the data quality by examining the error log generated by the `Immowebscraper` class. Let's get to it!
::: {.callout-note}
You can access the project's app through its [Streamlit website](https://belgian-house-price-predictor.streamlit.app/).
:::
# Import data
```{python}
#| editable: true
#| slideshow: {slide_type: ''}
#| tags: []
import time
from pathlib import Path
import numpy as np
import pandas as pd
from data import pre_process, utils
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()
```
# Data Pre-cleaning steps
In the upcoming section, we will delve into the fundamental post-scraping procedures. Scraping left us with 50 features, making our dataset information-rich. However, there's significant work ahead because the data is far from clean: we need dtype conversions, column parsing to extract numerical values, and the transformation of Boolean-like strings into binary variables.
Here is the dataset we have gathered through web scraping:
```{python}
for filename in utils.Configuration.RAW_DATA_PATH.glob("*.gzip"):
if "for_NB2" in filename.stem:
df = pd.read_parquet(filename)
print(df.shape)
df.head().style.set_sticky(axis=0)
```
Based on the dataset, we've identified two primary tasks that need to be performed across multiple columns:
1. **Handling Numerical Columns:** This involves extracting numerical data using regex and converting it to float format.
2. **Dealing with Binary Columns:** Many columns contain binary values, such as "Yes" and "No." We can easily convert these columns to boolean data types instead of string representations.
3. **Special Handling for Certain Columns:** Some columns, like "flood_zone_type" and "connection_to_sewer_network," also have low cardinality and should be converted to boolean values. However, their values do not align with the typical "True" and "False" boolean mapping. Instead, they require a unique dictionary mapping compared to the other boolean columns.
The `pre_process_dataframe` function below serves as a solid starting point. Some may suggest that breaking it into multiple subfunctions could improve maintainability and enable unit testing. However, for the time being, we'll maintain it in its current form.
```{python}
def pre_process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
"""
Preprocesses a DataFrame by performing various data cleaning and transformation tasks.
Args:
df (pandas.DataFrame): The input DataFrame to be preprocessed.
Returns:
pandas.DataFrame: The preprocessed DataFrame.
"""
def extract_numbers(df: pd.DataFrame, columns: list):
"""
Extracts numeric values from specified columns in the DataFrame.
Args:
df (pandas.DataFrame): The DataFrame to extract values from.
columns (list): List of column names to extract numeric values from.
Returns:
pandas.DataFrame: The DataFrame with extracted numeric values.
"""
for column in columns:
try:
                # note: r"(\d+)" captures only the first run of digits, so "4.2 m" becomes 4.0
                df[column] = df[column].str.extract(r"(\d+)").astype("float32")
except Exception as e:
print(f"Error processing column {column}: {e}")
return df
def map_values(df: pd.DataFrame, columns: list):
"""
Maps boolean values in specified columns to True, False, or None.
Args:
df (pandas.DataFrame): The DataFrame to map values in.
columns (list): List of column names with boolean values to be mapped.
Returns:
pandas.DataFrame: The DataFrame with mapped boolean values.
"""
for column in columns:
try:
df[column] = df[column].map({"Yes": True, None: False, "No": False})
except Exception as e:
print(f"Error processing column {column}: {e}")
return df
number_columns = [
"construction_year",
"street_frontage_width",
"number_of_frontages",
"covered_parking_spaces",
"outdoor_parking_spaces",
"living_area",
"living_room_surface",
"kitchen_surface",
"bedrooms",
"bedroom_1_surface",
"bedroom_2_surface",
"bedroom_3_surface",
"bathrooms",
"toilets",
"surface_of_the_plot",
"width_of_the_lot_on_the_street",
"garden_surface",
"primary_energy_consumption",
"co2_emission",
"yearly_theoretical_total_energy_consumption",
]
boolean_columns = [
"basement",
"furnished",
"gas_water__electricity",
"double_glazing",
"planning_permission_obtained",
"tv_cable",
"dining_room",
"proceedings_for_breach_of_planning_regulations",
"subdivision_permit",
"tenement_building",
"possible_priority_purchase_right",
]
return (
df.sort_index(axis=1)
.fillna(np.nan)
.rename(
columns=lambda column: column.lower()
.replace(" ", "_")
.replace("&", "")
.replace(",", "")
)
.rename(columns={"co₂_emission": "co2_emission"})
.pipe(lambda df: extract_numbers(df, number_columns))
.pipe(lambda df: map_values(df, boolean_columns))
.assign(
flood_zone_type=lambda df: df.flood_zone_type.map(
{
"Non flood zone": False,
"No": False,
"Possible flood zone": True,
}
),
connection_to_sewer_network=lambda df: df.connection_to_sewer_network.map(
{
"Connected": True,
"Not connected": False,
}
),
as_built_plan=lambda df: df.as_built_plan.map(
{
"Yes, conform": True,
"No": False,
}
),
cadastral_income=lambda df: df.cadastral_income.str.split(" ", expand=True)[
3
].astype("float32"),
price=lambda df: df.price.str.rsplit(" ", expand=True, n=2)[1].astype(
float
),
)
)
df_pre_processed = pre_process_dataframe(df)
df_pre_processed.head().style.set_sticky(axis=0)
```
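If we did split those helpers out later, unit testing would become straightforward. Below is a minimal sketch of what a test for `extract_numbers` might look like, assuming the helper were promoted to module level; the fixture values are illustrative, not taken from the pipeline:

```python
import numpy as np
import pandas as pd

def test_extract_numbers_parses_units():
    # illustrative fixture mimicking the scraped "420 m² square meters" style
    df = pd.DataFrame({"living_area": ["420 m² square meters", None]})
    result = extract_numbers(df, ["living_area"])  # assumes a module-level helper
    assert result["living_area"].dtype == np.float32
    assert result["living_area"].iloc[0] == 420.0
    assert np.isnan(result["living_area"].iloc[1])
```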
A crucial task on our agenda is the thorough parsing of address information. This is where the `separate_address` function comes into play. While this function provides a solid foundation for parsing location details, we've found that achieving high accuracy using regex alone can be challenging. For this reason, we've chosen to leverage the Google Maps API to extract location details effectively.
The function that's integrated into our final pipeline is `get_location_details_from_google`, which can be found in the `pre_process.py` module. This step is essential for reliably extracting key details such as the city, ZIP code, house number, and street with high accuracy. As we all know, location plays a vital role in real estate price estimation.
It's worth mentioning that we've removed the original address field to eliminate redundant data and keep our dataset streamlined and efficient.
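For context, here is a minimal sketch of how such a lookup could work with the `googlemaps` client library. The key, function name, and return shape are illustrative assumptions; the actual `get_location_details_from_google` in `pre_process.py` may differ:

```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # hypothetical API key

def lookup_address_components(address: str) -> dict:
    """Map one free-text address to city, zip, street, and house number."""
    results = gmaps.geocode(address)
    if not results:
        return {}
    wanted = {
        "locality": "city",
        "postal_code": "zip",
        "route": "street",
        "street_number": "house_number",
    }
    # each address component carries a list of types, e.g. ["locality", "political"]
    return {
        wanted[t]: component["long_name"]
        for component in results[0]["address_components"]
        for t in component["types"]
        if t in wanted
    }

# lookup_address_components("Sint-Denijslaan 1 9000 - Gent") might return
# {"street": "Sint-Denijslaan", "house_number": "1", "zip": "9000", "city": "Gent"}
```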
```{python}
def separate_address(df: pd.DataFrame) -> pd.DataFrame:
"""Separates the address into city, street name, house number, and zip code.
Args:
df (pd.DataFrame): The DataFrame containing the address column.
Returns:
pd.DataFrame: The DataFrame with the address separated into different columns.
"""
# Define a regular expression pattern to extract street, house number, and zip code
pattern = r"(?P<street_name>.*?)\s*(?P<house_number>\d+\w*)?\s*(?P<zip>\d{4})"
try:
return df.assign(
city=lambda df: df.address.str.rsplit("-", expand=True, n=1)[1].str.title(),
**(lambda dfx: dfx.rename(columns={"address": "original_address"}))(
df["address"].str.extract(pattern)
),
            street=lambda df: df.street_name.str.replace(
                # note: this also strips accented letters (e.g. "drève" -> "drve")
                r"[^a-zA-Z\s]", "", regex=True
            ),
).drop(columns=["street_name", "address"])
except Exception as e:
print(f"Error separating address: {e}")
return df
finer_pre_cleaned = separate_address(df_pre_processed)
finer_pre_cleaned.head().style.set_sticky(axis=0)
```
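To illustrate what the named-group pattern extracts, here is a quick, illustrative run on one address in the scraped format:

```python
import pandas as pd

sample = pd.Series(["Rue de la Wallonie 2A 4680 - Oupeye"])
pattern = r"(?P<street_name>.*?)\s*(?P<house_number>\d+\w*)?\s*(?P<zip>\d{4})"
print(sample.str.extract(pattern))
# expected: street_name "Rue de la Wallonie", house_number "2A", zip "4680"
```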
# Inspecting data quality
## Reading in and inspecting the log file
Analyzing the error log file, we've identified a total of 3,515 errors encountered during the web scraping process on the Immoweb website. Let's delve into these errors to pinpoint the most common issues and address them accordingly.
```{python}
error_log = pd.read_table(
utils.Configuration.RAW_DATA_PATH.joinpath("make_dataset_error_for_NB2.log"),
header=None,
).rename(columns={0: "error"})
error_log
```
### Most common errors from log file
It's clear that a significant majority of the errors, accounting for 1,848 cases, result from the absence of tables on the pages. These errors are primarily found on listing ads and index pages. To address this issue, we've introduced an `if` clause into our method `extract_ads_from_given_page`, which can be found in the `make_dataset.py` module. The clause, `if "immoweb.be" in item and "https://www.immoweb.be/en/search" not in item`, enables us to filter out undesired pages that don't contain relevant table information for our ads. This not only helps mitigate errors but also speeds up the dataset collection process by reducing the number of pages we scrape.
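In code, the guard amounts to something like the following sketch (the sample URLs and the `all_links` name are illustrative; the actual variable in `make_dataset.py` may differ):

```python
all_links = [  # illustrative sample of URLs collected from one listing page
    "https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392",
    "https://www.immoweb.be/en/search/house/for-sale?page=2",
    "https://example.com/banner",
]
ad_links = [
    item
    for item in all_links
    if "immoweb.be" in item and "https://www.immoweb.be/en/search" not in item
]
print(ad_links)  # only the classified-ad URL survives
```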
Another category of errors, totaling 1,460 cases, is related to the presence of duplicate labels during processing. We may need to investigate this issue further at a later stage to ensure data quality and accuracy.
A smaller proportion of errors is linked to the "Empty data" message, primarily related to ads. Finally, the remaining errors relate to data type conversion. We can consider either leaving these columns as they are, since the errors are infrequent, or removing these features altogether.
```{python}
(
error_log.error.str.split("-", expand=True)[4]
.str.rsplit(" ", n=1, expand=True)[0]
.value_counts()
)
```
We've identified five columns responsible for the last set of data type conversion problems (see the breakdown below):
1. Construction year (83 errors)
2. Number of frontages (65)
3. Outdoor parking spaces (7)
4. Covered parking spaces (7)
5. Bedrooms (1)
```{python}
(
error_log.error.str.split("-", expand=True)[4]
.str.rsplit(" ", n=1, expand=True)[0]
.value_counts()[3:]
.to_frame()
.reset_index()
.rename(columns={0: "error_type"})
.assign(
error_type=lambda df: df.error_type.str.split(",", expand=True)[2].str.split(
" ", n=5, expand=True
)[5]
)
.error_type.value_counts()
)
```
### Unique URLs from the error logs
Upon conducting a more comprehensive analysis of the URLs extracted from the error messages, a noteworthy observation comes to light: we've encountered only 433 unique URLs. This suggests that the 3,515 errors stem from a relatively restricted set of web addresses.
Now, here's a question that arises: the website implies the presence of around 10,000 ads, yet we successfully extracted only 3,906 of them, with just 433 unique URLs associated with errors. That leaves a substantial, unexplained gap.
If you have any insights or hypotheses regarding this difference, I'd be eager to hear your thoughts and discuss potential reasons for this variation.
```{python}
(
error_log.error.str.split("-", expand=True)[4]
.str.rsplit(" ", n=1, expand=True)[1]
.unique()
.shape
)
```
## Inspecting the data itself
After eliminating rows where all values were missing and filtering for rows with non-missing prices, we've successfully refined our dataset to include 3,660 ads.
In our subsequent analysis, we examine the percentage of non-missing values per feature, just like we did in Part 1. Notably, "day_of_retrieval" and "price" are complete, with all values present. At the other end of the spectrum, only about 30% of the values for "dining_room" and "office" are present, highlighting the need to improve data completeness for these attributes.
```{python}
(
finer_pre_cleaned.dropna(axis=0, how="all")
.query("price.notna()")
.notna()
.sum()
.sort_values()
.div(3660)
.mul(100)
.round(1)
)
```
Our `filter_out_missing_indexes` function proves to be quite valuable in the post-processing of our scraped data. It sits at the end of our pre-processing chain and saves the data to the `INTERIM_DATA_PATH` folder once pre-processing is complete and rows with missing values have been removed.
```{python}
def filter_out_missing_indexes(
df: pd.DataFrame,
filepath: Path = utils.Configuration.INTERIM_DATA_PATH.joinpath(
f"{str(pd.Timestamp.now())[:10]}_Processed_dataset.parquet.gzip"
),
) -> pd.DataFrame:
"""
Filter out rows with missing values in a DataFrame and save the processed dataset.
This function filters out rows with all missing values (NaN) and retains only rows
with non-missing values in the 'price' column. The resulting DataFrame is then saved
in Parquet format with gzip compression.
Args:
df (pd.DataFrame): The input DataFrame.
filepath (Path, optional): The path to save the processed dataset in Parquet format.
Defaults to a timestamp-based filepath in the interim data directory.
Returns:
pd.DataFrame: The filtered DataFrame with missing rows removed.
Example:
To filter out missing rows and save the processed dataset:
>>> data = pd.read_csv("raw_data.csv")
>>> filtered_data = filter_out_missing_indexes(data)
>>> print(filtered_data.head())
Notes:
- Rows with missing values in any column other than 'price' are removed.
- The processed dataset is saved with gzip compression to conserve disk space.
"""
processed_df = df.dropna(axis=0, how="all").query("price.notna()")
processed_df.to_parquet(filepath, compression="gzip", index=False)
return processed_df
```
It appears that we've successfully completed the data transformation phase of our scraped dataset. With the implementation of functions like `filter_out_missing_indexes`, alongside `pre_process_dataframe` and `separate_address`, we've assembled the essential tools required for preparing our dataset for the machine learning pipeline.
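Chained together, the whole transformation can run in one pass; a minimal sketch, assuming the raw scraped frame `df` loaded in the import step above:

```python
cleaned = (
    df.pipe(pre_process_dataframe)
    .pipe(separate_address)
    .pipe(filter_out_missing_indexes)
)
```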
In Part 3, we'll provide a fundamental overview and characterization of the cleaned scraped data. We'll assess feature cardinality, examine distributions, and explore correlations among variables. I look forward to delving into these insights with you in the next installment. See you there!