In the second part, we delved into the intricacies of data processing necessary after scraping. Our discussion covered the treatment of numerical data, handling of categorical variables, and management of boolean values. Furthermore, we evaluated the data quality by scrutinizing the error log produced by the Immowebscraper class. In the upcoming Part 3, our focus will shift to getting a fundamental overview of our data by characterizing the cleaned scraped data. Additionally, we aim to assess feature cardinality, scrutinize distributions, and explore potential correlations between features and our target variable—property price.
Our initial step involves reading the cleaned-up dataset that was initially scraped. This dataset is stored in the interim folder following the cleaning process, and we will access it from there.
Now, let’s assess the feature cardinality of our dataset to differentiate between categorical and numerical data. Initially, we will analyze the percentage of unique values per feature, followed by displaying the absolute number of unique values per feature.
Code
# Assuming df is your DataFramenumber_unique_entries = {"column_name": df.columns.tolist(),"column_dtype": [df[col].dtype for col in df.columns],"unique_values_pct": [df[col].nunique() for col in df.columns],}( pd.DataFrame(number_unique_entries) .sort_values("unique_values_pct") .assign( unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0]) .mul(100) .round(1) ) .pipe(lambda df: ggplot(df, aes("unique_values_pct", "column_name"))+ geom_bar(stat="identity", orientation="y")+ labs( title="Assessing Feature Cardinality", subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be Utilized as Categorical Variables, while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are """, x="Percentage of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 1000) ))
Code
( pd.DataFrame(number_unique_entries) .sort_values("unique_values_pct") .pipe(lambda df: ggplot(df, aes("unique_values_pct", "column_name"))+ geom_bar(stat="identity", orientation="y")+ labs( title="Assessing Feature Cardinality", subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be Utilized as Categorical Variables, while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are """, x="Number of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 1000) ))
Looking at distributions
Distribution of features
Next, our focus will be on identifying low and high cardinality features. Subsequently, we will investigate how house prices vary when grouped according to these variables, using boxplots. Please take into account that the price values have undergone log transformation to address skewness.
Upon examining the distribution of our target variable, which is the price, it becomes evident that there is a notable skew. Our median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To increase the accuracy of our predictions, it is worth considering a transformation of our target variable before proceeding with modeling. This transformation serves several beneficial purposes:
Normalization: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.
Equalizing Variance: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.
Mitigating Outliers: It is effective at diminishing the impact of extreme outliers, bolstering the model’s robustness against data anomalies.
Interpretability: Notably, when interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale. This can be achieved using a base 10 exponentiation, ensuring that predictions are easily interpretable in their origination task.
Next, let’s explore how median house prices vary across different locations. As indicated by the map below, Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort. This aligns with expectations, as per Wikipedia: “Watermael-Boitsfort is one of Brussels’ most affluent municipalities, with an average per capita income of €30,100 in 2002, exceeding the regional average for the Brussels-Capital Region by over €600.” On the other hand, Knokke-Heist is renowned as “It is Belgium’s best-known and most affluent seaside resort.” There you have it.
Code
grouped_city = ( df.groupby("city") .price.median() .to_frame() .reset_index() .query("~city.isin(['Blégny', 'Woluwe-St.-Lambert', 'Wolvertem'])"))boundaries = ( geocode_cities(grouped_city.city) .scope("Belgium") .inc_res(4) .get_boundaries(resolution=6))( ggplot()+ geom_livemap()+ geom_polygon( aes(color="price", fill="price"), data=grouped_city,map=boundaries, color="#dbd6d6", alpha=0.9, size=0.5, map_join=[["city"], ["city"]], tooltips=layer_tooltips(["city", "price"]), show_legend=False )+ labs("Median House Prices in Belgium Cities", subtitle="Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort.", caption="https://www.immoweb.be/", )+ theme( axis_title="blank", axis_text="blank", axis_ticks="blank", axis_line="blank", plot_subtitle=element_text(size=12, face="italic"), plot_title=element_text(size=15, face="bold"), )+ scale_fill_gradient(low="green", high="red")+ labs(fill="Median Price")+ ggsize(1000, 800))
Correlations
Finally, we will delve into the correlations among variables with high cardinality through Spearman correlation analysis, with a particular focus on the price as the target variable. As evident from the heatmap, the price exhibits a strong correlation with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59). For your reference, cadastral income is an annual Flemish property tax based on the assessed rental income of immovable properties in the Flemish Region. This income is a notional rental income assigned to each property, whether it is rented out or not.
Code
( df.loc[:, lambda df: df.columns.isin(high_cardinality_features)] .corr(method="spearman") .pipe(lambda df: corr_plot(df) .tiles("lower", ) .labels(type="lower", map_size=False) .palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027") .build()+ ggsize(900, 900)+ labs( title="Spearman Correlations Among High Cardinality Features", subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59) """, x="Number of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance ) ))
In part 3, we’ve performed some initial data exploration, and our next stride entails constructing a foundational machine learning model. In part 4, we’ll compare various algorithms to establish a benchmark for our subsequent endeavors in model building and feature engineering. This baseline model will serve as a reference point, guiding us as we strive to enhance our predictive capabilities. Looking forward to meeting you again in Part 4!
Source Code
---title: 'Predicting Belgian Real Estate Prices: Part 3: Characterizing Belgian Real Estate Data Post-Scraping'author: Adam Cseresznyedate: '2023-11-05'categories: - Predicting Belgian Real Estate Pricesjupyter: python3toc: trueformat: html: code-fold: true code-tools: true---![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}In the second part, we delved into the intricacies of data processing necessary after scraping. Our discussion covered the treatment of numerical data, handling of categorical variables, and management of boolean values. Furthermore, we evaluated the data quality by scrutinizing the error log produced by the `Immowebscraper` class. In the upcoming Part 3, our focus will shift to getting a fundamental overview of our data by characterizing the cleaned scraped data. Additionally, we aim to assess feature cardinality, scrutinize distributions, and explore potential correlations between features and our target variable—property price.::: {.callout-note}You can access the project's app through its [Streamlit website](https://belgian-house-price-predictor.streamlit.app/).:::# Import data```{python}#| editable: true#| slideshow: {slide_type: ''}#| tags: []import timefrom pathlib import Pathimport credsimport numpy as npimport pandas as pdfrom data import pre_process, utilsfrom lets_plot import*from lets_plot.bistro.corr import*from lets_plot.geo_data import*from lets_plot.mapping import as_discretefrom tqdm import tqdmLetsPlot.setup_html(offline=True)```# Data InspectionOur initial step involves reading the cleaned-up dataset that was initially scraped. This dataset is stored in the interim folder following the cleaning process, and we will access it from there.```{python}df = pd.read_parquet( utils.Configuration.INTERIM_DATA_PATH.joinpath("2023-10-01_Processed_dataset_for_NB_use.parquet.gzip" ))df.head().style.set_sticky(axis=0)```To examine the data types of the columns, we will utilize the `info()` method.```{python}df.info()```# Assessing Feature CardinalityNow, let's assess the feature cardinality of our dataset to differentiate between categorical and numerical data. Initially, we will analyze the percentage of unique values per feature, followed by displaying the absolute number of unique values per feature.```{python}#| fig-cap: 'Assessing Feature Cardinality: Percentage of Unique Values per Feature'#| label: fig-fig1# Assuming df is your DataFramenumber_unique_entries = {"column_name": df.columns.tolist(),"column_dtype": [df[col].dtype for col in df.columns],"unique_values_pct": [df[col].nunique() for col in df.columns],}( pd.DataFrame(number_unique_entries) .sort_values("unique_values_pct") .assign( unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0]) .mul(100) .round(1) ) .pipe(lambda df: ggplot(df, aes("unique_values_pct", "column_name"))+ geom_bar(stat="identity", orientation="y")+ labs( title="Assessing Feature Cardinality", subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be Utilized as Categorical Variables, while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are """, x="Percentage of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 1000) ))``````{python}#| fig-cap: 'Assessing Feature Cardinality: Number of Unique Values per Feature'#| label: fig-fig2( pd.DataFrame(number_unique_entries) .sort_values("unique_values_pct") .pipe(lambda df: ggplot(df, aes("unique_values_pct", "column_name"))+ geom_bar(stat="identity", orientation="y")+ labs( title="Assessing Feature Cardinality", subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be Utilized as Categorical Variables, while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are """, x="Number of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance )+ ggsize(800, 1000) ))```# Looking at distributions## Distribution of featuresNext, our focus will be on identifying low and high cardinality features. Subsequently, we will investigate how house prices vary when grouped according to these variables, using boxplots. Please take into account that the price values have undergone log transformation to address skewness.```{python}low_cardinality_features = ( pd.DataFrame(number_unique_entries) .query("unique_values_pct <= 5") .column_name.to_list())``````{python}high_cardinality_features = ( pd.DataFrame(number_unique_entries) .query("(unique_values_pct >= 5)") .loc[lambda df: (df.column_dtype =="float32") | (df.column_dtype =="float64"), :] .column_name.to_list())``````{python}#| fig-cap: Exploring Price Variations Across Different Variables#| label: fig-fig3plots = []for idx, feature inenumerate(low_cardinality_features): plot = ( df.melt(id_vars=["ad_url", "price"]) .loc[lambda df: df.variable == feature, :] .assign(price=lambda df: np.log10(df.price)) .pipe(lambda df: ggplot( df, aes(as_discrete("value"), "price"), )+ facet_wrap("variable")+ geom_boxplot( show_legend=False, ) ) ) plots.append(plot)gggrid(plots, ncol=4) + ggsize(900, 1600)```## Distribution of targetUpon examining the distribution of our target variable, which is the price, it becomes evident that there is a notable skew. Our median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To increase the accuracy of our predictions, it is worth considering a transformation of our target variable before proceeding with modeling. This transformation serves several beneficial purposes:1. **Normalization**: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.2. **Equalizing Variance**: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.3. **Mitigating Outliers**: It is effective at diminishing the impact of extreme outliers, bolstering the model's robustness against data anomalies.4. **Interpretability**: Notably, when interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale. This can be achieved using a base 10 exponentiation, ensuring that predictions are easily interpretable in their origination task.```{python}#| fig-cap: Target distribution before and after log10 transformation#| label: fig-fig4before_transformation = df.pipe(lambda df: ggplot(df, aes("price")) + geom_histogram()) + labs( title="Before Transformation",)after_transformation = df.assign(price=lambda df: np.log10(df.price)).pipe(lambda df: ggplot(df, aes("price"))+ geom_histogram()+ labs( title="After log10 Transformation", ))gggrid([before_transformation, after_transformation], ncol=2) + ggsize(800, 500)```# GeomappingNext, let's explore how median house prices vary across different locations. As indicated by the map below, Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort. This aligns with expectations, as per Wikipedia: _"Watermael-Boitsfort is one of Brussels' most affluent municipalities, with an average per capita income of €30,100 in 2002, exceeding the regional average for the Brussels-Capital Region by over €600."_ On the other hand, Knokke-Heist is renowned as "It is Belgium’s best-known and most affluent seaside resort." There you have it.```{python}grouped_city = ( df.groupby("city") .price.median() .to_frame() .reset_index() .query("~city.isin(['Blégny', 'Woluwe-St.-Lambert', 'Wolvertem'])"))boundaries = ( geocode_cities(grouped_city.city) .scope("Belgium") .inc_res(4) .get_boundaries(resolution=6))( ggplot()+ geom_livemap()+ geom_polygon( aes(color="price", fill="price"), data=grouped_city,map=boundaries, color="#dbd6d6", alpha=0.9, size=0.5, map_join=[["city"], ["city"]], tooltips=layer_tooltips(["city", "price"]), show_legend=False )+ labs("Median House Prices in Belgium Cities", subtitle="Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort.", caption="https://www.immoweb.be/", )+ theme( axis_title="blank", axis_text="blank", axis_ticks="blank", axis_line="blank", plot_subtitle=element_text(size=12, face="italic"), plot_title=element_text(size=15, face="bold"), )+ scale_fill_gradient(low="green", high="red")+ labs(fill="Median Price")+ ggsize(1000, 800))```# CorrelationsFinally, we will delve into the correlations among variables with high cardinality through Spearman correlation analysis, with a particular focus on the price as the target variable. As evident from the heatmap, the price exhibits a strong correlation with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59). For your reference, _cadastral income_ is an annual Flemish property tax based on the assessed rental income of immovable properties in the Flemish Region. This income is a notional rental income assigned to each property, whether it is rented out or not.```{python}#| fig-cap: Spearman Correlations Among High Cardinality Features#| label: fig-fig6( df.loc[:, lambda df: df.columns.isin(high_cardinality_features)] .corr(method="spearman") .pipe(lambda df: corr_plot(df) .tiles("lower", ) .labels(type="lower", map_size=False) .palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027") .build()+ ggsize(900, 900)+ labs( title="Spearman Correlations Among High Cardinality Features", subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59) """, x="Number of Unique Values per Feature", y="", caption="https://www.immoweb.be/", )+ theme( plot_subtitle=element_text( size=12, face="italic" ), # Customize subtitle appearance plot_title=element_text(size=15, face="bold"), # Customize title appearance ) ))```In part 3, we've performed some initial data exploration, and our next stride entails constructing a foundational machine learning model. In part 4, we'll compare various algorithms to establish a benchmark for our subsequent endeavors in model building and feature engineering. This baseline model will serve as a reference point, guiding us as we strive to enhance our predictive capabilities. Looking forward to meeting you again in Part 4!