Building a California Housing Price Prediction Model Using Gradient Boosting and Feature Selection: A Comprehensive Guide

In this project, we aim to build a robust machine learning model to predict house prices in California using the California Housing dataset. By utilizing a powerful algorithm like Gradient Boosting Regressor and applying advanced techniques such as Feature Selection and Hyperparameter Tuning, we enhance the model’s predictive performance. We will also focus on key preprocessing steps, such as log transformation to handle skewed target data, standardization to normalize feature values, and Recursive Feature Elimination (RFE) to select the most influential features. Through this step-by-step guide, you’ll learn how to optimize a predictive model and evaluate its performance using metrics like Mean Squared Error (MSE), R-squared, and Mean Absolute Error (MAE). This project will give you hands-on experience in predictive modeling and model optimization, helping you tackle real-world regression problems effectively.

1. Loading and Preparing the Dataset

1.1 Dataset Used

The California Housing dataset is loaded using the fetch_california_housing function.

First few rows output: The code prints the first few rows of the dataset to give an overview of the data. It includes features like MedInc (Median Income), HouseAge, and AveRooms (Average Rooms), plus the target variable Price (added from housing.target), which represents the median house value for each district.

The California Housing dataset is a well-known dataset often used for regression tasks in machine learning, particularly in predicting house prices based on various features. It was originally derived from the 1990 U.S. Census and contains data on housing values in various districts of California. The goal is to predict the median house value for California districts, using features such as median income, average rooms per household, and geographic location.

1.2 Key Features in the California Housing Dataset:

  1. MedInc: Median income of the district’s population.
  2. HouseAge: Median age of the houses in the district.
  3. AveRooms: Average number of rooms per household.
  4. AveBedrms: Average number of bedrooms per household.
  5. Population: Total population of the district.
  6. AveOccup: Average number of occupants per household.
  7. Latitude: Latitude of the district (geographical location).
  8. Longitude: Longitude of the district (geographical location).

1.3 Target Variable:

  • Price: Median house value for the district (in hundreds of thousands of dollars).

This dataset is widely used for machine learning tasks, particularly in demonstrating regression algorithms, model evaluation, and hyperparameter tuning. It contains 20,640 instances and is a good dataset for practicing real-world price prediction using features related to demographics, housing conditions, and geography.

1.4 Code Block:

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
print(df.head())

1.5 Output:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Price
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22  3.585
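As a quick optional check (a minimal sketch, assuming the df built above), you can confirm the dataset's size and get summary statistics before moving on:

print(df.shape)        # expected: (20640, 9), i.e. 8 features plus the added Price column
print(df.describe())   # per-column count, mean, std, min, max and quartiles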

2. Log-Transformation of Target Variable

2.1 Why log transformation?

The housing prices might have extreme values (outliers), so the log transformation helps normalize the data and reduce the impact of very high values. The np.log1p function applies log(1 + x), a natural-log transformation that also handles zero values safely.

Let’s break down log transformation in a simple way with a real-life example.

2.2 What is Log Transformation?

Think of log transformation as a tool to shrink large numbers to make them more manageable, without losing their relative importance. This is especially helpful when you’re dealing with data that has extremely large values that can mess up your model’s ability to learn.

2.3 Real-Life Example: House Prices

Imagine you’re a real estate agent and you have a list of house prices in your city. Most houses cost around ₹50 lakhs to ₹1 crore, but a few luxury homes cost ₹50 crores or more. These very high-priced homes are outliers—they’re much higher than the typical house prices. If you want to build a model to predict house prices, these extreme values can make it harder for the model to learn because it gets too focused on those high-priced homes.

2.4 Why is This a Problem?

When you have such huge differences in values, the model might not understand how to predict prices for “regular” houses (e.g., ₹50 lakhs) because it’s distracted by the few super-expensive homes (₹50 crores). In other words, the model might overestimate or underestimate prices for the majority of houses.

2.5 How Log Transformation Helps

Log transformation helps by compressing these large prices into a smaller, more reasonable range. This doesn’t change the relationships between prices—it just makes the numbers easier for the model to handle.

For example:

  • Without log transformation, house prices might look like this:
    • ₹50 lakhs, ₹1 crore, ₹2 crores, ₹50 crores
  • After log transformation, the prices might look more like this:
    • 3.93, 4.62, 5.30, 8.52 (the np.log1p values of the prices expressed in lakhs)

Now, the extreme value (₹50 crores) is closer to the rest of the values, making it easier for the model to focus on the general patterns of regular homes, without being skewed by the very high prices.

2.6 Returning to Normal Prices

Once the model makes predictions, you can reverse the log transformation to get the actual house prices. This way, you get predictions in the original scale of values (e.g., ₹1 crore instead of 4.61).
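As a minimal sketch with made-up prices in lakhs (not the California data), np.log1p compresses the values and np.expm1 brings them back to the original scale:

import numpy as np

# Hypothetical prices in lakhs: 50 lakhs, 1 crore, 2 crores, 50 crores
prices = np.array([50.0, 100.0, 200.0, 5000.0])

log_prices = np.log1p(prices)     # compressed scale used for modelling
recovered = np.expm1(log_prices)  # back to the original prices

print(log_prices)  # roughly [3.93 4.62 5.30 8.52]
print(recovered)   # [  50.  100.  200. 5000.]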

2.7 Code Block:

df['Price'] = np.log1p(df['Price'])
print(df.head())

2.8 Output after transformation:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude     Price
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  1.709464

3. Data Standardization

Why standardization?: Features like MedInc and AveRooms are on different scales. Standardizing ensures that all features are centered around 0 with unit variance, which helps improve model performance.

3.1 Why Standardization?

In machine learning, standardization is an important step when you’re working with features that are on different scales. It involves transforming the data so that all features have the same scale, typically with a mean of 0 and a standard deviation of 1. This is especially useful when some features have very large values compared to others.

3.2 Real-Life Example

Imagine you’re trying to predict house prices using two features:

  • Median Income (MedInc): This could range from ₹30,000 to ₹1,00,000 per month.
  • Average Rooms per Household (AveRooms): This might range from 1 to 10 rooms.

In this case:

  • MedInc has values in the tens of thousands (₹), while AveRooms only ranges from 1 to 10.

3.3 Why is This a Problem?

Without standardization, the machine learning algorithm might:

  1. Favor features with larger numbers: For scale-sensitive models, MedInc's larger values can receive more weight just because the numbers are bigger, even if AveRooms is equally or more important for predicting house prices.
  2. Struggle to learn efficiently: Algorithms such as Logistic Regression, SVMs, KNN, and Neural Networks assume that all features are on a similar scale; if one feature has much larger values, they have trouble weighing the features correctly. (Tree-based models like Gradient Boosting are largely insensitive to feature scale, but standardizing is harmless and keeps the pipeline reusable with scale-sensitive models.)

3.4 How Standardization Helps

When you standardize the data:

  1. Each feature gets transformed to have a mean of 0 and a standard deviation of 1. This means the values of all features are now on a similar scale.
  2. The formula for standardization is z = (x − μ) / σ, where μ is the feature’s mean and σ is its standard deviation.
  3. This centers the data around 0 (subtracting the mean) and adjusts for the spread (dividing by the standard deviation), as the sketch after this list shows.
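A small illustrative sketch (with made-up numbers, not the housing data) shows that applying the formula by hand matches what StandardScaler does:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column, e.g. average rooms per household
x = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])

# Manual standardization: z = (x - mean) / std
z_manual = (x - x.mean()) / x.std()

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())   # [-1.41 -0.71  0.    0.71  1.41]
print(z_sklearn.ravel())  # identical values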

3.5 Example of Standardization

Before standardization:

  • MedInc might range from ₹30,000 to ₹1,00,000.
  • AveRooms might range from 1 to 10.

After standardization:

  • MedInc might now range from -2 to 2.
  • AveRooms might now range from -1.5 to 1.5.

Both features are now on a similar scale, and the model can focus on the relationships between the features and the target variable (price), rather than the magnitude of their numbers.

3.6 Key Benefits:

  1. Balanced feature importance: The model won’t favor features with larger values just because they have bigger numbers.
  2. Improves model performance: Algorithms that rely on distance measures or gradient-based optimization, like KNN, SVMs, and Neural Networks, work much better when features are on the same scale. It helps them converge faster and give better results.
  3. Prevents bias in learning: The algorithm learns the true relationships between features and the target variable, not just based on the magnitude of the numbers.

3.7 Code Block:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop('Price', axis=1))

4. Data Splitting

  • Why split the data?: The data is split into training and testing sets. We use 80% of the data for training and 20% for testing to evaluate the model performance.

4.1 Code Block:

X_train, X_test, y_train, y_test = train_test_split(X_scaled, df['Price'], test_size=0.2, random_state=42)
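As a quick sanity check (assuming X_scaled and df from the earlier steps), you can verify the 80/20 split; with 20,640 rows this should give roughly 16,512 training and 4,128 test samples:

print("Train:", X_train.shape, y_train.shape)  # expected: (16512, 8) (16512,)
print("Test: ", X_test.shape, y_test.shape)    # expected: (4128, 8) (4128,)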

5. Feature Selection using Recursive Feature Elimination (RFE)

What is RFE?: RFE selects the top 5 features that are most important for the model.

Recursive Feature Elimination (RFE) is a method used to automatically select the most important features from a dataset for building a machine learning model. It helps in reducing the number of features, making the model simpler and often more accurate.

5.1 Why Do We Need Feature Selection?

When you have a lot of features (data columns), not all of them might be important for predicting the target (like house price). Some features might be irrelevant or redundant, which can:

  • Slow down the model.
  • Cause overfitting, where the model performs well on training data but poorly on new data.
  • Reduce model accuracy by introducing noise.

To prevent this, feature selection helps you choose only the most relevant features.

5.2 What is RFE?

Recursive Feature Elimination (RFE) is one of the most popular methods to select important features. Here’s how it works in simple steps:

  1. Train the model: RFE starts by training the model using all features.
  2. Rank features: It ranks the features based on how important they are for the prediction.
  3. Remove the least important feature: The least important feature is removed.
  4. Repeat the process: The model is retrained with one less feature, and this process is repeated until the desired number of features is left.

At the end, you’re left with the top N features that are the most important for the model.

5.3 Real-Life Example

Let’s say you’re predicting house prices using several features like:

  • Income (MedInc)
  • Average number of rooms (AveRooms)
  • Latitude
  • Longitude
  • Population

Not all of these features may be equally important. RFE helps to eliminate the least important ones. For instance, maybe Population doesn’t have much effect on house prices, so RFE might remove it.

5.4 How RFE Works Step by Step:

  1. Start with all features: RFE trains the model using all the features and checks how much each feature contributes to the predictions.
  2. Remove the weakest feature: It removes the feature that has the least impact on the predictions.
  3. Repeat the process: The process is repeated, removing one feature at a time and retraining the model, until you’re left with the desired number of most important features.
  4. Final set of features: You end up with a smaller set of the most important features for making predictions.
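As a hedged sketch of what this elimination loop produces (assuming the df, X_train, and y_train created in the earlier steps), you can inspect which features survived and how the others ranked via the fitted selector's support_ and ranking_ attributes:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

rfe = RFE(GradientBoostingRegressor(), n_features_to_select=5)
rfe.fit(X_train, y_train)

feature_names = df.drop('Price', axis=1).columns
print(feature_names[rfe.support_])             # the 5 features that were kept
print(dict(zip(feature_names, rfe.ranking_)))  # 1 = kept; higher numbers were eliminated earlier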

5.5 In this example:

  • GradientBoostingRegressor is the model used to rank the features.
  • RFE is set to select the top 5 features.
  • The result will show the most important 5 features based on the model.

5.6 Code Block:

model = GradientBoostingRegressor()

selector = RFE(model, n_features_to_select=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

selected_features = df.drop('Price', axis=1).columns[selector.support_]
print("Selected features:", selected_features)

5.7 Output:

Selected features: Index(['MedInc', 'AveRooms', 'AveOccup', 'Latitude', 'Longitude'], dtype='object')

6. Hyperparameter Tuning using GridSearchCV

6.1 Why GridSearchCV?

It finds the best hyperparameters (n_estimators, learning_rate, max_depth) for the GradientBoostingRegressor model.

6.2 What are Hyperparameters?

Hyperparameters are settings that you specify before training the model. Unlike model parameters (which are learned from the data), hyperparameters are not learned and need to be set manually. Examples include:

  • Learning rate: Controls how quickly the model adapts to new data.
  • Number of estimators: The number of trees in a random forest or boosting model.
  • Max depth: Maximum depth of trees in decision trees or boosting models.

Choosing the right combination of hyperparameters can make the difference between an average model and a highly accurate one.
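As a hedged aside, scikit-learn estimators expose their current hyperparameter settings through get_params(), which is an easy way to see what you would otherwise be accepting by default before tuning:

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()
params = gbr.get_params()
print(params['n_estimators'], params['learning_rate'], params['max_depth'])
# defaults in recent scikit-learn versions: 100 0.1 3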

6.3 What is GridSearchCV?

GridSearchCV is a tool that helps automate hyperparameter tuning. It searches through different combinations of hyperparameters and finds the set that gives the best model performance based on a scoring metric (e.g., accuracy, mean squared error).

6.4 How GridSearchCV Works:

  1. Define a range of hyperparameters: You specify the hyperparameters you want to tune and the possible values for each.
  2. Try all combinations: GridSearchCV tries every possible combination of these hyperparameters.
  3. Train and evaluate the model: For each combination, the model is trained and evaluated using cross-validation. This means it splits the training data into multiple parts and trains the model multiple times to ensure the results are robust.
  4. Select the best combination: After trying all combinations, GridSearchCV tells you which set of hyperparameters worked best, based on the chosen metric (e.g., accuracy, MSE).

6.5 Real-Life Example

Let’s say you’re using Gradient Boosting Regressor to predict house prices, and you want to tune three hyperparameters:

  • n_estimators: Number of boosting stages (like the number of trees).
  • learning_rate: How fast the model learns.
  • max_depth: Maximum depth of each tree.

You’re not sure which values will give the best results, so you try a range of values for each hyperparameter:

  • n_estimators: [100, 200, 300]
  • learning_rate: [0.01, 0.05, 0.1]
  • max_depth: [3, 4, 5]

Instead of trying each combination manually, GridSearchCV automates this for you. With 3 values for each of the 3 hyperparameters, that is 3 × 3 × 3 = 27 combinations; with 5-fold cross-validation, the model is fit 135 times in total.

6.6 Step-by-Step Breakdown:

  1. Initialize the model: We are using GradientBoostingRegressor.
  2. Define the grid: We create a param_grid with all the possible values for n_estimators, learning_rate, and max_depth.
  3. GridSearchCV setup: We set up GridSearchCV to perform 5-fold cross-validation (cv=5). This means the training data is split into 5 parts, and each hyperparameter combination is trained and validated 5 times to get a reliable estimate of its performance.
  4. Fit the model: GridSearchCV tries all combinations of hyperparameters and trains the model on the training data.
  5. Best hyperparameters: After evaluating all combinations, GridSearchCV provides the best set of hyperparameters.

6.7 Why Use GridSearchCV?

  1. Saves time: Instead of manually trying every combination, GridSearchCV automates this process.
  2. Cross-validation ensures robustness: GridSearchCV uses cross-validation to ensure the selected hyperparameters generalize well to new data, preventing overfitting.
  3. Finds the optimal set: By testing multiple combinations, it helps find the best set of hyperparameters that maximize model performance.

6.8 When to Use GridSearchCV:

  • When you’re unsure which hyperparameters to choose.
  • When you want to improve your model’s accuracy by fine-tuning its performance.

6.9 Code Block:

param_grid = {'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.05, 0.1], 'max_depth': [3, 4, 5]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_selected, y_train)
print(grid_search.best_params_)

6.10 Output:

Best Parameters from GridSearchCV: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
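Beyond best_params_, the fitted grid_search object records the score of every combination it tried. A short sketch of how you might inspect them (assuming the grid_search fitted above; scores are negative MSE, so values closer to 0 are better):

import pandas as pd

# cv_results_ is a dict of arrays with one entry per hyperparameter combination
results = pd.DataFrame(grid_search.cv_results_)

cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())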

7. Model Training and Predictions

Training: The model is trained with the best hyperparameters found through GridSearchCV. (By default, GridSearchCV already refits the best estimator on the full training set, so the explicit fit call below is technically redundant, but it makes the training step explicit.)

7.1 Code Block:

best_model = grid_search.best_estimator_
best_model.fit(X_train_selected, y_train)
y_pred = best_model.predict(X_test_selected)

8. Performance Evaluation (MSE, R-squared, MAE)

Once you’ve built and trained your machine learning model, it’s important to evaluate its performance. For regression models (where you’re predicting a continuous value like house prices), there are a few key metrics that help you understand how well your model is doing. Three of the most commonly used metrics are:

  1. Mean Squared Error (MSE)
  2. R-squared (R²)
  3. Mean Absolute Error (MAE)

Let’s break them down in simple terms:

8.1 Mean Squared Error (MSE)

MSE tells you how far off your model’s predictions are from the actual values, on average. It’s called squared error because it squares the difference between predicted and actual values. This makes sure that large errors (big differences between actual and predicted values) are given more weight than smaller errors.

8.1.1 How MSE Works:
  • For each prediction, you calculate the difference between the actual value and the predicted value, square that difference, and then average all these squared differences.
8.1.2 Example in Real Life:

If your model is predicting house prices, and one house was actually ₹50 lakhs, but your model predicted ₹60 lakhs, the squared error for that house would be (60 – 50)² = 100. Larger errors, like predicting ₹1 crore when the actual price was ₹50 lakhs, result in a much higher squared error.

  • Lower MSE is better: It means your model’s predictions are closer to the actual values.

8.2 R-squared (R²)

R-squared measures how well your model’s predictions match the actual data. It tells you what percentage of the variation in the target variable (e.g., house prices) can be explained by your model.

8.2.1 Key Points:
  • R² is at most 1, and for reasonable models it usually falls between 0 and 1:
    • 1 means your model explains all of the variation in the data.
    • 0 means your model explains none of it (no better than always predicting the average).
    • R² can even be negative if the model does worse than simply predicting the average.
8.2.2 Example in Real Life:

If your R² is 0.80, that means your model explains 80% of the variance in house prices, and the remaining 20% is due to other factors or noise that your model couldn’t capture.

  • Higher R² is better: It indicates your model is doing a good job of predicting the target.

8.3 Mean Absolute Error (MAE)

MAE is similar to MSE, but instead of squaring the errors, it takes the absolute difference between the predicted and actual values. This metric gives you the average error in the units of your target variable (e.g., house prices in ₹).

8.3.1 How MAE Works:
  • For each prediction, you calculate the absolute difference between the actual value and the predicted value, then average all the differences.
8.3.2 Example in Real Life:

If your model predicted ₹60 lakhs but the actual price was ₹50 lakhs, the absolute error is |60 – 50| = 10 lakhs. MAE gives you an easy-to-understand number that shows the average error in your predictions.

  • Lower MAE is better: It means, on average, your model is predicting closer to the actual values.

8.4 When to Use MSE, R-squared, or MAE?

  • MSE is useful if you want to penalize larger errors more than smaller ones because squaring the errors gives more weight to larger mistakes.
  • R-squared is great if you want to know how well your model explains the overall variation in the data. It’s good for evaluating the goodness-of-fit of the model.
  • MAE is simpler and gives you the average size of the error in the same unit as the target variable, so it’s easier to interpret when you want to know how far off your predictions are on average.
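As an illustrative sketch with made-up numbers (not the model's actual predictions), all three metrics can be computed by hand with NumPy and cross-checked against scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Hypothetical actual vs. predicted prices (in lakhs)
y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_hat = np.array([60.0, 75.0, 130.0, 180.0])

mse = np.mean((y_true - y_hat) ** 2)   # average squared error
mae = np.mean(np.abs(y_true - y_hat))  # average absolute error
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_hat))   # same value
print(mae, mean_absolute_error(y_true, y_hat))  # same value
print(r2, r2_score(y_true, y_hat))              # same value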

8.5 Code Block:

# Undo the log1p transform so the metrics are on the original price scale
y_pred_exp = np.expm1(y_pred)
y_test_exp = np.expm1(y_test)

mse_actual = mean_squared_error(y_test_exp, y_pred_exp)
r2 = r2_score(y_test_exp, y_pred_exp)
mae = mean_absolute_error(y_test_exp, y_pred_exp)
print("Mean Squared Error (MSE) on original price scale:", mse_actual)
print("R-squared:", r2)
print("Mean Absolute Error (MAE):", mae)

8.6 Output:

Mean Squared Error (MSE) on original price scale: 0.2263
R-squared: 0.8273
Mean Absolute Error (MAE): 0.3087

9. Example Predictions

Example output: The code prints a few predictions alongside actual values for interpretability.

9.1 Code Block:

for i in range(5):
    print(f"Predicted Price: {y_pred_exp[i]:.2f}, Actual Price: {y_test_exp.iloc[i]:.2f}")

9.2 Output:

Predicted Price: 0.52, Actual Price: 0.48
Predicted Price: 0.93, Actual Price: 0.46

10. Handling Outliers

Outliers: The Interquartile Range (IQR) method is used to detect and optionally remove outliers.

Outliers are data points that are significantly different from the rest of your data. They might be extremely high or extremely low values compared to the typical data points, and they can impact your model’s performance by skewing results, reducing accuracy, and leading to overfitting.

10.1 Code Block:

Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = 1.5 * IQR
outliers = df[(df['Price'] < (Q1 - outlier_threshold)) | (df['Price'] > (Q3 + outlier_threshold))]
print("Number of detected outliers:", len(outliers))

10.2 Output:

Number of detected outliers: 0

Complete code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Log-transform the target variable (Price) to reduce the effect of large values
df['Price'] = np.log1p(df['Price'])

# Display the data after log transformation
print("\nData after log-transforming the target (Price):")
print(df.head())

# Feature Engineering: Standardize features (except the target variable)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop('Price', axis=1))

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, df['Price'], test_size=0.2, random_state=42)

# Initialize the Gradient Boosting Regressor model
model = GradientBoostingRegressor()

# Feature Selection using Recursive Feature Elimination (RFE)
selector = RFE(model, n_features_to_select=5)  # Select top 5 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the selected features
selected_features = df.drop('Price', axis=1).columns[selector.support_]
print("\nSelected features:", selected_features)

# Hyperparameter tuning with GridSearchCV for Gradient Boosting Regressor
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train_selected, y_train)

# Get the best model from grid search
best_model = grid_search.best_estimator_
print("\nBest Parameters from GridSearchCV:", grid_search.best_params_)

# Train the best model on the selected features
best_model.fit(X_train_selected, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test_selected)

# Reverse the log1p transform and compute MSE on the original price scale
mse_original = mean_squared_error(np.expm1(y_test), np.expm1(y_pred))
print("\nMean Squared Error (MSE) on original price scale:", mse_original)

# Cross-validation to check for model robustness
mse_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5, scoring='neg_mean_squared_error')
mse_mean = -mse_scores.mean()
print("\nCross-validated MSE on log-transformed data:", mse_mean)

# Transform predictions back from log scale to original price scale for better interpretability
y_pred_exp = np.expm1(y_pred)  # Undo log1p (expm1 is the reverse of log1p)
y_test_exp = np.expm1(y_test)

# Calculate Mean Squared Error (MSE) on original scale
mse_actual = mean_squared_error(y_test_exp, y_pred_exp)
print("\nActual Mean Squared Error (MSE) on original price scale:", mse_actual)

# Calculate R-squared
r2 = r2_score(y_test_exp, y_pred_exp)
print("R-squared:", r2)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test_exp, y_pred_exp)
print("Mean Absolute Error (MAE):", mae)

# Print a few example predictions
print("\nExample predictions:")
for i in range(5):
    print(f"Predicted Price: {y_pred_exp[i]:.2f}, Actual Price: {y_test_exp.iloc[i]:.2f}")

# Detecting and Handling Outliers (optional)
# Define a threshold for outlier detection (e.g., based on interquartile range (IQR))
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = 1.5 * IQR

# Filter out the outliers
outliers = df[(df['Price'] < (Q1 - outlier_threshold)) | (df['Price'] > (Q3 + outlier_threshold))]
print(f"\nNumber of detected outliers: {len(outliers)}")

# Optionally, you could remove these outliers from your training set
df_filtered = df[~((df['Price'] < (Q1 - outlier_threshold)) | (df['Price'] > (Q3 + outlier_threshold)))]

# After removing outliers, you could re-train the model
# ... (Repeat scaling, feature selection, and model training steps with the filtered data)