Problem:
Predict the price of a house from its features. A buyer or seller who does not know the exact value of a house can use supervised machine learning regression algorithms to estimate its price simply by providing the features of the target house.
Dataset:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. This dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
Source: The Ames Housing dataset was compiled by Dean De Cock.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Import data set
df_raw = pd.read_csv('project_data/house_price_train.csv')
df_raw.head(10)
# Check the description
df_raw.describe(include = 'all')
# Check the shape
df_raw.shape
# Check more info
df_raw.info()
# Rearrange columns so that SalePrice comes first
df_raw = df_raw[['SalePrice','Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive',
'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature',
'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']]
# Show correlations between the features using a heatmap
# ('heatmap' here is a third-party correlation-plot helper, not part of seaborn)
from heatmap import corrplot
plt.figure(figsize=(12, 8))
corrplot(df_raw.corr(), size_scale=300)
# In which year and month were the properties sold?
df_raw.groupby(['YrSold', 'MoSold']).Id.count().plot(kind = 'bar', figsize = (14,4))
plt.title('When were the properties sold?')
plt.show()
# Where are the properties located?
df_raw.groupby(['Neighborhood']).Id.count().sort_values().plot(kind = 'bar', figsize = (14,4))
plt.title('Where are most of the properties located?')
plt.show()
# Show SalePrice by OverallQual
plt.figure(figsize= (10, 5))
sns.barplot(x = df_raw['OverallQual'], y = df_raw['SalePrice'])
plt.show()
# Show regression plot of GrLivArea against SalePrice
plt.figure(figsize= (10, 5))
sns.regplot(x = df_raw['GrLivArea'], y = df_raw['SalePrice'], scatter_kws= {'alpha': 0.3})
plt.show()
# Show SalePrice distribution
sns.distplot(df_raw['SalePrice'])
plt.show()
# Most machine learning algorithms work well with data that are normally distributed
# Transform the dependent variable (SalePrice) by taking its log
target = np.log(df_raw['SalePrice'])
sns.distplot(target)
plt.show()
# Remove the outliers (keep only observations at or below the 99th percentile of SalePrice)
df_raw = df_raw[df_raw['SalePrice'] <= df_raw['SalePrice'].quantile(0.99)]
# Check for outliers again after trimming
plt.figure(figsize= (10, 5))
sns.regplot(x = df_raw['GrLivArea'], y = df_raw['SalePrice'], scatter_kws= {'alpha': 0.3})
plt.show()
# Find numerical data
numeric_data = df_raw.select_dtypes(include = [np.number])
# Find categorical data
categorical_data = df_raw.select_dtypes(exclude = [np.number])
# Print the numbers of numerical and categorical
print('There are {0} numerical and {1} categorical features in the data'.format(numeric_data.shape[1], categorical_data.shape[1]))
# Check if there are duplicates
print('Dataset duplicate IDs: {}'.format(df_raw.duplicated('Id').sum()))
# Check for missing values
df_raw.isna().mean()
Columns in which less than 5% of the observations are present will be deleted, since they contribute no significant value to the result. Columns with no more than 5% missing values will be filled with random sample values.
Columns with more than 5% missing values will still be included: missing numerical values will be filled with the median, and an extra indicator column will be added to flag where the value was missing, while missing categorical values will be filled with a "Missing" label, because the fact that a value is missing may itself carry information. A small sketch after the list below shows how these groups can be derived from the missing-value percentages.
List of columns that contain missing values:
Columns that will be deleted:
- 6 Alley 91 non-null object
- 72 PoolQC 7 non-null object
- 74 MiscFeature 54 non-null object
Columns that will be filled with random sample data:
- 25 MasVnrType 1452 non-null object
- 26 MasVnrArea 1452 non-null float64
- 30 BsmtQual 1423 non-null object
- 31 BsmtCond 1423 non-null object
- 32 BsmtExposure 1422 non-null object
- 33 BsmtFinType1 1423 non-null object
- 35 BsmtFinType2 1422 non-null object
- 42 Electrical 1459 non-null object
- 59 GarageYrBlt 1379 non-null float64
- 60 GarageFinish 1379 non-null object
- 63 GarageQual 1379 non-null object
- 64 GarageCond 1379 non-null object
- 58 GarageType 1379 non-null object
Columns that will be filled with median values, plus an added missing-indicator column:
- 3 LotFrontage 1201 non-null float64
Columns that will be filled "missing" data in the missing categorical data
- 57 FireplaceQu 770 non-null object
- 73 Fence 281 non-null object
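As a sanity check, the three groups above can also be derived programmatically from the share of missing values per column. This is only a sketch built on the 5% thresholds described earlier; the boundaries are approximate, so it may not reproduce the hand-picked lists exactly.
# Sketch: derive the imputation groups from the missing-value share per column
missing_share = df_raw.isna().mean()
# Columns where almost all observations are missing (non-null share below ~5%) -> drop
drop_cols = missing_share[missing_share > 0.95].index.tolist()
# Columns with a small share of missing values -> random-sample imputation
random_sample_cols = missing_share[(missing_share > 0) & (missing_share <= 0.05)].index.tolist()
# Remaining columns with larger gaps -> median + indicator (numerical) or "Missing" label (categorical)
large_gap_cols = [c for c in missing_share[missing_share > 0.05].index if c not in drop_cols]
print(drop_cols, random_sample_cols, large_gap_cols, sep='\n')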
# Import the feature-engine library for data engineering
# (this is the pre-1.0 feature_engine API; newer releases expose these classes under feature_engine.imputation)
from feature_engine import missing_data_imputers as mdi
# Create a copy of the raw data
data = df_raw.copy()
# Remove the columns in which less than 5% of the observations are present
data = data.drop(['Alley', 'PoolQC', 'MiscFeature'], axis = 1)
# Fill columns with a small share of missing values using random-sample imputation
crs_imputer = mdi.RandomSampleImputer(random_state=0, variables=['MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'])
crs_imputer.fit(data)
data_v1 = crs_imputer.transform(data)
# Add a missing-indicator column for the numerical feature with more than 5% missing values
cmav_imputer = mdi.AddMissingIndicator(variables = ['LotFrontage'])
cmav_imputer.fit(data_v1)
data_v2 = cmav_imputer.transform(data_v1)
# Fill the remaining missing numerical values with the median
median_imputer = mdi.MeanMedianImputer(imputation_method='median', variables=['LotFrontage'])
median_imputer.fit(data_v2)
data_v3 = median_imputer.transform(data_v2)
# Insert "Missing" in missing categorical data
cat_imputer = mdi.CategoricalVariableImputer(variables=['FireplaceQu', 'Fence'])
cat_imputer.fit(data_v3)
data_v4 = cat_imputer.transform(data_v3)
data_v4.head()
# Inspect the remaining categorical columns
data_v4.select_dtypes(exclude = [np.number])
# Apply OneHotEncoder
from feature_engine.categorical_encoders import OneHotCategoricalEncoder
ohe = OneHotCategoricalEncoder(top_categories=None)
ohe.fit(data_v4)
data_v5 = ohe.transform(data_v4)
data_v5.head()
# Define the independent and dependent variables
x = data_v5.drop(['SalePrice', 'Id'], axis = 1)
y = np.log(data_v5['SalePrice'])
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
# Create Machine Learning models
def models(X_train, y_train):
    # Fitting SVR to the dataset
    from sklearn.svm import SVR
    svr = SVR(kernel = 'rbf')
    svr.fit(X_train, y_train)
    # Fitting Decision Tree Regression to the dataset
    from sklearn.tree import DecisionTreeRegressor
    tree = DecisionTreeRegressor(random_state = 0)
    tree.fit(X_train, y_train)
    # Fitting Random Forest Regression to the dataset
    from sklearn.ensemble import RandomForestRegressor
    forest = RandomForestRegressor(n_estimators = 20, criterion = 'mse', random_state = 0)  # 'mse' was renamed to 'squared_error' in newer scikit-learn
    forest.fit(X_train, y_train)
    # Fitting XGBoost Regression to the dataset
    from xgboost import XGBRegressor
    xgb = XGBRegressor()
    xgb.fit(X_train, y_train)
    return svr, tree, forest, xgb
model = models(X_train, y_train)
# Check Mean Absolute Error
from sklearn.metrics import mean_absolute_error
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    mae = mean_absolute_error(y_test, y_predicted)
    print('model[{}] Mean Absolute Error: "{}"'.format(i, mae))
# Check Mean Squared Error
from sklearn.metrics import mean_squared_error
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    mse = mean_squared_error(y_test, y_predicted)
    print('model[{}] Mean Squared Error: "{}"'.format(i, mse))
# Check Root Mean Squared Error
from math import sqrt
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    rmse = sqrt(mean_squared_error(y_test, y_predicted))
    print('model[{}] Root Mean Squared Error: "{}"'.format(i, rmse))
# Check R2 Score
from sklearn.metrics import r2_score
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    r2 = r2_score(y_test, y_predicted)
    print('model[{}] R2 Score: "{}"'.format(i, r2))
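Because the models were trained on log(SalePrice), all of the errors above are expressed in log units. As a rough sketch, the predictions can be mapped back to dollars with np.exp before computing a metric, for example for the XGBoost model (model[3]):
# Convert log-scale predictions back to dollars before computing the error
y_pred_log = model[3].predict(X_test)
mae_dollars = mean_absolute_error(np.exp(y_test), np.exp(y_pred_log))
print('XGBoost Mean Absolute Error in dollars: {:.0f}'.format(mae_dollars))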
# Implement 10-fold cross-validation (cross_val_score uses the regressor's default R2 score)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model[3], X = X_train, y = y_train, cv = 10)
print('Mean R2 Score:', accuracies.mean())
print('Standard Deviation:', accuracies.std())
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(model[3].feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
# Visualize the importance
importances.head(10).plot.bar()
plt.show()
There are several factors that influence the price a buyer is willing to pay for a house. Some are apparent and obvious and some are not. Nevertheless, a rational approach facilitated by machine learning can be very useful in predicting the house price.
A large dataset with 79 different features (such as living area, number of rooms, and location), along with sale prices, is provided for residential homes in Ames, Iowa. Using the XGBoost model we obtained an R2 score of roughly 0.87; from the model we learn the relationship between the important features and the price, and we can use that relationship to predict the prices of a new set of houses.
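As a small illustration of that last step, the sketch below turns the held-out split back into dollar price predictions per house; it reuses the fitted XGBoost model (model[3]) and inverts the log transform with np.exp.
# Sketch: predicted vs. actual sale prices (in dollars) for the held-out houses
predicted_prices = pd.DataFrame({
    'Id': data_v5.loc[x_test.index, 'Id'].values,
    'PredictedSalePrice': np.exp(model[3].predict(X_test)),
    'ActualSalePrice': data_v5.loc[x_test.index, 'SalePrice'].values,
})
print(predicted_prices.head())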