Case Study: House Prices Prediction


Introduction

Predict the price of a house from its features. If you are buying or selling a house and do not know its fair price, supervised machine learning regression algorithms can help you estimate it simply from the features of the target house.

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. This dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

Problem:

  • Predict the final price of each home

Dataset:

  • SalePrice - the property's sale price in dollars.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

Source: The Ames Housing dataset was compiled by Dean De Cock.

Libraries and Dataset Importation

In [35]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

Dataset Exploration

In [36]:
# Import data set
df_raw = pd.read_csv('project_data/house_price_train.csv')
df_raw.head(10)
Out[36]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Normal 118000

10 rows × 81 columns

In [37]:
# Check the description
df_raw.describe(include = 'all')
Out[37]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
count 1460.000000 1460.000000 1460 1201.000000 1460.000000 1460 91 1460 1460 1460 ... 1460.000000 7 281 54 1460.000000 1460.000000 1460.000000 1460 1460 1460.000000
unique NaN NaN 5 NaN NaN 2 2 4 4 2 ... NaN 3 4 4 NaN NaN NaN 9 6 NaN
top NaN NaN RL NaN NaN Pave Grvl Reg Lvl AllPub ... NaN Gd MnPrv Shed NaN NaN NaN WD Normal NaN
freq NaN NaN 1151 NaN NaN 1454 50 925 1311 1459 ... NaN 3 157 49 NaN NaN NaN 1267 1198 NaN
mean 730.500000 56.897260 NaN 70.049958 10516.828082 NaN NaN NaN NaN NaN ... 2.758904 NaN NaN NaN 43.489041 6.321918 2007.815753 NaN NaN 180921.195890
std 421.610009 42.300571 NaN 24.284752 9981.264932 NaN NaN NaN NaN NaN ... 40.177307 NaN NaN NaN 496.123024 2.703626 1.328095 NaN NaN 79442.502883
min 1.000000 20.000000 NaN 21.000000 1300.000000 NaN NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 1.000000 2006.000000 NaN NaN 34900.000000
25% 365.750000 20.000000 NaN 59.000000 7553.500000 NaN NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 5.000000 2007.000000 NaN NaN 129975.000000
50% 730.500000 50.000000 NaN 69.000000 9478.500000 NaN NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 6.000000 2008.000000 NaN NaN 163000.000000
75% 1095.250000 70.000000 NaN 80.000000 11601.500000 NaN NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 8.000000 2009.000000 NaN NaN 214000.000000
max 1460.000000 190.000000 NaN 313.000000 215245.000000 NaN NaN NaN NaN NaN ... 738.000000 NaN NaN NaN 15500.000000 12.000000 2010.000000 NaN NaN 755000.000000

11 rows × 81 columns

In [38]:
# Check the shape
df_raw.shape
Out[38]:
(1460, 81)
In [39]:
# Check more info
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Data Visualization

In [40]:
# Rearrange columns so that SalePrice comes first
df_raw = df_raw[['SalePrice','Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
       'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
       'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive',
       'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature',
       'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']]
In [41]:
# Show correlations between features with a heatmap
# (corrplot comes from the third-party `heatmap` package)
from heatmap import corrplot
plt.figure(figsize=(12, 8))
corrplot(df_raw.corr(numeric_only=True), size_scale=300)  # numeric_only is required in pandas >= 2.0
In [42]:
# When were the properties sold (year and month)?
df_raw.groupby(['YrSold', 'MoSold']).Id.count().plot(kind = 'bar', figsize = (14,4))
plt.title('When was the property sold?')
plt.show()
In [43]:
# Where are the properties located?
df_raw.groupby(['Neighborhood']).Id.count().sort_values().plot(kind = 'bar', figsize = (14,4))
plt.title('Where are most of the properties located?')
plt.show()
In [44]:
# Show SalePrice by OverallQual
plt.figure(figsize= (10, 5))
sns.barplot(x='OverallQual', y='SalePrice', data=df_raw)
plt.show()
In [45]:
# Show a regression plot of SalePrice against GrLivArea
plt.figure(figsize= (10, 5))
sns.regplot(x='GrLivArea', y='SalePrice', data=df_raw, scatter_kws= {'alpha': 0.3})
plt.show()
In [46]:
# Show SalePrice distribution
sns.distplot(df_raw['SalePrice'])
plt.show()

Feature Engineering

Mathematical Transformation and Removing Outliers

In [47]:
# Most machine learning algorithms work best with normally distributed data
# Transform the dependent variable (SalePrice) by taking its log
target = np.log(df_raw['SalePrice'])
sns.distplot(target)
plt.show()
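To see why the log transform helps, here is a minimal standalone sketch: synthetic lognormal "prices" (an assumption standing in for SalePrice, which is not reloaded here) have a large positive skew before the transform and near-zero skew after it.

```python
# Sketch: log-transforming a right-skewed target pulls it toward normal.
# Synthetic lognormal "prices" stand in for SalePrice here.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

print(f"skew before log: {prices.skew():.2f}")          # clearly right-skewed
print(f"skew after log:  {np.log(prices).skew():.2f}")  # close to 0
```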
In [48]:
# Remove outliers: keep only rows at or below the 99th percentile of SalePrice
df_raw = df_raw[df_raw['SalePrice'] <= df_raw['SalePrice'].quantile(0.99)]
In [49]:
# Re-check the scatter after removing outliers
plt.figure(figsize= (10, 5))
sns.regplot(x='GrLivArea', y='SalePrice', data=df_raw, scatter_kws= {'alpha': 0.3})
plt.show()

Check Features

In [50]:
# Find numerical data
numeric_data = df_raw.select_dtypes(include = [np.number])

# Find categorical data
categorical_data = df_raw.select_dtypes(exclude = [np.number])

# Print the numbers of numerical and categorical
print('There are {0} numerical and {1} categorical features in the data'.format(numeric_data.shape[1], categorical_data.shape[1]))

# Check if there are duplicates
print('Dataset duplicate IDs: {}'.format(df_raw.duplicated('Id').sum()))
There are 38 numerical and 43 categorical features in the data
Dataset duplicate IDs: 0
In [51]:
# Check for missing values
df_raw.isna().mean()
Out[51]:
SalePrice        0.000000
Id               0.000000
MSSubClass       0.000000
MSZoning         0.000000
LotFrontage      0.178547
                   ...   
MiscVal          0.000000
MoSold           0.000000
YrSold           0.000000
SaleType         0.000000
SaleCondition    0.000000
Length: 81, dtype: float64
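A compact way to list only the columns that actually contain missing values is to filter the per-column NaN fraction; a sketch on a toy frame (the column names mirror the real data but the values here are illustrative):

```python
# Keep only the columns whose missing-value fraction is above zero.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'LotFrontage': [65.0, np.nan],
                    'LotArea': [8450, 9600],
                    'Alley': [None, 'Grvl']})

na_frac = toy.isna().mean()
print(na_frac[na_frac > 0])  # only LotFrontage and Alley remain
```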

Missing Data Imputation

Columns in which almost all observations (roughly 95% or more) are missing will be deleted, since they carry too little information to contribute to the result. Columns with only a small fraction of missing values will be filled with random samples drawn from their observed values.

Columns with more than 5% missing values will still be included. Missing numerical values will be replaced by the column median, with an added indicator column recording which rows were missing; missing categorical values will be filled with an explicit "Missing" label, since the fact that a value is missing may itself carry information about the result.

List of columns that contain missing values

  • 3 LotFrontage 1201 non-null float64
  • 6 Alley 91 non-null object
  • 25 MasVnrType 1452 non-null object
  • 26 MasVnrArea 1452 non-null float64
  • 30 BsmtQual 1423 non-null object
  • 31 BsmtCond 1423 non-null object
  • 32 BsmtExposure 1422 non-null object
  • 33 BsmtFinType1 1423 non-null object
  • 35 BsmtFinType2 1422 non-null object
  • 42 Electrical 1459 non-null object
  • 57 FireplaceQu 770 non-null object
  • 58 GarageType 1379 non-null object
  • 59 GarageYrBlt 1379 non-null float64
  • 60 GarageFinish 1379 non-null object
  • 63 GarageQual 1379 non-null object
  • 64 GarageCond 1379 non-null object
  • 72 PoolQC 7 non-null object
  • 73 Fence 281 non-null object
  • 74 MiscFeature 54 non-null object

Columns that will be deleted:

- 6   Alley          91 non-null     object 
- 72  PoolQC         7 non-null      object
- 74  MiscFeature    54 non-null     object

Columns that will be filled with random sample data:

- 25  MasVnrType     1452 non-null   object
- 26  MasVnrArea     1452 non-null   float64
- 30  BsmtQual       1423 non-null   object
- 31  BsmtCond       1423 non-null   object
- 32  BsmtExposure   1422 non-null   object
- 33  BsmtFinType1   1423 non-null   object 
- 35  BsmtFinType2   1422 non-null   object
- 42  Electrical     1459 non-null   object 
- 59  GarageYrBlt    1379 non-null   float64
- 60  GarageFinish   1379 non-null   object 
- 63  GarageQual     1379 non-null   object 
- 64  GarageCond     1379 non-null   object
- 58  GarageType     1379 non-null   object

Columns that will be filled with median values and added a missing variable column

- 3   LotFrontage    1201 non-null   float64

Columns that will be filled "missing" data in the missing categorical data

- 57  FireplaceQu    770 non-null    object
- 73  Fence          281 non-null    object
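The three imputation strategies above can be sketched with plain pandas on a toy frame (the real data is handled by feature_engine in the next cell; the values here are illustrative):

```python
# Sketch of the three imputation strategies: median + indicator column,
# random-sample imputation, and an explicit "Missing" label.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan, 84.0],  # numeric, >5% missing
    'BsmtQual':    ['Gd', None, 'TA', None, 'Ex'],      # few values missing
    'FireplaceQu': ['Gd', None, 'TA', None, 'Ex'],      # kept as a category
})

# 1) Median + indicator column for a numerical variable
df['LotFrontage_na'] = df['LotFrontage'].isna().astype(int)
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

# 2) Random-sample imputation: draw replacements from the observed values
observed = df['BsmtQual'].dropna()
mask = df['BsmtQual'].isna()
df.loc[mask, 'BsmtQual'] = observed.sample(mask.sum(), replace=True,
                                           random_state=0).to_numpy()

# 3) Explicit "Missing" category
df['FireplaceQu'] = df['FireplaceQu'].fillna('Missing')

print(df)
```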
In [52]:
# Import the feature_engine library for missing data imputation
# (older feature_engine API; in versions >= 1.0 these classes live in feature_engine.imputation)
from feature_engine import missing_data_imputers as mdi

# Create a copy
data = df_raw.copy()

# Drop the columns in which almost all values are missing
data = data.drop(['Alley', 'PoolQC', 'MiscFeature'], axis = 1)

# Fill the moderately missing columns with random samples drawn from their observed values
crs_imputer = mdi.RandomSampleImputer(random_state=0, variables=['MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'])
crs_imputer.fit(data)
data_v1 = crs_imputer.transform(data)

# Add an indicator column flagging the rows where LotFrontage is missing
cmav_imputer = mdi.AddMissingIndicator(variables = ['LotFrontage'])
cmav_imputer.fit(data_v1)
data_v2 = cmav_imputer.transform(data_v1)

# Fill the missing LotFrontage values with the column median
median_imputer = mdi.MeanMedianImputer(imputation_method='median', variables=['LotFrontage'])
median_imputer.fit(data_v2)
data_v3 = median_imputer.transform(data_v2)

# Fill the missing categorical values with an explicit "Missing" label
cat_imputer = mdi.CategoricalVariableImputer(variables=['FireplaceQu', 'Fence'])
cat_imputer.fit(data_v3)
data_v4 = cat_imputer.transform(data_v3)
In [53]:
data_v4.head()
Out[53]:
SalePrice Id MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities ... 3SsnPorch ScreenPorch PoolArea Fence MiscVal MoSold YrSold SaleType SaleCondition LotFrontage_na
0 208500 1 60 RL 65.0 8450 Pave Reg Lvl AllPub ... 0 0 0 Missing 0 2 2008 WD Normal 0
1 181500 2 20 RL 80.0 9600 Pave Reg Lvl AllPub ... 0 0 0 Missing 0 5 2007 WD Normal 0
2 223500 3 60 RL 68.0 11250 Pave IR1 Lvl AllPub ... 0 0 0 Missing 0 9 2008 WD Normal 0
3 140000 4 70 RL 60.0 9550 Pave IR1 Lvl AllPub ... 0 0 0 Missing 0 2 2006 WD Abnorml 0
4 250000 5 60 RL 84.0 14260 Pave IR1 Lvl AllPub ... 0 0 0 Missing 0 12 2008 WD Normal 0

5 rows × 79 columns

Categorical Variable Encoding

In [54]:
data_v4.select_dtypes(exclude = [np.number])
Out[54]:
MSZoning Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 ... Functional FireplaceQu GarageType GarageFinish GarageQual GarageCond PavedDrive Fence SaleType SaleCondition
0 RL Pave Reg Lvl AllPub Inside Gtl CollgCr Norm Norm ... Typ Missing Attchd RFn TA TA Y Missing WD Normal
1 RL Pave Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm ... Typ TA Attchd RFn TA TA Y Missing WD Normal
2 RL Pave IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm ... Typ TA Attchd RFn TA TA Y Missing WD Normal
3 RL Pave IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm ... Typ Gd Detchd Unf TA TA Y Missing WD Abnorml
4 RL Pave IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm ... Typ TA Attchd RFn TA TA Y Missing WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 RL Pave Reg Lvl AllPub Inside Gtl Gilbert Norm Norm ... Typ TA Attchd RFn TA TA Y Missing WD Normal
1456 RL Pave Reg Lvl AllPub Inside Gtl NWAmes Norm Norm ... Min1 TA Attchd Unf TA TA Y MnPrv WD Normal
1457 RL Pave Reg Lvl AllPub Inside Gtl Crawfor Norm Norm ... Typ Gd Attchd RFn TA TA Y GdPrv WD Normal
1458 RL Pave Reg Lvl AllPub Inside Gtl NAmes Norm Norm ... Typ Missing Attchd Unf TA TA Y Missing WD Normal
1459 RL Pave Reg Lvl AllPub Inside Gtl Edwards Norm Norm ... Typ Missing Attchd Fin TA TA Y Missing WD Normal

1445 rows × 40 columns

In [55]:
# Apply one-hot encoding to the categorical variables
# (older feature_engine API; newer versions expose feature_engine.encoding.OneHotEncoder)
from feature_engine.categorical_encoders import OneHotCategoricalEncoder

ohe = OneHotCategoricalEncoder(top_categories=None)
ohe.fit(data_v4)
data_v5 = ohe.transform(data_v4)
In [56]:
data_v5.head()
Out[56]:
SalePrice Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea ... SaleType_CWD SaleType_ConLw SaleType_Con SaleType_Oth SaleCondition_Normal SaleCondition_Abnorml SaleCondition_Partial SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family
0 208500 1 60 65.0 8450 7 5 2003 2003 196.0 ... 0 0 0 0 1 0 0 0 0 0
1 181500 2 20 80.0 9600 6 8 1976 1976 0.0 ... 0 0 0 0 1 0 0 0 0 0
2 223500 3 60 68.0 11250 7 5 2001 2002 162.0 ... 0 0 0 0 1 0 0 0 0 0
3 140000 4 70 60.0 9550 7 5 1915 1970 0.0 ... 0 0 0 0 0 1 0 0 0 0
4 250000 5 60 84.0 14260 8 5 2000 2000 350.0 ... 0 0 0 0 1 0 0 0 0 0

5 rows × 284 columns
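For readers without feature_engine installed, plain pandas gives an equivalent expansion (a sketch on a toy frame; feature_engine's column naming may differ slightly):

```python
# One-hot encoding with plain pandas: every category becomes a 0/1 column,
# matching top_categories=None above (no categories dropped).
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                    'CentralAir': ['Y', 'N', 'Y']})
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
# ['Street_Grvl', 'Street_Pave', 'CentralAir_N', 'CentralAir_Y']
```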

Data Pre-processing

In [57]:
# Define the independent and dependent variables
x = data_v5.drop(['SalePrice', 'Id'], axis = 1)
y = np.log(data_v5['SalePrice'])
In [58]:
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Out[58]:
((1156, 282), (1156,), (289, 282), (289,))
In [59]:
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
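Note that the scaler is fit on the training split only, so the test set is standardized with the training statistics (avoiding leakage from test data). A toy sketch:

```python
# StandardScaler learns mean/std from the training data only; test data is
# then standardized with those same training statistics.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])  # train mean 2.5, std ~1.118
X_te = np.array([[10.0]])

sc = StandardScaler().fit(X_tr)
print(sc.transform(X_tr).mean())  # ~0.0 on the training data
print(sc.transform(X_te))         # (10 - 2.5) / 1.118, about 6.71
```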

Model creation

In [60]:
# Create Machine Learning models
def models(X_train, y_train):
    
    # Fitting SVR to the dataset
    from sklearn.svm import SVR
    svr = SVR(kernel = 'rbf')
    svr.fit(X_train, y_train)
    
    # Fitting Decision Tree Regression to the dataset
    from sklearn.tree import DecisionTreeRegressor
    tree = DecisionTreeRegressor(random_state = 0)
    tree.fit(X_train, y_train)
    
    # Fitting Random Forest Regression to the dataset
    from sklearn.ensemble import RandomForestRegressor
    forest = RandomForestRegressor(n_estimators = 20, criterion = 'squared_error', random_state = 0)  # 'mse' in scikit-learn < 1.0
    forest.fit(X_train, y_train)
    
    # Fitting XGBoost Regression to the dataset
    from xgboost import XGBRegressor
    xgb = XGBRegressor()
    xgb.fit(X_train, y_train)
        
    return svr, tree, forest, xgb

model = models(X_train, y_train)

Model Evaluation

In [61]:
# Check Mean Absolute Error
from sklearn.metrics import mean_absolute_error

for i in range(len(model)):
    y_predicted = model[i].predict(X_test)

    mae = mean_absolute_error(y_test, y_predicted)
    
    print('model[{}] Mean Absolute Error: "{}"'.format(i, mae))
model[0] Mean Absolute Error: "0.12544607717202494"
model[1] Mean Absolute Error: "0.1335431310233288"
model[2] Mean Absolute Error: "0.09739699530831145"
model[3] Mean Absolute Error: "0.09727659023850078"
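Because the target is log(SalePrice), these MAE values are in log-price units and read as relative errors: the best model's MAE of roughly 0.097 means predictions are off by about a factor of exp(0.097), i.e. around 10% of the sale price.

```python
# Interpreting an MAE measured in log-price space as a relative error.
import numpy as np

log_mae = 0.097  # best MAE reported above (XGBoost)
print(f"typical relative error: {np.exp(log_mae) - 1:.1%}")  # about 10%
```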
In [62]:
# Check Mean Squared Error
from sklearn.metrics import mean_squared_error

for i in range(len(model)):
    y_predicted = model[i].predict(X_test)

    mse = mean_squared_error(y_test, y_predicted)
    
    print('model[{}] Mean Squared Error: "{}"'.format(i, mse))
model[0] Mean Squared Error: "0.04271881161705415"
model[1] Mean Squared Error: "0.039165079343431566"
model[2] Mean Squared Error: "0.02313194023434092"
model[3] Mean Squared Error: "0.020585188884537663"
In [63]:
# Check Root Mean Squared Error
from math import sqrt

for i in range(len(model)):
    y_predicted = model[i].predict(X_test)

    rmse = sqrt(mean_squared_error(y_test, y_predicted))
    
    print('model[{}] Root Mean Squared Error: "{}"'.format(i, rmse))
model[0] Root Mean Squared Error: "0.2066852960833309"
model[1] Root Mean Squared Error: "0.19790169110806397"
model[2] Root Mean Squared Error: "0.15209188089553274"
model[3] Root Mean Squared Error: "0.1434753947007558"
In [64]:
# Check R2 Score
from sklearn.metrics import r2_score

for i in range(len(model)):
    y_predicted = model[i].predict(X_test)

    r2 = r2_score(y_test, y_predicted)
    
    print('model[{}] R2 Score: "{}"'.format(i, r2))
model[0] R2 Score: "0.731391921556362"
model[1] R2 Score: "0.7537371404701725"
model[2] R2 Score: "0.8545505883281904"
model[3] R2 Score: "0.8705640952692706"
In [65]:
# Run 10-fold cross-validation (cross_val_score returns R2 for regressors by default)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator = model[3], X = X_train, y = y_train, cv = 10)
print('Mean R2:', scores.mean())
print('Standard Deviation:', scores.std())
Mean R2: 0.8674728138350151
Standard Deviation: 0.02599023984769724
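For a regressor, cross_val_score uses the estimator's default .score, which is R²; other metrics go through the scoring argument. A standalone sketch on synthetic data (the notebook's X_train is not in scope here, so a simple linear model stands in):

```python
# cross_val_score defaults to R2 for regressors; pass `scoring=` for others.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=100)

r2 = cross_val_score(LinearRegression(), X, y, cv=5)  # R2 per fold
mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                       scoring='neg_mean_absolute_error')  # sign-flipped MAE
print(f"mean R2: {r2.mean():.3f}  mean MAE: {mae.mean():.3f}")
```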
In [66]:
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(model[3].feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
                     importance
feature                        
BsmtQual_Ex               0.169
GarageCars                0.080
Exterior1st_BrkComm       0.077
CentralAir_Y              0.069
MSZoning_RM               0.049
TotalBsmtSF               0.037
OverallQual               0.035
Fireplaces                0.023
GrLivArea                 0.023
Street_Pave               0.022
In [67]:
# Visualize the importance
importances.head(10).plot.bar()
plt.show()

Conclusion

Several factors influence the price a buyer is willing to pay for a house. Some are apparent and obvious, and some are not; nevertheless, a rational approach facilitated by machine learning can be very useful in predicting the price.

A large dataset of residential homes in Ames, Iowa, with 79 features (living area, number of rooms, location, etc.) and their sale prices, was provided. The XGBoost model achieved an R2 score of about 0.87; it captures the relationship between the important features and the price, which can then be used to predict the prices of a new set of houses.