Problem:
Predict the price of a house from its features. A buyer or seller who does not know the exact value of a house can use supervised machine learning regression algorithms to estimate its price simply by providing the features of the target house.
Dataset:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. This dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
Source: The Ames Housing dataset was compiled by Dean De Cock.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Import data set
df_raw = pd.read_csv('project_data/house_price_train.csv')
df_raw.head(10)
# Check the description
df_raw.describe(include = 'all')
# Check the shape
df_raw.shape
# Check more info
df_raw.info()
# Rearrange columns so that SalePrice comes first
df_raw = df_raw[['SalePrice','Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive',
'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature',
'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']]
# Show correlations between the features using a heatmap
# ('heatmap' here is a third-party correlation-plot helper, not part of seaborn)
from heatmap import corrplot
plt.figure(figsize=(12, 8))
corrplot(df_raw.corr(), size_scale=300)
# In which year and month were the properties sold?
df_raw.groupby(['YrSold', 'MoSold']).Id.count().plot(kind = 'bar', figsize = (14,4))
plt.title('When were the properties sold?')
plt.show()
# Where are the properties located?
df_raw.groupby(['Neighborhood']).Id.count().sort_values().plot(kind = 'bar', figsize = (14,4))
plt.title('Where are most of the properties located?')
plt.show()
# Show SalePrice by OverallQual
plt.figure(figsize= (10, 5))
sns.barplot(x = df_raw['OverallQual'], y = df_raw['SalePrice'])
plt.show()
# Show regression plot of GrLivArea against SalePrice
plt.figure(figsize= (10, 5))
sns.regplot(x = df_raw['GrLivArea'], y = df_raw['SalePrice'], scatter_kws= {'alpha': 0.3})
plt.show()
# Show SalePrice distribution
sns.distplot(df_raw['SalePrice'])
plt.show()
# Most machine learning algorithms work well with data that are normally distributed
# Transform the dependent variable (SalePrice) by taking its log
target = np.log(df_raw['SalePrice'])
sns.distplot(target)
plt.show()
# Remove the outliers (keep only observations at or below the 99th percentile of SalePrice)
df_raw = df_raw[df_raw['SalePrice'] <= df_raw['SalePrice'].quantile(0.99)]
# Check for outliers again after trimming
plt.figure(figsize= (10, 5))
sns.regplot(x = df_raw['GrLivArea'], y = df_raw['SalePrice'], scatter_kws= {'alpha': 0.3})
plt.show()
# Find numerical data
numeric_data = df_raw.select_dtypes(include = [np.number])
# Find categorical data
categorical_data = df_raw.select_dtypes(exclude = [np.number])
# Print the numbers of numerical and categorical
print('There are {0} numerical and {1} categorical features in the data'.format(numeric_data.shape[1], categorical_data.shape[1]))
# Check if there are duplicates
print('Dataset duplicate IDs: {}'.format(df_raw.duplicated('Id').sum()))
# Check for missing values
df_raw.isna().mean()
Columns in which less than 5% of the observations are present will be deleted, since they contribute no significant value to the result. Columns with no more than 5% missing values will be filled with random sample values.
Columns with more than 5% missing values will still be included: missing numerical values will be filled with the median, and an extra indicator column will be added to flag where the value was missing, while missing categorical values will be filled with a "Missing" label, because the fact that a value is missing may itself carry information. A small sketch after the list below shows how these groups can be derived from the missing-value percentages.
List of columns that contain missing values:
Columns that will be deleted:
- 6 Alley 91 non-null object
- 72 PoolQC 7 non-null object
- 74 MiscFeature 54 non-null object
Columns that will be filled with random sample data:
- 25 MasVnrType 1452 non-null object
- 26 MasVnrArea 1452 non-null float64
- 30 BsmtQual 1423 non-null object
- 31 BsmtCond 1423 non-null object
- 32 BsmtExposure 1422 non-null object
- 33 BsmtFinType1 1423 non-null object
- 35 BsmtFinType2 1422 non-null object
- 42 Electrical 1459 non-null object
- 59 GarageYrBlt 1379 non-null float64
- 60 GarageFinish 1379 non-null object
- 63 GarageQual 1379 non-null object
- 64 GarageCond 1379 non-null object
- 58 GarageType 1379 non-null object
Columns that will be filled with median values, plus an added missing-indicator column:
- 3 LotFrontage 1201 non-null float64
Columns that will be filled "missing" data in the missing categorical data
- 57 FireplaceQu 770 non-null object
- 73 Fence 281 non-null object
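As a sanity check, the three groups above can also be derived programmatically from the share of missing values per column. This is only a sketch built on the 5% thresholds described earlier; the boundaries are approximate, so it may not reproduce the hand-picked lists exactly.
# Sketch: derive the imputation groups from the missing-value share per column
missing_share = df_raw.isna().mean()
# Columns where almost all observations are missing (non-null share below ~5%) -> drop
drop_cols = missing_share[missing_share > 0.95].index.tolist()
# Columns with a small share of missing values -> random-sample imputation
random_sample_cols = missing_share[(missing_share > 0) & (missing_share <= 0.05)].index.tolist()
# Remaining columns with larger gaps -> median + indicator (numerical) or "Missing" label (categorical)
large_gap_cols = [c for c in missing_share[missing_share > 0.05].index if c not in drop_cols]
print(drop_cols, random_sample_cols, large_gap_cols, sep='\n')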
# Import the feature-engine library for data engineering
# (this is the pre-1.0 feature_engine API; newer releases expose these classes under feature_engine.imputation)
from feature_engine import missing_data_imputers as mdi
# Create a copy of the raw data
data = df_raw.copy()
# Remove the columns in which less than 5% of the observations are present
data = data.drop(['Alley', 'PoolQC', 'MiscFeature'], axis = 1)
# Fill columns with a small share of missing values using random-sample imputation
crs_imputer = mdi.RandomSampleImputer(random_state=0, variables=['MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'])
crs_imputer.fit(data)
data_v1 = crs_imputer.transform(data)
# Add a missing-indicator column for the numerical feature with more than 5% missing values
cmav_imputer = mdi.AddMissingIndicator(variables = ['LotFrontage'])
cmav_imputer.fit(data_v1)
data_v2 = cmav_imputer.transform(data_v1)
# Fill the remaining missing numerical values with the median
median_imputer = mdi.MeanMedianImputer(imputation_method='median', variables=['LotFrontage'])
median_imputer.fit(data_v2)
data_v3 = median_imputer.transform(data_v2)
# Insert "Missing" in missing categorical data
cat_imputer = mdi.CategoricalVariableImputer(variables=['FireplaceQu', 'Fence'])
cat_imputer.fit(data_v3)
data_v4 = cat_imputer.transform(data_v3)
data_v4.head()
# Inspect the remaining categorical columns
data_v4.select_dtypes(exclude = [np.number])
# Apply OneHotEncoder
from feature_engine.categorical_encoders import OneHotCategoricalEncoder
ohe = OneHotCategoricalEncoder(top_categories=None)
ohe.fit(data_v4)
data_v5 = ohe.transform(data_v4)
data_v5.head()
# Define the independent and dependent variables
x = data_v5.drop(['SalePrice', 'Id'], axis = 1)
y = np.log(data_v5['SalePrice'])
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
# Create Machine Learning models
def models(X_train, y_train):
    # Fitting SVR to the dataset
    from sklearn.svm import SVR
    svr = SVR(kernel = 'rbf')
    svr.fit(X_train, y_train)
    # Fitting Decision Tree Regression to the dataset
    from sklearn.tree import DecisionTreeRegressor
    tree = DecisionTreeRegressor(random_state = 0)
    tree.fit(X_train, y_train)
    # Fitting Random Forest Regression to the dataset
    from sklearn.ensemble import RandomForestRegressor
    forest = RandomForestRegressor(n_estimators = 20, criterion = 'mse', random_state = 0)  # 'mse' was renamed to 'squared_error' in newer scikit-learn
    forest.fit(X_train, y_train)
    # Fitting XGBoost Regression to the dataset
    from xgboost import XGBRegressor
    xgb = XGBRegressor()
    xgb.fit(X_train, y_train)
    return svr, tree, forest, xgb
model = models(X_train, y_train)
# Check Mean Absolute Error
from sklearn.metrics import mean_absolute_error
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    mae = mean_absolute_error(y_test, y_predicted)
    print('model[{}] Mean Absolute Error: "{}"'.format(i, mae))
# Check Mean Squared Error
from sklearn.metrics import mean_squared_error
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    mse = mean_squared_error(y_test, y_predicted)
    print('model[{}] Mean Squared Error: "{}"'.format(i, mse))
# Check Root Mean Squared Error
from math import sqrt
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    rmse = sqrt(mean_squared_error(y_test, y_predicted))
    print('model[{}] Root Mean Squared Error: "{}"'.format(i, rmse))
# Check R2 Score
from sklearn.metrics import r2_score
for i in range(len(model)):
    y_predicted = model[i].predict(X_test)
    r2 = r2_score(y_test, y_predicted)
    print('model[{}] R2 Score: "{}"'.format(i, r2))
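Because the models were trained on log(SalePrice), all of the errors above are expressed in log units. As a rough sketch, the predictions can be mapped back to dollars with np.exp before computing a metric, for example for the XGBoost model (model[3]):
# Convert log-scale predictions back to dollars before computing the error
y_pred_log = model[3].predict(X_test)
mae_dollars = mean_absolute_error(np.exp(y_test), np.exp(y_pred_log))
print('XGBoost Mean Absolute Error in dollars: {:.0f}'.format(mae_dollars))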
# Implement 10-fold cross-validation (cross_val_score uses the regressor's default R2 score)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model[3], X = X_train, y = y_train, cv = 10)
print('Mean R2 Score:', accuracies.mean())
print('Standard Deviation:', accuracies.std())
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(model[3].feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
# Visualize the importance
importances.head(10).plot.bar()
plt.show()
There are several factors that influence the price a buyer is willing to pay for a house. Some are apparent and obvious and some are not. Nevertheless, a rational approach facilitated by machine learning can be very useful in predicting the house price.
A large dataset with 79 different features (such as living area, number of rooms, and location), along with sale prices, is provided for residential homes in Ames, Iowa. Using the XGBoost model we obtained an R2 score of roughly 0.87; from the model we learn the relationship between the important features and the price, and we can use that relationship to predict the prices of a new set of houses.
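As a small illustration of that last step, the sketch below turns the held-out split back into dollar price predictions per house; it reuses the fitted XGBoost model (model[3]) and inverts the log transform with np.exp.
# Sketch: predicted vs. actual sale prices (in dollars) for the held-out houses
predicted_prices = pd.DataFrame({
    'Id': data_v5.loc[x_test.index, 'Id'].values,
    'PredictedSalePrice': np.exp(model[3].predict(X_test)),
    'ActualSalePrice': data_v5.loc[x_test.index, 'SalePrice'].values,
})
print(predicted_prices.head())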