Private/Public University Classification Using Spark ML

Overview

  • The data set contains features of universities, each labeled as either private or public.

  • In this project, three tree-based classification methods from Spark ML are implemented and their results compared on the universities data set.

Source:

  • Kaggle

Install Pyspark

In [1]:
# Install pyspark
!pip install pyspark
Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/8e/b0/bf9020b56492281b9c9d8aae8f44ff51e1bc91b3ef5a884385cb4e389a40/pyspark-3.0.0.tar.gz (204.7MB)
     |████████████████████████████████| 204.7MB 67kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 40.2MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.0-py2.py3-none-any.whl size=205044182 sha256=1198b475ceed345be6c2f72487020cf90d35a8c0e2fe0e31f033c494d83dfe69
  Stored in directory: /root/.cache/pip/wheels/57/27/4d/ddacf7143f8d5b76c45c61ee2e43d9f8492fc5a8e78ebd7d37
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.0

Create Spark Session

In [2]:
# Import library
from pyspark.sql import SparkSession
In [3]:
# Create spark session
spark = SparkSession.builder.appName('tree').getOrCreate()
In [4]:
# Import csv file
data = spark.read.csv('/content/College.csv', inferSchema = True, header = True)
In [5]:
# Check schema
data.printSchema()
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)

In [6]:
# Inspect the first data row
data.head(1)
Out[6]:
[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

Vector Assembler Application

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
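
As a minimal sketch of this behavior (using a hypothetical two-column DataFrame, not the college data), the assembler simply concatenates the inputs in the order listed:

# Hypothetical example: combine columns 'a' and 'b' into one vector column
from pyspark.ml.feature import VectorAssembler
toy = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['a', 'b'])
VectorAssembler(inputCols = ['a', 'b'], outputCol = 'features').transform(toy).show()
# The 'features' column holds [1.0, 2.0] and [3.0, 4.0]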

In [7]:
# Import library
from pyspark.ml.feature import VectorAssembler

# Check data columns
data.columns
Out[7]:
['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']
In [8]:
# Assemble the selected numeric columns into a single feature vector
assembler = VectorAssembler(inputCols = ['Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'], outputCol = 'features')
In [9]:
# Apply the assembler to append the combined 'features' vector column
output = assembler.transform(data)

String Indexer Application

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
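
As a minimal sketch (hypothetical data), frequency ordering means the most common string receives index 0.0:

# Hypothetical example: 'Yes' appears twice, so it is indexed 0.0; 'No' gets 1.0
from pyspark.ml.feature import StringIndexer
toy = spark.createDataFrame([('Yes',), ('Yes',), ('No',)], ['label'])
StringIndexer(inputCol = 'label', outputCol = 'labelIndex').fit(toy).transform(toy).show()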

In [10]:
# Import library
from pyspark.ml.feature import StringIndexer

# Create an indexer that encodes the 'Private' string column to label indices
indexer = StringIndexer(inputCol = 'Private', outputCol = 'PrivateIndex')
In [11]:
# Fit the indexer on the data and append the 'PrivateIndex' column
output_fixed = indexer.fit(output).transform(output)
In [12]:
# Check schema
output_fixed.printSchema()
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)

Data Splitting

In [13]:
# Select the feature vector and the label column
final_data = output_fixed.select('features', 'PrivateIndex')
In [14]:
# Split the data for training and testing
train_data, test_data = final_data.randomSplit([0.7, 0.3])
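
Note that randomSplit is random across runs, so the exact metrics below will vary slightly between executions. A sketch of a reproducible alternative (the seed value is arbitrary):

# Fix the seed so the split, and the metrics computed on it, are reproducible
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed = 42)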

Model Creation

Models:

- DecisionTreeClassifier
- RandomForestClassifier
- GBTClassifier
In [15]:
# Import libraries
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier)
from pyspark.ml import Pipeline
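
Pipeline is imported here, although the stages in this notebook are applied one by one. For reference, a minimal sketch of the equivalent Pipeline approach (train_raw and test_raw are hypothetical names for a raw, untransformed split of `data`):

# Sketch: chain assembler, indexer, and a classifier so raw data can be fed directly
pipe = Pipeline(stages = [assembler, indexer, DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol = 'features')])
pipe_model = pipe.fit(train_raw)
pipe_preds = pipe_model.transform(test_raw)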
In [16]:
# Instantiate the three classifiers, specifying the label and feature columns
dtc = DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
rfc = RandomForestClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
gbt = GBTClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
In [17]:
# Fit each model on the training data
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
In [18]:
# Generate predictions on the test data
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

Model Evaluation

In [19]:
# Import library
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create the evaluator (default metric: area under the ROC curve)
my_binary_eval = BinaryClassificationEvaluator(labelCol = 'PrivateIndex')

# Show evaluation
print('DTC:', my_binary_eval.evaluate(dtc_preds))
DTC: 0.7815201192250373
In [20]:
# Show evaluation for the random forest using the same evaluator
print('RFC:', my_binary_eval.evaluate(rfc_preds))
RFC: 0.9819175360158966
In [22]:
# Check schema
gbt_preds.printSchema()
root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

In [23]:
# Evaluator that scores GBT on the hard 'prediction' column
my_binary_eval_2 = BinaryClassificationEvaluator(labelCol = 'PrivateIndex', rawPredictionCol= 'prediction')

# Show evaluation
print('GBT:', my_binary_eval_2.evaluate(gbt_preds))
GBT: 0.8271733730750124
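
Note that the schema printed above shows the Spark 3.0 GBTClassifier does emit rawPrediction and probability columns, so the default evaluator can also be used directly; scoring on the hard 'prediction' column, as above, generally understates the AUC. A sketch:

# Sketch: evaluate GBT on its rawPrediction column (the evaluator default)
print('GBT (rawPrediction):', my_binary_eval.evaluate(gbt_preds))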
In [24]:
# Import library
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create an evaluator that computes accuracy
acc_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy')
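
The same evaluator class also supports metrics such as 'f1', 'weightedPrecision', and 'weightedRecall'; for example, a sketch:

# Sketch: an F1 evaluator instead of accuracy
f1_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'f1')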
In [25]:
# Check accuracy
dtc_acc = acc_eval.evaluate(dtc_preds)

# Show accuracy
dtc_acc
Out[25]:
0.8529411764705882
In [26]:
# Check accuracy
rfc_acc = acc_eval.evaluate(rfc_preds)

# Show accuracy
rfc_acc
Out[26]:
0.9243697478991597
In [27]:
# Check accuracy
gbt_acc = acc_eval.evaluate(gbt_preds)

# Show accuracy
gbt_acc
Out[27]:
0.8613445378151261

Conclusion

  • DecisionTreeClassifier achieved about 85% test accuracy
  • RandomForestClassifier achieved about 92% test accuracy
  • GBTClassifier achieved about 86% test accuracy

These models could be improved further by tuning their hyperparameters.
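
A minimal sketch of such tuning with CrossValidator (the grid values are illustrative, and rfc, train_data, and the evaluator import come from the cells above):

# Sketch: 3-fold cross-validated grid search over the random forest
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
grid = (ParamGridBuilder()
        .addGrid(rfc.numTrees, [20, 50, 100])
        .addGrid(rfc.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator = rfc, estimatorParamMaps = grid,
                    evaluator = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy'),
                    numFolds = 3)
cv_model = cv.fit(train_data)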