Private/Public University Classification Using Spark ML

Overview

  • The data set contains features of universities, each labeled as either private or public.

  • In this project, three tree-based classification methods from Spark ML are implemented and their results compared on the universities data set.

Source:

  • Kaggle

Install Pyspark

In [1]:
# Install pyspark
!pip install pyspark
Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/8e/b0/bf9020b56492281b9c9d8aae8f44ff51e1bc91b3ef5a884385cb4e389a40/pyspark-3.0.0.tar.gz (204.7MB)
     |████████████████████████████████| 204.7MB 67kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 40.2MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.0-py2.py3-none-any.whl size=205044182 sha256=1198b475ceed345be6c2f72487020cf90d35a8c0e2fe0e31f033c494d83dfe69
  Stored in directory: /root/.cache/pip/wheels/57/27/4d/ddacf7143f8d5b76c45c61ee2e43d9f8492fc5a8e78ebd7d37
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.0

Create Spark Session

In [2]:
# Import library
from pyspark.sql import SparkSession
In [3]:
# Create spark session
spark = SparkSession.builder.appName('tree').getOrCreate()
In [4]:
# Import csv file
data = spark.read.csv('/content/College.csv', inferSchema = True, header = True)
In [5]:
# Check schema
data.printSchema()
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)

In [6]:
# Inspect the first data row
data.head(1)
Out[6]:
[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

Vector Assembler Application

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
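
As a minimal sketch of this behavior (using a hypothetical two-column DataFrame, not the college data), the assembler simply concatenates the inputs in the order listed:

# Hypothetical example: combine columns 'a' and 'b' into one vector column
from pyspark.ml.feature import VectorAssembler
toy = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['a', 'b'])
VectorAssembler(inputCols = ['a', 'b'], outputCol = 'features').transform(toy).show()
# The 'features' column holds [1.0, 2.0] and [3.0, 4.0]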

In [7]:
# Import library
from pyspark.ml.feature import VectorAssembler

# Check data columns
data.columns
Out[7]:
['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']
In [8]:
# Assemble the selected numeric columns into a single feature vector
assembler = VectorAssembler(inputCols = ['Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'], outputCol = 'features')
In [9]:
# Apply the assembler to append the combined 'features' vector column
output = assembler.transform(data)

String Indexer Application

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
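
As a minimal sketch (hypothetical data), frequency ordering means the most common string receives index 0.0:

# Hypothetical example: 'Yes' appears twice, so it is indexed 0.0; 'No' gets 1.0
from pyspark.ml.feature import StringIndexer
toy = spark.createDataFrame([('Yes',), ('Yes',), ('No',)], ['label'])
StringIndexer(inputCol = 'label', outputCol = 'labelIndex').fit(toy).transform(toy).show()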

In [10]:
# Import library
from pyspark.ml.feature import StringIndexer

# Create an indexer that encodes the 'Private' string column to label indices
indexer = StringIndexer(inputCol = 'Private', outputCol = 'PrivateIndex')
In [11]:
# Fit the indexer on the data and append the 'PrivateIndex' column
output_fixed = indexer.fit(output).transform(output)
In [12]:
# Check schema
output_fixed.printSchema()
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)

Data Splitting

In [13]:
# Select the feature vector and the label column
final_data = output_fixed.select('features', 'PrivateIndex')
In [14]:
# Split the data for training and testing
train_data, test_data = final_data.randomSplit([0.7, 0.3])
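
Note that randomSplit is random across runs, so the exact metrics below will vary slightly between executions. A sketch of a reproducible alternative (the seed value is arbitrary):

# Fix the seed so the split, and the metrics computed on it, are reproducible
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed = 42)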

Model Creation

Models:

- DecisionTreeClassifier
- RandomForestClassifier
- GBTClassifier
In [15]:
# Import libraries
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier)
from pyspark.ml import Pipeline
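
Pipeline is imported here, although the stages in this notebook are applied one by one. For reference, a minimal sketch of the equivalent Pipeline approach (train_raw and test_raw are hypothetical names for a raw, untransformed split of `data`):

# Sketch: chain assembler, indexer, and a classifier so raw data can be fed directly
pipe = Pipeline(stages = [assembler, indexer, DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol = 'features')])
pipe_model = pipe.fit(train_raw)
pipe_preds = pipe_model.transform(test_raw)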
In [16]:
# Instantiate the three classifiers, specifying the label and feature columns
dtc = DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
rfc = RandomForestClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
gbt = GBTClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
In [17]:
# Fit each model on the training data
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
In [18]:
# Generate predictions on the test data
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

Model Evaluation

In [19]:
# Import library
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create the evaluator (default metric: area under the ROC curve)
my_binary_eval = BinaryClassificationEvaluator(labelCol = 'PrivateIndex')

# Show evaluation
print('DTC:', my_binary_eval.evaluate(dtc_preds))
DTC: 0.7815201192250373
In [20]:
# Show evaluation for the random forest using the same evaluator
print('RFC:', my_binary_eval.evaluate(rfc_preds))
RFC: 0.9819175360158966
In [22]:
# Check schema
gbt_preds.printSchema()
root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

In [23]:
# Evaluator that scores GBT on the hard 'prediction' column
my_binary_eval_2 = BinaryClassificationEvaluator(labelCol = 'PrivateIndex', rawPredictionCol= 'prediction')

# Show evaluation
print('GBT:', my_binary_eval_2.evaluate(gbt_preds))
GBT: 0.8271733730750124
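
Note that the schema printed above shows the Spark 3.0 GBTClassifier does emit rawPrediction and probability columns, so the default evaluator can also be used directly; scoring on the hard 'prediction' column, as above, generally understates the AUC. A sketch:

# Sketch: evaluate GBT on its rawPrediction column (the evaluator default)
print('GBT (rawPrediction):', my_binary_eval.evaluate(gbt_preds))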
In [24]:
# Import library
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create an evaluator that computes accuracy
acc_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy')
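
The same evaluator class also supports metrics such as 'f1', 'weightedPrecision', and 'weightedRecall'; for example, a sketch:

# Sketch: an F1 evaluator instead of accuracy
f1_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'f1')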
In [25]:
# Check accuracy
dtc_acc = acc_eval.evaluate(dtc_preds)

# Show accuracy
dtc_acc
Out[25]:
0.8529411764705882
In [26]:
# Check accuracy
rfc_acc = acc_eval.evaluate(rfc_preds)

# Show accuracy
rfc_acc
Out[26]:
0.9243697478991597
In [27]:
# Check accuracy
gbt_acc = acc_eval.evaluate(gbt_preds)

# Show accuracy
gbt_acc
Out[27]:
0.8613445378151261

Conclusion

  • DecisionTreeClassifier achieved about 85% test accuracy
  • RandomForestClassifier achieved about 92% test accuracy
  • GBTClassifier achieved about 86% test accuracy

These models could be improved further by tuning their hyperparameters.
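
A minimal sketch of such tuning with CrossValidator (the grid values are illustrative, and rfc, train_data, and the evaluator import come from the cells above):

# Sketch: 3-fold cross-validated grid search over the random forest
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
grid = (ParamGridBuilder()
        .addGrid(rfc.numTrees, [20, 50, 100])
        .addGrid(rfc.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator = rfc, estimatorParamMaps = grid,
                    evaluator = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy'),
                    numFolds = 3)
cv_model = cv.fit(train_data)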