The data set contains features of universities, each labeled as either private or public. In this project, three tree-based classification methods from Spark ML (decision tree, random forest, and gradient-boosted trees) will be implemented and their results compared on the universities data set.
# Install pyspark
!pip install pyspark
# Import library
from pyspark.sql import SparkSession
# Create spark session
spark = SparkSession.builder.appName('tree').getOrCreate()
# Import csv file
data = spark.read.csv('/content/College.csv', inferSchema = True, header = True)
# Check schema
data.printSchema()
# Check the first row of data
data.head(1)
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
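As a minimal sketch of this behavior (the small DataFrame and its columns 'x1' and 'x2' are made up for illustration, not part of the College data):
# Hypothetical illustration: combine two numeric columns into one vector column
from pyspark.ml.feature import VectorAssembler
toy_df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['x1', 'x2'])
toy_assembler = VectorAssembler(inputCols = ['x1', 'x2'], outputCol = 'features')
# Each output row gains a 'features' vector such as [1.0, 2.0]
toy_assembler.transform(toy_df).show()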
# Import library
from pyspark.ml.feature import VectorAssembler
# Check data columns
data.columns
# Specify the feature columns to assemble
assembler = VectorAssembler(inputCols = ['Accept',
'Enroll',
'Top10perc',
'Top25perc',
'F_Undergrad',
'P_Undergrad',
'Outstate',
'Room_Board',
'Books',
'Personal',
'PhD',
'Terminal',
'S_F_Ratio',
'perc_alumni',
'Expend',
'Grad_Rate'], outputCol = 'features')
# Transform the data, adding the assembled 'features' vector column
output = assembler.transform(data)
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
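As a minimal sketch of the frequency ordering (the toy DataFrame and its values are made up for illustration):
# Hypothetical illustration: 'a' is the most frequent label, so it maps to index 0.0
from pyspark.ml.feature import StringIndexer
toy_df = spark.createDataFrame([('a',), ('b',), ('a',), ('c',), ('a',)], ['label'])
toy_indexer = StringIndexer(inputCol = 'label', outputCol = 'labelIndex')
toy_indexer.fit(toy_df).transform(toy_df).show()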
# Import library
from pyspark.ml.feature import StringIndexer
# Encodes a string column of labels to a column of label indices
indexer = StringIndexer(inputCol = 'Private', outputCol = 'PrivateIndex')
# Fit the indexer on the data and transform it to add the index column
output_fixed = indexer.fit(output).transform(output)
# Check schema
output_fixed.printSchema()
# Select the independent (features) and dependent (PrivateIndex) variables
final_data = output_fixed.select('features', 'PrivateIndex')
# Split the data for training and testing
train_data, test_data = final_data.randomSplit([0.7, 0.3])
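Note that randomSplit is random, so the results below will vary between runs; passing the optional seed argument, e.g. final_data.randomSplit([0.7, 0.3], seed = 42) (the seed value 42 is arbitrary), makes the split reproducible.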
# Import libraries
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier)
from pyspark.ml import Pipeline
# Create the three tree-based classifiers
dtc = DecisionTreeClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
rfc = RandomForestClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
gbt = GBTClassifier(labelCol = 'PrivateIndex', featuresCol= 'features')
# Fit the models on the training data
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
# Get predictions on the test data
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)
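To see what transform added, it helps to peek at a few rows (the column names below are Spark ML defaults):
# Inspect the label, raw scores, and predicted class for the decision tree
dtc_preds.select('PrivateIndex', 'rawPrediction', 'prediction').show(5)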
# Import library
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Create the evaluator (its default metric is area under the ROC curve)
my_binary_eval = BinaryClassificationEvaluator(labelCol = 'PrivateIndex')
# Show evaluation for the decision tree and the random forest
print('DTC:', my_binary_eval.evaluate(dtc_preds))
print('RFC:', my_binary_eval.evaluate(rfc_preds))
# Check schema: in older Spark versions GBTClassifier does not produce a rawPrediction column
gbt_preds.printSchema()
# Create an evaluator that scores the GBT output using its 'prediction' column instead
my_binary_eval_2 = BinaryClassificationEvaluator(labelCol = 'PrivateIndex', rawPredictionCol = 'prediction')
# Show evaluation
print('GBT:', my_binary_eval_2.evaluate(gbt_preds))
# Import library
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Create an evaluator for the accuracy metric
acc_eval = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', metricName = 'accuracy')
# Compute accuracy on the test predictions
dtc_acc = acc_eval.evaluate(dtc_preds)
rfc_acc = acc_eval.evaluate(rfc_preds)
gbt_acc = acc_eval.evaluate(gbt_preds)
# Show accuracies
print('DTC accuracy:', dtc_acc)
print('RFC accuracy:', rfc_acc)
print('GBT accuracy:', gbt_acc)
These models can be further improved by tuning their hyperparameters, for example via cross-validated grid search, as sketched below.
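A minimal sketch of such tuning for the random forest (the grid values and numFolds here are illustrative choices, not tuned recommendations):
# Import libraries
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Build a small illustrative grid over two random forest hyperparameters
param_grid = (ParamGridBuilder()
              .addGrid(rfc.numTrees, [20, 50, 100])
              .addGrid(rfc.maxDepth, [5, 10])
              .build())
# 3-fold cross-validation, selecting the model with the best area under ROC
cv = CrossValidator(estimator = rfc,
                    estimatorParamMaps = param_grid,
                    evaluator = BinaryClassificationEvaluator(labelCol = 'PrivateIndex'),
                    numFolds = 3)
cv_model = cv.fit(train_data)
# Evaluate the best model found on the held-out test data
print('Tuned RFC:', my_binary_eval.evaluate(cv_model.transform(test_data)))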