Case Study: Quality Prediction In Mining Process

Introduction

In this project, machine learning and deep learning models will be trained to predict the percentage of Silica Concentrate in the iron ore concentrate on a minute-by-minute basis. In practice, this could let the mining industry obtain the percentage of Silica Concentrate much faster than with traditional lab methods.

Freeport-McMoRan is one of the largest mining companies in the U.S. It has made use of custom-built AI models to boost the output of its mining facilities.

Reference: The 4th Industrial Revolution: How Mining Companies Are Using AI, Machine Learning And Robots

Problem:

  • Although the percentage of Silica is measured, it is a lab measurement, which means it takes at least one hour for process engineers to obtain this value. If the amount of impurity in the process could be predicted instead, engineers could act on the process proactively rather than waiting for lab results.
  • Predict the percentage of Silica Concentrate in the iron ore concentrate on a per-minute basis.

Dataset:

  • By leveraging the past 3+ years of data, AI can help the mining facility sustain a faster production pace with no loss of efficiency. Efficiency can be improved by performing operational tweaks to boost output (e.g., copper production could be increased if the mill were fed with more ore per minute).

Source: Kaggle Competition

Libraries And Data Importation

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
In [2]:
# Import data
mining_df = pd.read_csv('mining_data.csv')
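
Note that zipfile is imported above but never used in this cell. If the Kaggle download arrives as an archive, it could be extracted first; a minimal sketch, assuming a hypothetical archive name mining_data.zip:

# Hypothetical: extract the CSV if the dataset ships as a zip archive
with zipfile.ZipFile('mining_data.zip') as archive:
    archive.extractall('.')
mining_df = pd.read_csv('mining_data.csv')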

Data Exploration

In [4]:
# Check data
mining_df.head()
Out[4]:
% Iron Feed % Silica Feed Starch Flow Amina Flow Ore Pulp Flow Ore Pulp pH Ore Pulp Density Flotation Column 01 Air Flow Flotation Column 02 Air Flow Flotation Column 03 Air Flow ... Flotation Column 07 Air Flow Flotation Column 01 Level Flotation Column 02 Level Flotation Column 03 Level Flotation Column 04 Level Flotation Column 05 Level Flotation Column 06 Level Flotation Column 07 Level % Iron Concentrate % Silica Concentrate
0 55.2 16.98 3196.680000 542.694333 396.284000 10.158367 1.668070 249.796333 250.275667 248.668000 ... 250.547000 464.978667 490.450333 443.465000 442.856333 438.782333 452.248333 466.300667 67.06 1.11
1 55.2 16.98 3213.673333 540.649333 397.949333 10.156600 1.664973 249.536000 250.752000 250.968333 ... 249.807000 445.001000 362.894667 442.748333 471.045333 445.239667 443.630667 426.921667 67.06 1.11
2 55.2 16.98 3180.080000 535.929333 397.305000 10.154800 1.661877 249.576000 250.279667 251.001333 ... 249.686667 443.574667 478.916333 432.779333 437.401667 441.761000 490.824667 478.046667 67.06 1.11
3 55.2 16.98 3196.713333 535.102000 397.010667 10.153067 1.658780 249.380333 248.799333 250.241333 ... 249.926333 440.731333 488.994000 452.461333 439.572667 434.027333 457.083667 458.815667 67.06 1.11
4 55.2 16.98 3111.723333 532.735000 395.263667 10.151300 1.655680 249.426667 252.209667 249.243333 ... 249.975667 445.851667 418.860000 462.936667 454.948333 453.571667 446.831667 426.600000 67.06 1.11

5 rows × 23 columns

In [8]:
# Check data statistics
mining_df.describe()
Out[8]:
% Iron Feed % Silica Feed Starch Flow Amina Flow Ore Pulp Flow Ore Pulp pH Ore Pulp Density Flotation Column 01 Air Flow Flotation Column 02 Air Flow Flotation Column 03 Air Flow ... Flotation Column 07 Air Flow Flotation Column 01 Level Flotation Column 02 Level Flotation Column 03 Level Flotation Column 04 Level Flotation Column 05 Level Flotation Column 06 Level Flotation Column 07 Level % Iron Concentrate % Silica Concentrate
count 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 ... 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000
mean 56.294974 14.651438 2869.241181 488.144186 397.577332 9.767534 1.680348 280.166032 277.172893 281.097236 ... 290.774336 520.242050 522.648563 531.355055 420.306805 425.237994 429.927646 421.006767 65.049435 2.327228
std 5.158958 6.808961 1187.990184 90.736360 9.468496 0.387036 0.069213 29.616570 29.936823 28.537193 ... 28.158596 130.389539 127.450562 150.614529 90.566437 83.601851 85.320602 83.736727 1.118479 1.125623
min 42.740000 1.310000 0.074147 241.699632 376.272600 8.753370 1.519829 175.666333 175.923177 176.471917 ... 186.074077 149.451600 211.266111 126.352031 162.293185 167.139620 161.485667 175.908240 62.050000 0.600000
25% 52.670000 8.940000 2073.322500 432.204667 395.212583 9.527157 1.647197 250.268667 250.367333 250.693667 ... 263.524333 413.516320 442.291000 410.134583 356.440167 357.074583 358.078583 356.567833 64.370000 1.440000
50% 56.080000 13.850000 2994.311667 504.510667 399.354833 9.797963 1.697560 299.418000 297.433000 299.048333 ... 299.350833 492.971167 496.380667 494.859500 410.511667 408.022833 419.931167 410.043333 65.210000 2.000000
75% 59.720000 19.600000 3712.951667 553.479083 402.458750 10.037833 1.728257 300.127333 300.435000 300.308667 ... 301.239667 594.960083 595.989167 601.060000 486.533417 485.580833 490.725500 475.922283 65.860000 3.010000
max 65.780000 33.400000 6295.130657 739.422405 418.625439 10.808046 1.853229 372.387588 369.550000 359.948635 ... 370.190800 862.197932 828.593000 886.820204 680.019967 675.571459 698.621871 659.618696 68.010000 5.530000

8 rows × 23 columns

In [5]:
# Check data types
mining_df.dtypes
Out[5]:
% Iron Feed                     float64
% Silica Feed                   float64
Starch Flow                     float64
Amina Flow                      float64
Ore Pulp Flow                   float64
Ore Pulp pH                     float64
Ore Pulp Density                float64
Flotation Column 01 Air Flow    float64
Flotation Column 02 Air Flow    float64
Flotation Column 03 Air Flow    float64
Flotation Column 04 Air Flow    float64
Flotation Column 05 Air Flow    float64
Flotation Column 06 Air Flow    float64
Flotation Column 07 Air Flow    float64
Flotation Column 01 Level       float64
Flotation Column 02 Level       float64
Flotation Column 03 Level       float64
Flotation Column 04 Level       float64
Flotation Column 05 Level       float64
Flotation Column 06 Level       float64
Flotation Column 07 Level       float64
% Iron Concentrate              float64
% Silica Concentrate            float64
dtype: object
In [7]:
# Check the fraction of missing values in each column
mining_df.isnull().mean()
Out[7]:
% Iron Feed                     0.0
% Silica Feed                   0.0
Starch Flow                     0.0
Amina Flow                      0.0
Ore Pulp Flow                   0.0
Ore Pulp pH                     0.0
Ore Pulp Density                0.0
Flotation Column 01 Air Flow    0.0
Flotation Column 02 Air Flow    0.0
Flotation Column 03 Air Flow    0.0
Flotation Column 04 Air Flow    0.0
Flotation Column 05 Air Flow    0.0
Flotation Column 06 Air Flow    0.0
Flotation Column 07 Air Flow    0.0
Flotation Column 01 Level       0.0
Flotation Column 02 Level       0.0
Flotation Column 03 Level       0.0
Flotation Column 04 Level       0.0
Flotation Column 05 Level       0.0
Flotation Column 06 Level       0.0
Flotation Column 07 Level       0.0
% Iron Concentrate              0.0
% Silica Concentrate            0.0
dtype: float64

Data Visualization

In [11]:
# Plot histograms of all features
mining_df.hist(bins = 30, figsize = (20,20), color = 'b')
plt.show()
In [12]:
# Import library (third-party 'heatmap' package: pip install heatmap)
from heatmap import corrplot

# Show correlation in the features by using heatmap
plt.figure(figsize=(12, 8))
corrplot(mining_df.corr(), size_scale=300)
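
The corrplot function above comes from the third-party heatmap package. If it is unavailable, seaborn's built-in heatmap gives a comparable view; a minimal sketch using only the libraries already imported:

# Fallback: correlation heatmap with seaborn only
plt.figure(figsize = (20, 12))
sns.heatmap(mining_df.corr(), cmap = 'coolwarm')
plt.show()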
In [14]:
# Create scatterplot
sns.scatterplot(x = '% Silica Concentrate', y = '% Iron Concentrate', data = mining_df)
plt.show()
In [21]:
# Create scatterplot
sns.scatterplot(x = '% Iron Feed', y = '% Silica Feed', data = mining_df)
plt.show()

Data Preprocessing

In [22]:
# Define the independent variables (features)
df_iron = mining_df.drop(columns = '% Silica Concentrate')

# Define the dependent variable (target)
df_iron_target = mining_df['% Silica Concentrate']
In [23]:
# Check the shape of the independent variables
df_iron.shape
Out[23]:
(245700, 22)
In [24]:
# Check the shape of the dependent variable
df_iron_target.shape
Out[24]:
(245700,)
In [25]:
# Transform the features into a NumPy array
df_iron = np.array(df_iron)

# Transform the target into a NumPy array
df_iron_target = np.array(df_iron_target)
In [30]:
# Reshape the target into a 2-D column vector (StandardScaler expects 2-D input)
df_iron_target = df_iron_target.reshape(-1,1)

# Check dimension
df_iron_target.shape
Out[30]:
(245700, 1)
In [32]:
# Import library
from sklearn.preprocessing import StandardScaler

# Scale the independent variables
scaler_x = StandardScaler()
X = scaler_x.fit_transform(df_iron)


# Scale the dependent variable
scaler_y = StandardScaler()
y = scaler_y.fit_transform(df_iron_target)
In [34]:
# Import library
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
In [37]:
# Check train dataset shape
X_train.shape, y_train.shape
Out[37]:
((196560, 22), (196560, 1))
In [38]:
# Check test dataset shape
X_test.shape, y_test.shape
Out[38]:
((49140, 22), (49140, 1))

Model Training And Evaluation

Linear Regression Model

In [43]:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Train the model
LinearRegression_model = LinearRegression()
LinearRegression_model.fit(X_train, y_train)
Out[43]:
LinearRegression()
In [44]:
# Check Linear Regression model R² score (for regressors, .score returns R², not accuracy)
accuracy_LinearRegression = LinearRegression_model.score(X_test, y_test)
accuracy_LinearRegression
Out[44]:
0.6817924522132923
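
A single train/test split can flatter or penalize a model by chance. A quick hedge, sketched with scikit-learn's cross_val_score (for regressors the default scoring is R²); note that random folds ignore the time ordering of this plant data, so treat the result as a rough check:

# 5-fold cross-validated R² for the linear model
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X, y.ravel(), cv = 5)
print(cv_scores.mean(), cv_scores.std())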

Decision Tree Regressor Model

In [46]:
# Import library
from sklearn.tree import DecisionTreeRegressor

# Train the model
DecisionTree_model = DecisionTreeRegressor()
DecisionTree_model.fit(X_train, y_train)
Out[46]:
DecisionTreeRegressor()
In [48]:
# Check Decision Tree model R² score
accuracy_DecisionTree = DecisionTree_model.score(X_test, y_test)
accuracy_DecisionTree
Out[48]:
0.9827050277688161
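
A score this close to 1 deserves a sanity check: % Iron Concentrate is measured on the same concentrate stream as the target, so the tree may lean heavily on it. A sketch inspecting the fitted tree's impurity-based feature importances (column names taken from the original DataFrame):

# Rank features by the fitted tree's importance
feature_names = mining_df.drop(columns = '% Silica Concentrate').columns
importances = pd.Series(DecisionTree_model.feature_importances_, index = feature_names)
print(importances.sort_values(ascending = False).head(10))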

XGBoost Regressor Model

In [55]:
# Import library
from xgboost import XGBRegressor

# Train the model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
Out[55]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
In [56]:
# Check XGBoost model R² score
xgb.score(X_test, y_test)
Out[56]:
0.9346029693675422
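
If further tuning were desired, XGBRegressor can monitor a validation set during fit and stop early; a minimal sketch in the xgboost 1.x style shown in the output above (in xgboost 2.x, early_stopping_rounds moves to the constructor):

# Carve a validation set out of the training data for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 0)
xgb_es = XGBRegressor(n_estimators = 500, learning_rate = 0.1)
xgb_es.fit(X_tr, y_tr, eval_set = [(X_val, y_val)], early_stopping_rounds = 20, verbose = False)
print(xgb_es.score(X_test, y_test))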

Artificial Neural Network Model

In [58]:
# Import library
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam

# Create model architecture
optimizer = Adam(learning_rate=0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07, amsgrad = False)
ANN_model = keras.Sequential()
ANN_model.add(Dense(250, input_dim = 22, kernel_initializer='normal',activation='relu'))
ANN_model.add(Dense(500,activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1000, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1000, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(500, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(250, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1, activation = 'linear'))

# Compile the model with the Adam optimizer configured above
ANN_model.compile(loss = 'mse', optimizer = optimizer)

# Check model summary
ANN_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 250)               5750      
_________________________________________________________________
dense_1 (Dense)              (None, 500)               125500    
_________________________________________________________________
dropout (Dropout)            (None, 500)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              501000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1000)              1001000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_3 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 250)               125250    
_________________________________________________________________
dropout_4 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 251       
=================================================================
Total params: 2,259,251
Trainable params: 2,259,251
Non-trainable params: 0
_________________________________________________________________
In [61]:
# Initiate model training
history = ANN_model.fit(X_train, y_train, epochs = 5, validation_split = 0.2)
Epoch 1/5
4914/4914 [==============================] - 494s 101ms/step - loss: 0.1768 - val_loss: 0.1447
Epoch 2/5
4914/4914 [==============================] - 420s 85ms/step - loss: 0.1346 - val_loss: 0.1148
Epoch 3/5
4914/4914 [==============================] - 656s 134ms/step - loss: 0.1145 - val_loss: 0.0955
Epoch 4/5
4914/4914 [==============================] - 418s 85ms/step - loss: 0.1001 - val_loss: 0.0820
Epoch 5/5
4914/4914 [==============================] - 363s 74ms/step - loss: 0.0911 - val_loss: 0.0764
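
The fit call returns a History object whose history dict stores the per-epoch losses, so the learning curves can be plotted directly:

# Plot training vs. validation loss per epoch
plt.plot(history.history['loss'], label = 'Training loss')
plt.plot(history.history['val_loss'], label = 'Validation loss')
plt.xlabel('Epoch')
plt.ylabel('MSE (scaled target)')
plt.legend()
plt.show()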
In [63]:
# Evaluate the ANN model (evaluate returns the MSE loss on the scaled test set;
# 1 - MSE is used here as a rough score, not a true accuracy)
result = ANN_model.evaluate(X_test, y_test)
accuracy_ANN = 1 - result
print("Accuracy : {}".format(accuracy_ANN))
1536/1536 [==============================] - 16s 11ms/step - loss: 0.0752
Accuracy : 0.9248149991035461
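
Since 1 - MSE is not a standard accuracy measure, an apples-to-apples comparison with the regressors above would be the ANN's R² on the same scaled test set; a short sketch:

# R² of the ANN, directly comparable to the .score() values above
from sklearn.metrics import r2_score
r2_ANN = r2_score(y_test, ANN_model.predict(X_test))
print(r2_ANN)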

Calculate Regression KPIs

In [72]:
# From the above results, it can be seen that the Decision Tree model outperforms the other models.

# Plot True values and Model Predictions
y_predict = DecisionTree_model.predict(X_test)
plt.plot(y_predict, y_test, '^', color = 'b')
plt.xlabel('Model Predictions')
plt.ylabel('True Values')
plt.show()
In [75]:
# Plot true values and model predictions on the original scale
y_predict_orig = scaler_y.inverse_transform(y_predict.reshape(-1, 1))
y_test_orig = scaler_y.inverse_transform(y_test)
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')
plt.show()
In [74]:
# Import libraries
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

# Check evaluation metrics
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 0.148 
MSE = 0.02202306497665563 
MAE = 0.022432923934771264 
R2 = 0.9827050277688161 
Adjusted R2 = 0.9826972811762089
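
Before concluding, the test-set R² values of the four models can be put side by side; a sketch (the XGBoost and ANN scores are recomputed here since they were not stored in variables above):

# Side-by-side test-set R² for the four models
scores = {'Linear Regression': accuracy_LinearRegression,
          'Decision Tree': accuracy_DecisionTree,
          'XGBoost': xgb.score(X_test, y_test),
          'ANN': r2_score(y_test, ANN_model.predict(X_test))}
plt.bar(list(scores.keys()), list(scores.values()), color = 'b')
plt.ylabel('Test R²')
plt.show()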

Conclusion

The Decision Tree model obtained an R² of roughly 0.98 on the test set and outperformed the ANN model. Although the ANN was only trained for five epochs and could be further improved by adding more layers or training longer, it would still take considerable time and computing power to train. Given the strong evaluation metrics, the Decision Tree model could be deployed to production to save process engineers a significant amount of time.