Case Study: Quality Prediction In Mining Process

Introduction

In this project, machine learning and deep learning models will be trained to predict the percentage of Silica Concentrate in the iron ore concentrate on a minute-by-minute basis. In practice, this could let the mining industry obtain the percentage of Silica Concentrate much faster than with traditional lab methods.

Freeport-McMoRan is one of the largest mining companies in the U.S. It has made use of custom-built AI models to boost the output of its mining facilities.

Reference: The 4th Industrial Revolution: How Mining Companies Are Using AI, Machine Learning And Robots

Problem:

  • Although the percentage of Silica is measured, it is a lab measurement, which means it takes at least one hour for process engineers to obtain this value. If the amount of impurity in the process could be predicted instead, engineers could act on the process proactively rather than waiting for lab results.
  • Predict the percentage of Silica Concentrate in the iron ore concentrate on a per-minute basis.

Dataset:

  • By leveraging the past 3+ years of data, AI can help the mining facility sustain a faster production pace with no loss of efficiency. Efficiency can be improved by performing operational tweaks to boost output (e.g., copper production could be increased if the mill were fed with more ore per minute).

Source: Kaggle Competition

Libraries And Data Importation

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
In [2]:
# Import data
mining_df = pd.read_csv('mining_data.csv')
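
Note that zipfile is imported above but never used in this cell. If the Kaggle download arrives as an archive, it could be extracted first; a minimal sketch, assuming a hypothetical archive name mining_data.zip:

# Hypothetical: extract the CSV if the dataset ships as a zip archive
with zipfile.ZipFile('mining_data.zip') as archive:
    archive.extractall('.')
mining_df = pd.read_csv('mining_data.csv')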

Data Exploration

In [4]:
# Check data
mining_df.head()
Out[4]:
% Iron Feed % Silica Feed Starch Flow Amina Flow Ore Pulp Flow Ore Pulp pH Ore Pulp Density Flotation Column 01 Air Flow Flotation Column 02 Air Flow Flotation Column 03 Air Flow ... Flotation Column 07 Air Flow Flotation Column 01 Level Flotation Column 02 Level Flotation Column 03 Level Flotation Column 04 Level Flotation Column 05 Level Flotation Column 06 Level Flotation Column 07 Level % Iron Concentrate % Silica Concentrate
0 55.2 16.98 3196.680000 542.694333 396.284000 10.158367 1.668070 249.796333 250.275667 248.668000 ... 250.547000 464.978667 490.450333 443.465000 442.856333 438.782333 452.248333 466.300667 67.06 1.11
1 55.2 16.98 3213.673333 540.649333 397.949333 10.156600 1.664973 249.536000 250.752000 250.968333 ... 249.807000 445.001000 362.894667 442.748333 471.045333 445.239667 443.630667 426.921667 67.06 1.11
2 55.2 16.98 3180.080000 535.929333 397.305000 10.154800 1.661877 249.576000 250.279667 251.001333 ... 249.686667 443.574667 478.916333 432.779333 437.401667 441.761000 490.824667 478.046667 67.06 1.11
3 55.2 16.98 3196.713333 535.102000 397.010667 10.153067 1.658780 249.380333 248.799333 250.241333 ... 249.926333 440.731333 488.994000 452.461333 439.572667 434.027333 457.083667 458.815667 67.06 1.11
4 55.2 16.98 3111.723333 532.735000 395.263667 10.151300 1.655680 249.426667 252.209667 249.243333 ... 249.975667 445.851667 418.860000 462.936667 454.948333 453.571667 446.831667 426.600000 67.06 1.11

5 rows × 23 columns

In [8]:
# Check data statistics
mining_df.describe()
Out[8]:
% Iron Feed % Silica Feed Starch Flow Amina Flow Ore Pulp Flow Ore Pulp pH Ore Pulp Density Flotation Column 01 Air Flow Flotation Column 02 Air Flow Flotation Column 03 Air Flow ... Flotation Column 07 Air Flow Flotation Column 01 Level Flotation Column 02 Level Flotation Column 03 Level Flotation Column 04 Level Flotation Column 05 Level Flotation Column 06 Level Flotation Column 07 Level % Iron Concentrate % Silica Concentrate
count 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 ... 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000 245700.000000
mean 56.294974 14.651438 2869.241181 488.144186 397.577332 9.767534 1.680348 280.166032 277.172893 281.097236 ... 290.774336 520.242050 522.648563 531.355055 420.306805 425.237994 429.927646 421.006767 65.049435 2.327228
std 5.158958 6.808961 1187.990184 90.736360 9.468496 0.387036 0.069213 29.616570 29.936823 28.537193 ... 28.158596 130.389539 127.450562 150.614529 90.566437 83.601851 85.320602 83.736727 1.118479 1.125623
min 42.740000 1.310000 0.074147 241.699632 376.272600 8.753370 1.519829 175.666333 175.923177 176.471917 ... 186.074077 149.451600 211.266111 126.352031 162.293185 167.139620 161.485667 175.908240 62.050000 0.600000
25% 52.670000 8.940000 2073.322500 432.204667 395.212583 9.527157 1.647197 250.268667 250.367333 250.693667 ... 263.524333 413.516320 442.291000 410.134583 356.440167 357.074583 358.078583 356.567833 64.370000 1.440000
50% 56.080000 13.850000 2994.311667 504.510667 399.354833 9.797963 1.697560 299.418000 297.433000 299.048333 ... 299.350833 492.971167 496.380667 494.859500 410.511667 408.022833 419.931167 410.043333 65.210000 2.000000
75% 59.720000 19.600000 3712.951667 553.479083 402.458750 10.037833 1.728257 300.127333 300.435000 300.308667 ... 301.239667 594.960083 595.989167 601.060000 486.533417 485.580833 490.725500 475.922283 65.860000 3.010000
max 65.780000 33.400000 6295.130657 739.422405 418.625439 10.808046 1.853229 372.387588 369.550000 359.948635 ... 370.190800 862.197932 828.593000 886.820204 680.019967 675.571459 698.621871 659.618696 68.010000 5.530000

8 rows × 23 columns

In [5]:
# Check data types
mining_df.dtypes
Out[5]:
% Iron Feed                     float64
% Silica Feed                   float64
Starch Flow                     float64
Amina Flow                      float64
Ore Pulp Flow                   float64
Ore Pulp pH                     float64
Ore Pulp Density                float64
Flotation Column 01 Air Flow    float64
Flotation Column 02 Air Flow    float64
Flotation Column 03 Air Flow    float64
Flotation Column 04 Air Flow    float64
Flotation Column 05 Air Flow    float64
Flotation Column 06 Air Flow    float64
Flotation Column 07 Air Flow    float64
Flotation Column 01 Level       float64
Flotation Column 02 Level       float64
Flotation Column 03 Level       float64
Flotation Column 04 Level       float64
Flotation Column 05 Level       float64
Flotation Column 06 Level       float64
Flotation Column 07 Level       float64
% Iron Concentrate              float64
% Silica Concentrate            float64
dtype: object
In [7]:
# Check the fraction of missing values in each column
mining_df.isnull().mean()
Out[7]:
% Iron Feed                     0.0
% Silica Feed                   0.0
Starch Flow                     0.0
Amina Flow                      0.0
Ore Pulp Flow                   0.0
Ore Pulp pH                     0.0
Ore Pulp Density                0.0
Flotation Column 01 Air Flow    0.0
Flotation Column 02 Air Flow    0.0
Flotation Column 03 Air Flow    0.0
Flotation Column 04 Air Flow    0.0
Flotation Column 05 Air Flow    0.0
Flotation Column 06 Air Flow    0.0
Flotation Column 07 Air Flow    0.0
Flotation Column 01 Level       0.0
Flotation Column 02 Level       0.0
Flotation Column 03 Level       0.0
Flotation Column 04 Level       0.0
Flotation Column 05 Level       0.0
Flotation Column 06 Level       0.0
Flotation Column 07 Level       0.0
% Iron Concentrate              0.0
% Silica Concentrate            0.0
dtype: float64

Data Visualization

In [11]:
# Plot histograms of all features
mining_df.hist(bins = 30, figsize = (20,20), color = 'b')
plt.show()
In [12]:
# Import library (third-party 'heatmap' package: pip install heatmap)
from heatmap import corrplot

# Show correlation in the features by using heatmap
plt.figure(figsize=(12, 8))
corrplot(mining_df.corr(), size_scale=300)
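
The corrplot function above comes from the third-party heatmap package. If it is unavailable, seaborn's built-in heatmap gives a comparable view; a minimal sketch using only the libraries already imported:

# Fallback: correlation heatmap with seaborn only
plt.figure(figsize = (20, 12))
sns.heatmap(mining_df.corr(), cmap = 'coolwarm')
plt.show()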
In [14]:
# Create scatterplot
sns.scatterplot(x = '% Silica Concentrate', y = '% Iron Concentrate', data = mining_df)
plt.show()
In [21]:
# Create scatterplot
sns.scatterplot(x = '% Iron Feed', y = '% Silica Feed', data = mining_df)
plt.show()

Data Preprocessing

In [22]:
# Define the independent variables (features)
df_iron = mining_df.drop(columns = '% Silica Concentrate')

# Define the dependent variable (target)
df_iron_target = mining_df['% Silica Concentrate']
In [23]:
# Check the shape of the independent variables
df_iron.shape
Out[23]:
(245700, 22)
In [24]:
# Check the shape of the dependent variable
df_iron_target.shape
Out[24]:
(245700,)
In [25]:
# Transform the features into a NumPy array
df_iron = np.array(df_iron)

# Transform the target into a NumPy array
df_iron_target = np.array(df_iron_target)
In [30]:
# Reshape the target into a 2-D column vector (StandardScaler expects 2-D input)
df_iron_target = df_iron_target.reshape(-1,1)

# Check dimension
df_iron_target.shape
Out[30]:
(245700, 1)
In [32]:
# Import library
from sklearn.preprocessing import StandardScaler

# Scale the independent variables
scaler_x = StandardScaler()
X = scaler_x.fit_transform(df_iron)


# Scale the dependent variable
scaler_y = StandardScaler()
y = scaler_y.fit_transform(df_iron_target)
In [34]:
# Import library
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
In [37]:
# Check train dataset shape
X_train.shape, y_train.shape
Out[37]:
((196560, 22), (196560, 1))
In [38]:
# Check test dataset shape
X_test.shape, y_test.shape
Out[38]:
((49140, 22), (49140, 1))

Model Training And Evaluation

Linear Regression Model

In [43]:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Train the model
LinearRegression_model = LinearRegression()
LinearRegression_model.fit(X_train, y_train)
Out[43]:
LinearRegression()
In [44]:
# Check Linear Regression model R² score (for regressors, .score returns R², not accuracy)
accuracy_LinearRegression = LinearRegression_model.score(X_test, y_test)
accuracy_LinearRegression
Out[44]:
0.6817924522132923
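
A single train/test split can flatter or penalize a model by chance. A quick hedge, sketched with scikit-learn's cross_val_score (for regressors the default scoring is R²); note that random folds ignore the time ordering of this plant data, so treat the result as a rough check:

# 5-fold cross-validated R² for the linear model
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X, y.ravel(), cv = 5)
print(cv_scores.mean(), cv_scores.std())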

Decision Tree Regressor Model

In [46]:
# Import library
from sklearn.tree import DecisionTreeRegressor

# Train the model
DecisionTree_model = DecisionTreeRegressor()
DecisionTree_model.fit(X_train, y_train)
Out[46]:
DecisionTreeRegressor()
In [48]:
# Check Decision Tree model R² score
accuracy_DecisionTree = DecisionTree_model.score(X_test, y_test)
accuracy_DecisionTree
Out[48]:
0.9827050277688161
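
A score this close to 1 deserves a sanity check: % Iron Concentrate is measured on the same concentrate stream as the target, so the tree may lean heavily on it. A sketch inspecting the fitted tree's impurity-based feature importances (column names taken from the original DataFrame):

# Rank features by the fitted tree's importance
feature_names = mining_df.drop(columns = '% Silica Concentrate').columns
importances = pd.Series(DecisionTree_model.feature_importances_, index = feature_names)
print(importances.sort_values(ascending = False).head(10))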

XGBoost Regressor Model

In [55]:
# Import library
from xgboost import XGBRegressor

# Train the model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
Out[55]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
In [56]:
# Check XGBoost model R² score
xgb.score(X_test, y_test)
Out[56]:
0.9346029693675422
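
If further tuning were desired, XGBRegressor can monitor a validation set during fit and stop early; a minimal sketch in the xgboost 1.x style shown in the output above (in xgboost 2.x, early_stopping_rounds moves to the constructor):

# Carve a validation set out of the training data for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 0)
xgb_es = XGBRegressor(n_estimators = 500, learning_rate = 0.1)
xgb_es.fit(X_tr, y_tr, eval_set = [(X_val, y_val)], early_stopping_rounds = 20, verbose = False)
print(xgb_es.score(X_test, y_test))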

Artificial Neural Network Model

In [58]:
# Import library
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam

# Create model architecture
optimizer = Adam(learning_rate=0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07, amsgrad = False)
ANN_model = keras.Sequential()
ANN_model.add(Dense(250, input_dim = 22, kernel_initializer='normal',activation='relu'))
ANN_model.add(Dense(500,activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1000, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1000, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(500, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(250, activation = 'relu'))
ANN_model.add(Dropout(0.1))
ANN_model.add(Dense(1, activation = 'linear'))

# Compile the model with the Adam optimizer configured above
ANN_model.compile(loss = 'mse', optimizer = optimizer)

# Check model summary
ANN_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 250)               5750      
_________________________________________________________________
dense_1 (Dense)              (None, 500)               125500    
_________________________________________________________________
dropout (Dropout)            (None, 500)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              501000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1000)              1001000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_3 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 250)               125250    
_________________________________________________________________
dropout_4 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 251       
=================================================================
Total params: 2,259,251
Trainable params: 2,259,251
Non-trainable params: 0
_________________________________________________________________
In [61]:
# Initiate model training
history = ANN_model.fit(X_train, y_train, epochs = 5, validation_split = 0.2)
Epoch 1/5
4914/4914 [==============================] - 494s 101ms/step - loss: 0.1768 - val_loss: 0.1447
Epoch 2/5
4914/4914 [==============================] - 420s 85ms/step - loss: 0.1346 - val_loss: 0.1148
Epoch 3/5
4914/4914 [==============================] - 656s 134ms/step - loss: 0.1145 - val_loss: 0.0955
Epoch 4/5
4914/4914 [==============================] - 418s 85ms/step - loss: 0.1001 - val_loss: 0.0820
Epoch 5/5
4914/4914 [==============================] - 363s 74ms/step - loss: 0.0911 - val_loss: 0.0764
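
The fit call returns a History object whose history dict stores the per-epoch losses, so the learning curves can be plotted directly:

# Plot training vs. validation loss per epoch
plt.plot(history.history['loss'], label = 'Training loss')
plt.plot(history.history['val_loss'], label = 'Validation loss')
plt.xlabel('Epoch')
plt.ylabel('MSE (scaled target)')
plt.legend()
plt.show()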
In [63]:
# Evaluate the ANN model (evaluate returns the MSE loss on the scaled test set;
# 1 - MSE is used here as a rough score, not a true accuracy)
result = ANN_model.evaluate(X_test, y_test)
accuracy_ANN = 1 - result
print("Accuracy : {}".format(accuracy_ANN))
1536/1536 [==============================] - 16s 11ms/step - loss: 0.0752
Accuracy : 0.9248149991035461
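
Since 1 - MSE is not a standard accuracy measure, an apples-to-apples comparison with the regressors above would be the ANN's R² on the same scaled test set; a short sketch:

# R² of the ANN, directly comparable to the .score() values above
from sklearn.metrics import r2_score
r2_ANN = r2_score(y_test, ANN_model.predict(X_test))
print(r2_ANN)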

Calculate Regression KPIs

In [72]:
# From the above results, it can be seen that the Decision Tree model outperforms the other models.

# Plot True values and Model Predictions
y_predict = DecisionTree_model.predict(X_test)
plt.plot(y_predict, y_test, '^', color = 'b')
plt.xlabel('Model Predictions')
plt.ylabel('True Values')
plt.show()
In [75]:
# Plot true values and model predictions on the original scale
y_predict_orig = scaler_y.inverse_transform(y_predict.reshape(-1, 1))
y_test_orig = scaler_y.inverse_transform(y_test)
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')
plt.show()
In [74]:
# Import libraries
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

# Check evaluation metrics
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 0.148 
MSE = 0.02202306497665563 
MAE = 0.022432923934771264 
R2 = 0.9827050277688161 
Adjusted R2 = 0.9826972811762089
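
Before concluding, the test-set R² values of the four models can be put side by side; a sketch (the XGBoost and ANN scores are recomputed here since they were not stored in variables above):

# Side-by-side test-set R² for the four models
scores = {'Linear Regression': accuracy_LinearRegression,
          'Decision Tree': accuracy_DecisionTree,
          'XGBoost': xgb.score(X_test, y_test),
          'ANN': r2_score(y_test, ANN_model.predict(X_test))}
plt.bar(list(scores.keys()), list(scores.values()), color = 'b')
plt.ylabel('Test R²')
plt.show()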

Conclusion

The Decision Tree model obtained an R² of roughly 0.98 on the test set and outperformed the ANN model. Although the ANN was only trained for five epochs and could be further improved by adding more layers or training longer, it would still take considerable time and computing power to train. Given the strong evaluation metrics, the Decision Tree model could be deployed to production to save process engineers a significant amount of time.