Case Study: Breast Cancer Detection


Introduction

Breast cancer is one of the most common cancers among women worldwide, accounting for about 25 percent of all cancer cases and affecting 2.1 million people in 2015. Early diagnosis significantly increases the chances of survival. The key challenge in cancer detection is classifying tumors as malignant or benign, and machine learning techniques can dramatically improve the accuracy of that diagnosis.

Research indicates that experienced physicians can diagnose cancer with about 79 percent accuracy, while machine learning techniques achieve 91 percent correct diagnoses. The first step in the diagnostic process is a fine needle aspirate, or a similar procedure that simply extracts some of the cells from the tumor. At that stage we do not yet know whether the tumor is malignant or benign.

A benign tumor is one that is not spreading through the patient's body and is relatively safe; a malignant tumor is cancerous, and we need to intervene to stop its growth. The idea is to teach the machine to classify data and tell us whether a tumor is malignant or benign, in this case without any human intervention.

Problem:

  • Predict if the cancer diagnosis is benign or malignant based on several observations/features

Dataset:

  • Radius (mean of distances from center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension ("coastline approximation" - 1)

Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Libraries and dataset import

In [17]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [18]:
# Import Cancer data from the Sklearn library
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

Data exploration

In [19]:
# Check cancer dictionaries
cancer.keys()
Out[19]:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
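
Any of these keys can be inspected directly; for instance, the 'DESCR' entry holds the full dataset documentation, so a quick peek is a useful sanity check:

# Print the opening of the built-in dataset description
print(cancer['DESCR'][:200])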
In [20]:
# Target
print(cancer['target_names'])
['malignant' 'benign']
In [21]:
# Features
print(cancer['feature_names'])
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
In [22]:
# Create dataframe
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
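
As an aside, recent scikit-learn versions (0.23 and later) can return the data as a DataFrame directly, so the construction above can be replaced by a one-liner, assuming a recent enough version:

# Equivalent construction on scikit-learn >= 0.23
df_cancer = load_breast_cancer(as_frame=True).frame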
In [23]:
# Analyze dataset
df_cancer.head()
Out[23]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0.0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0.0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0.0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0.0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0.0

5 rows × 31 columns

In [24]:
# check data description
df_cancer.describe()
Out[24]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946 0.627417
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061 0.483918
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040 0.000000
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460 0.000000
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040 1.000000
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080 1.000000
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500 1.000000

8 rows × 31 columns

In [42]:
# Check for more info
df_cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   target                   569 non-null    float64
 1   mean radius              569 non-null    float64
 2   mean texture             569 non-null    float64
 3   mean perimeter           569 non-null    float64
 4   mean area                569 non-null    float64
 5   mean smoothness          569 non-null    float64
 6   mean compactness         569 non-null    float64
 7   mean concavity           569 non-null    float64
 8   mean concave points      569 non-null    float64
 9   mean symmetry            569 non-null    float64
 10  mean fractal dimension   569 non-null    float64
 11  radius error             569 non-null    float64
 12  texture error            569 non-null    float64
 13  perimeter error          569 non-null    float64
 14  area error               569 non-null    float64
 15  smoothness error         569 non-null    float64
 16  compactness error        569 non-null    float64
 17  concavity error          569 non-null    float64
 18  concave points error     569 non-null    float64
 19  symmetry error           569 non-null    float64
 20  fractal dimension error  569 non-null    float64
 21  worst radius             569 non-null    float64
 22  worst texture            569 non-null    float64
 23  worst perimeter          569 non-null    float64
 24  worst area               569 non-null    float64
 25  worst smoothness         569 non-null    float64
 26  worst compactness        569 non-null    float64
 27  worst concavity          569 non-null    float64
 28  worst concave points     569 non-null    float64
 29  worst symmetry           569 non-null    float64
 30  worst fractal dimension  569 non-null    float64
dtypes: float64(31)
memory usage: 137.9 KB

Data visualization

In [26]:
# Rearrange columns
df_cancer = df_cancer[['target','mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension']]
In [27]:
# Check correlation in the features by using heatmap
from heatmap import corrplot
plt.figure(figsize=(8, 8))
corrplot(df_cancer.corr(), size_scale=300)
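
Note that corrplot comes from a third-party package (commonly distributed on PyPI as heatmapz), so it may need a separate install. If it is unavailable, seaborn's built-in heatmap gives a comparable, if denser, view:

# Fallback: correlation heatmap using seaborn only
plt.figure(figsize=(20, 12))
sns.heatmap(df_cancer.corr(), annot=True, fmt='.1f', cmap='coolwarm')
plt.show()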
In [28]:
# Visualize 'mean radius', 'mean texture', 'mean area', 'mean perimeter' and 'mean smoothness' relating to 'target'
sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'] )
Out[28]:
<seaborn.axisgrid.PairGrid at 0x20f43f56f70>
In [29]:
# Visualize 'count', 'mean area' and 'mean smoothness' relating to 'target'
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 6))

sns.countplot(x = 'target', data = df_cancer, ax = ax1)
ax1.set_xlabel("Target");
ax1.set_ylabel("Count");

sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', alpha= 0.5, data = df_cancer, ax = ax2)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f4260a940>

Data pre-processing

In [30]:
# Define the independent and dependent variables
x = df_cancer.drop(['target'], axis = 1)
y = df_cancer['target']
In [31]:
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Out[31]:
((455, 30), (455,), (114, 30), (114,))
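
Since the dataset is moderately imbalanced (357 benign vs. 212 malignant cases), passing stratify=y would keep that ratio identical in both splits. A minimal variant is shown below; note it changes the exact split, so the downstream numbers would shift slightly:

# Optional: preserve the benign/malignant ratio in train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.20, random_state = 0, stratify = y)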
In [32]:
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
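
Fitting the scaler on the training set only, as above, avoids leaking test-set statistics into training. A scikit-learn Pipeline bundles scaling and a classifier so that cross-validation re-fits the scaler inside each fold automatically; a minimal sketch with logistic regression:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# The scaler is re-fit on each training fold, never on held-out data
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state = 0))
pipe.fit(x_train, y_train)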

Model creation

In [33]:
# Create Machine Learning models
def models(X_train, y_train):
    
    # Use Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, y_train)
    
    # use KNeighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, y_train)
    
    # Use SVM (Linear Kernel)
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, y_train)
    
    # Use SVM (RBF Kernel)
    from sklearn.svm import SVC
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, y_train)
    
    # Use GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, y_train)
    
    # Use Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, y_train)
    
    # Use RandomforestClassifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(criterion = 'entropy', random_state = 0)
    forest.fit(X_train, y_train)
    
    # Use XGBoost Classifer
    from xgboost import XGBClassifier
    xgb = XGBClassifier()
    xgb.fit(X_train, y_train)
    
    # Print the training accuracy for each model
    print('model[0] Logistic Regression Training Accuracy: ', log.score(X_train, y_train))
    print('model[1] KNNeighbors Training Accuracy: ', knn.score(X_train, y_train))
    print('model[2] SVC Linear Training Accuracy: ', svc_lin.score(X_train, y_train))
    print('model[3] SVC RBF Training Accuracy: ', svc_rbf.score(X_train, y_train))
    print('model[4] Gauss Training Accuracy: ', gauss.score(X_train, y_train))
    print('model[5] Decision Tree  Training Accuracy: ', tree.score(X_train, y_train))
    print('model[6] Random Forest Training Accuracy: ', forest.score(X_train, y_train))
    print('model[7] XGBoost Training Accuracy: ', xgb.score(X_train, y_train))
    
    return log, knn, svc_lin, svc_rbf, gauss, tree, forest, xgb
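
As an aside, the same training loop can be written more compactly by mapping model names to estimators in a dictionary; a sketch of the idea with three of the estimators above (same hyperparameters):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Compact alternative: iterate over named, unfitted estimators
classifiers = {
    'Logistic Regression': LogisticRegression(random_state = 0),
    'KNeighbors': KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
    'SVC Linear': SVC(kernel = 'linear', random_state = 0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, 'Training Accuracy:', clf.score(X_train, y_train))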

Model evaluation

In [34]:
# Show Training accuracy score
model = models(X_train, y_train)
model[0] Logistic Regression Training Accuracy:  0.989010989010989
model[1] KNNeighbors Training Accuracy:  0.978021978021978
model[2] SVC Linear Training Accuracy:  0.989010989010989
model[3] SVC RBF Training Accuracy:  0.9846153846153847
model[4] Gauss Training Accuracy:  0.9472527472527472
model[5] Decision Tree  Training Accuracy:  1.0
model[6] Random Forest Training Accuracy:  1.0
model[7] XGBoost Training Accuracy:  1.0
In [35]:
# Show the confusion matrix and accuracy for all the models on the test data
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

for i in range(len(model)):
    # Extract TN, FP, FN, TP from the confusion matrix
    TN, FP, FN, TP = confusion_matrix(y_test, model[i].predict(X_test)).ravel()
    
    test_score = (TP + TN) / (TN + FP + FN + TP)
    
    print('model[{}] Testing Accuracy: "{}"'.format(i, test_score))
model[0] Testing Accuracy: "0.9649122807017544"
model[1] Testing Accuracy: "0.956140350877193"
model[2] Testing Accuracy: "0.9824561403508771"
model[3] Testing Accuracy: "0.9824561403508771"
model[4] Testing Accuracy: "0.9035087719298246"
model[5] Testing Accuracy: "0.9298245614035088"
model[6] Testing Accuracy: "0.9736842105263158"
model[7] Testing Accuracy: "0.9824561403508771"
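
The manual (TP + TN) / total computation above is shown for transparency; accuracy_score returns the same number directly:

from sklearn.metrics import accuracy_score

# Equivalent one-liner per model
for i in range(len(model)):
    print('model[{}] Testing Accuracy: {}'.format(
        i, accuracy_score(y_test, model[i].predict(X_test))))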
In [36]:
# Compute the confusion matrix for the best-scoring model (XGBoost)
y_predict = model[7].predict(X_test)
cm = confusion_matrix(y_test, y_predict)
In [37]:
# Show confusion matrix
sns.heatmap(cm, annot=True)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f468af310>
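
Since target 0 is malignant and 1 is benign, labeling the axes makes the matrix above easier to read; one possible variant, reusing cm from the previous cell:

# Confusion matrix with human-readable class labels
sns.heatmap(cm, annot = True, fmt = 'd',
            xticklabels = cancer['target_names'],
            yticklabels = cancer['target_names'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()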
In [38]:
# Check f1, precision and recall score
from sklearn.metrics import f1_score, precision_score, recall_score
UsedFitter = model[7]

print('Precision Score:', precision_score(y_test, UsedFitter.predict(X_test)))
print('Recall Score:', recall_score(y_test, UsedFitter.predict(X_test)))
print('f1_score:', f1_score(y_test, UsedFitter.predict(X_test)))
Precision Score: 0.9710144927536232
Recall Score: 1.0
f1_score: 0.9852941176470589
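
classification_report collects precision, recall, and F1 for both classes in a single call; an equivalent summary:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 in one table
print(classification_report(y_test, UsedFitter.predict(X_test),
                            target_names = cancer['target_names']))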
In [39]:
# Apply 10-fold cross-validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = UsedFitter, X = X_train, y = y_train, cv = 10)
print('Mean Accuracy:', accuracies.mean())
print('Standard Deviation:', accuracies.std())
Mean Accuracy: 0.9604347826086956
Standard Deviation: 0.02757472519245102
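
Cross-validation also underpins hyperparameter tuning. A minimal GridSearchCV sketch for the XGBoost model follows; the parameter grid is illustrative, not a tuned choice:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid; the values are assumptions, not tuned settings
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5]}
grid = GridSearchCV(XGBClassifier(), param_grid, cv = 10, scoring = 'accuracy')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)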
In [40]:
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(UsedFitter.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
                      importance
feature                         
worst concave points       0.353
mean concave points        0.211
worst perimeter            0.077
mean radius                0.075
worst area                 0.056
mean area                  0.031
worst concavity            0.027
radius error               0.026
worst texture              0.018
mean symmetry              0.016
In [41]:
# Visualize the importance
importances.head(10).plot.bar()
plt.show()

Conclusion

The importance of individual features varies from model to model. Here we used XGBoost, which predicts the target by combining the estimates of a set of simpler, weaker models. In this model, the concave points features ('worst concave points' and 'mean concave points') have the greatest influence on whether a tumor is classified as malignant or benign.

The machine learning models classified tumors as malignant or benign with up to about 98 percent test accuracy, and XGBoost reached a 96 percent mean accuracy under 10-fold cross-validation. Such techniques can rapidly evaluate breast masses and classify them in an automated fashion. Early breast cancer detection can dramatically save lives, especially in the developing world. The approach could be further improved by combining computer vision and ML techniques to classify cancer directly from tissue images.