Case Study: Breast Cancer Detection


Introduction

Breast cancer is one of the most common cancers among women worldwide, accounting for about 25 percent of all cancer cases and affecting 2.1 million people in 2015. Early diagnosis significantly increases the chances of survival. The key challenge in cancer detection is classifying tumors as malignant or benign, and machine learning techniques can dramatically improve the accuracy of that diagnosis.

Research indicates that experienced physicians can diagnose cancer with about 79 percent accuracy, while machine learning techniques achieve 91 percent correct diagnoses. The first step in the diagnostic process is a fine needle aspirate, or a similar procedure that simply extracts some of the cells from the tumor. At that stage we do not yet know whether the tumor is malignant or benign.

A benign tumor is one that is not spreading through the patient's body and is relatively safe; a malignant tumor is cancerous, and we need to intervene to stop its growth. The idea is to teach the machine to classify data and tell us whether a tumor is malignant or benign, in this case without any human intervention.

Problem:

  • Predict if the cancer diagnosis is benign or malignant based on several observations/features

Dataset:

  • Radius (mean of distances from center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension ("coastline approximation" - 1)

Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Libraries and dataset import

In [17]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [18]:
# Import Cancer data from the Sklearn library
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

Data exploration

In [19]:
# Check cancer dictionaries
cancer.keys()
Out[19]:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
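
Any of these keys can be inspected directly; for instance, the 'DESCR' entry holds the full dataset documentation, so a quick peek is a useful sanity check:

# Print the opening of the built-in dataset description
print(cancer['DESCR'][:200])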
In [20]:
# Target
print(cancer['target_names'])
['malignant' 'benign']
In [21]:
# Features
print(cancer['feature_names'])
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
In [22]:
# Create dataframe
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
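
As an aside, recent scikit-learn versions (0.23 and later) can return the data as a DataFrame directly, so the construction above can be replaced by a one-liner, assuming a recent enough version:

# Equivalent construction on scikit-learn >= 0.23
df_cancer = load_breast_cancer(as_frame=True).frame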
In [23]:
# Analyze dataset
df_cancer.head()
Out[23]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0.0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0.0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0.0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0.0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0.0

5 rows × 31 columns

In [24]:
# check data description
df_cancer.describe()
Out[24]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946 0.627417
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061 0.483918
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040 0.000000
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460 0.000000
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040 1.000000
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080 1.000000
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500 1.000000

8 rows × 31 columns

In [42]:
# Check for more info
df_cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   target                   569 non-null    float64
 1   mean radius              569 non-null    float64
 2   mean texture             569 non-null    float64
 3   mean perimeter           569 non-null    float64
 4   mean area                569 non-null    float64
 5   mean smoothness          569 non-null    float64
 6   mean compactness         569 non-null    float64
 7   mean concavity           569 non-null    float64
 8   mean concave points      569 non-null    float64
 9   mean symmetry            569 non-null    float64
 10  mean fractal dimension   569 non-null    float64
 11  radius error             569 non-null    float64
 12  texture error            569 non-null    float64
 13  perimeter error          569 non-null    float64
 14  area error               569 non-null    float64
 15  smoothness error         569 non-null    float64
 16  compactness error        569 non-null    float64
 17  concavity error          569 non-null    float64
 18  concave points error     569 non-null    float64
 19  symmetry error           569 non-null    float64
 20  fractal dimension error  569 non-null    float64
 21  worst radius             569 non-null    float64
 22  worst texture            569 non-null    float64
 23  worst perimeter          569 non-null    float64
 24  worst area               569 non-null    float64
 25  worst smoothness         569 non-null    float64
 26  worst compactness        569 non-null    float64
 27  worst concavity          569 non-null    float64
 28  worst concave points     569 non-null    float64
 29  worst symmetry           569 non-null    float64
 30  worst fractal dimension  569 non-null    float64
dtypes: float64(31)
memory usage: 137.9 KB

Data visualization

In [26]:
# Rearrange columns
df_cancer = df_cancer[['target','mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension']]
In [27]:
# Check correlation in the features by using heatmap
from heatmap import corrplot
plt.figure(figsize=(8, 8))
corrplot(df_cancer.corr(), size_scale=300)
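
Note that corrplot comes from a third-party package (commonly distributed on PyPI as heatmapz), so it may need a separate install. If it is unavailable, seaborn's built-in heatmap gives a comparable, if denser, view:

# Fallback: correlation heatmap using seaborn only
plt.figure(figsize=(20, 12))
sns.heatmap(df_cancer.corr(), annot=True, fmt='.1f', cmap='coolwarm')
plt.show()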
In [28]:
# Visualize 'mean radius', 'mean texture', 'mean area', 'mean perimeter' and 'mean smoothness' relating to 'target'
sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'] )
Out[28]:
<seaborn.axisgrid.PairGrid at 0x20f43f56f70>
In [29]:
# Visualize 'count', 'mean area' and 'mean smoothness' relating to 'target'
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 6))

sns.countplot(x = 'target', data = df_cancer, ax = ax1)
ax1.set_xlabel("Target");
ax1.set_ylabel("Count");

sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', alpha= 0.5, data = df_cancer, ax = ax2)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f4260a940>

Data pre-processing

In [30]:
# Define the independent and dependent variables
x = df_cancer.drop(['target'], axis = 1)
y = df_cancer['target']
In [31]:
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Out[31]:
((455, 30), (455,), (114, 30), (114,))
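
Since the dataset is moderately imbalanced (357 benign vs. 212 malignant cases), passing stratify=y would keep that ratio identical in both splits. A minimal variant is shown below; note it changes the exact split, so the downstream numbers would shift slightly:

# Optional: preserve the benign/malignant ratio in train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.20, random_state = 0, stratify = y)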
In [32]:
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
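
Fitting the scaler on the training set only, as above, avoids leaking test-set statistics into training. A scikit-learn Pipeline bundles scaling and a classifier so that cross-validation re-fits the scaler inside each fold automatically; a minimal sketch with logistic regression:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# The scaler is re-fit on each training fold, never on held-out data
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state = 0))
pipe.fit(x_train, y_train)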

Model creation

In [33]:
# Create Machine Learning models
def models(X_train, y_train):
    
    # Use Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, y_train)
    
    # use KNeighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, y_train)
    
    # Use SVM (Linear Kernel)
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, y_train)
    
    # Use SVM (RBF Kernel)
    from sklearn.svm import SVC
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, y_train)
    
    # Use GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, y_train)
    
    # Use Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, y_train)
    
    # Use RandomforestClassifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(criterion = 'entropy', random_state = 0)
    forest.fit(X_train, y_train)
    
    # Use XGBoost Classifer
    from xgboost import XGBClassifier
    xgb = XGBClassifier()
    xgb.fit(X_train, y_train)
    
    # Print the training accuracy for each model
    print('model[0] Logistic Regression Training Accuracy: ', log.score(X_train, y_train))
    print('model[1] KNNeighbors Training Accuracy: ', knn.score(X_train, y_train))
    print('model[2] SVC Linear Training Accuracy: ', svc_lin.score(X_train, y_train))
    print('model[3] SVC RBF Training Accuracy: ', svc_rbf.score(X_train, y_train))
    print('model[4] Gauss Training Accuracy: ', gauss.score(X_train, y_train))
    print('model[5] Decision Tree  Training Accuracy: ', tree.score(X_train, y_train))
    print('model[6] Random Forest Training Accuracy: ', forest.score(X_train, y_train))
    print('model[7] XGBoost Training Accuracy: ', xgb.score(X_train, y_train))
    
    return log, knn, svc_lin, svc_rbf, gauss, tree, forest, xgb
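
As an aside, the same training loop can be written more compactly by mapping model names to estimators in a dictionary; a sketch of the idea with three of the estimators above (same hyperparameters):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Compact alternative: iterate over named, unfitted estimators
classifiers = {
    'Logistic Regression': LogisticRegression(random_state = 0),
    'KNeighbors': KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
    'SVC Linear': SVC(kernel = 'linear', random_state = 0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, 'Training Accuracy:', clf.score(X_train, y_train))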

Model evaluation

In [34]:
# Show Training accuracy score
model = models(X_train, y_train)
model[0] Logistic Regression Training Accuracy:  0.989010989010989
model[1] KNNeighbors Training Accuracy:  0.978021978021978
model[2] SVC Linear Training Accuracy:  0.989010989010989
model[3] SVC RBF Training Accuracy:  0.9846153846153847
model[4] Gauss Training Accuracy:  0.9472527472527472
model[5] Decision Tree  Training Accuracy:  1.0
model[6] Random Forest Training Accuracy:  1.0
model[7] XGBoost Training Accuracy:  1.0
In [35]:
# Show the confusion matrix and accuracy for all the models on the test data
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

for i in range(len(model)):
    # Extract TN, FP, FN, TP from the confusion matrix
    TN, FP, FN, TP = confusion_matrix(y_test, model[i].predict(X_test)).ravel()
    
    test_score = (TP + TN) / (TN + FP + FN + TP)
    
    print('model[{}] Testing Accuracy: "{}"'.format(i, test_score))
model[0] Testing Accuracy: "0.9649122807017544"
model[1] Testing Accuracy: "0.956140350877193"
model[2] Testing Accuracy: "0.9824561403508771"
model[3] Testing Accuracy: "0.9824561403508771"
model[4] Testing Accuracy: "0.9035087719298246"
model[5] Testing Accuracy: "0.9298245614035088"
model[6] Testing Accuracy: "0.9736842105263158"
model[7] Testing Accuracy: "0.9824561403508771"
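
The manual (TP + TN) / total computation above is shown for transparency; accuracy_score returns the same number directly:

from sklearn.metrics import accuracy_score

# Equivalent one-liner per model
for i in range(len(model)):
    print('model[{}] Testing Accuracy: {}'.format(
        i, accuracy_score(y_test, model[i].predict(X_test))))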
In [36]:
# Compute the confusion matrix for the best-scoring model (XGBoost)
y_predict = model[7].predict(X_test)
cm = confusion_matrix(y_test, y_predict)
In [37]:
# Show confusion matrix
sns.heatmap(cm, annot=True)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f468af310>
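
Since target 0 is malignant and 1 is benign, labeling the axes makes the matrix above easier to read; one possible variant, reusing cm from the previous cell:

# Confusion matrix with human-readable class labels
sns.heatmap(cm, annot = True, fmt = 'd',
            xticklabels = cancer['target_names'],
            yticklabels = cancer['target_names'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()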
In [38]:
# Check f1, precision and recall score
from sklearn.metrics import f1_score, precision_score, recall_score
UsedFitter = model[7]

print('Precision Score:', precision_score(y_test, UsedFitter.predict(X_test)))
print('Recall Score:', recall_score(y_test, UsedFitter.predict(X_test)))
print('f1_score:', f1_score(y_test, UsedFitter.predict(X_test)))
Precision Score: 0.9710144927536232
Recall Score: 1.0
f1_score: 0.9852941176470589
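
classification_report collects precision, recall, and F1 for both classes in a single call; an equivalent summary:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 in one table
print(classification_report(y_test, UsedFitter.predict(X_test),
                            target_names = cancer['target_names']))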
In [39]:
# Apply 10-fold cross-validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = UsedFitter, X = X_train, y = y_train, cv = 10)
print('Mean Accuracy:', accuracies.mean())
print('Standard Deviation:', accuracies.std())
Mean Accuracy: 0.9604347826086956
Standard Deviation: 0.02757472519245102
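
Cross-validation also underpins hyperparameter tuning. A minimal GridSearchCV sketch for the XGBoost model follows; the parameter grid is illustrative, not a tuned choice:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid; the values are assumptions, not tuned settings
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5]}
grid = GridSearchCV(XGBClassifier(), param_grid, cv = 10, scoring = 'accuracy')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)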
In [40]:
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(UsedFitter.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
                      importance
feature                         
worst concave points       0.353
mean concave points        0.211
worst perimeter            0.077
mean radius                0.075
worst area                 0.056
mean area                  0.031
worst concavity            0.027
radius error               0.026
worst texture              0.018
mean symmetry              0.016
In [41]:
# Visualize the importance
importances.head(10).plot.bar()
plt.show()

Conclusion

The importance of individual features varies from model to model. Here we used XGBoost, which predicts the target by combining the estimates of a set of simpler, weaker models. In this model, the concave points features ('worst concave points' and 'mean concave points') have the greatest influence on whether a tumor is classified as malignant or benign.

The machine learning models classified tumors as malignant or benign with up to about 98 percent test accuracy, and XGBoost reached a 96 percent mean accuracy under 10-fold cross-validation. Such techniques can rapidly evaluate breast masses and classify them in an automated fashion. Early breast cancer detection can dramatically save lives, especially in the developing world. The approach could be further improved by combining computer vision and ML techniques to classify cancer directly from tissue images.