Breast cancer is one of the most common cancers among women worldwide, accounting for 25 percent of all cancer cases and affecting 2.1 million people in 2015. With early diagnosis, patients can significantly increase their chances of survival. The key challenge in cancer detection is how to classify tumors as malignant or benign, and machine learning techniques can dramatically improve the accuracy of diagnosis.
Research indicates that even the most experienced physicians can diagnose cancer with 79 percent accuracy, while machine learning techniques achieve 91 percent correct diagnosis. The first step in the cancer diagnosis process is a fine needle aspirate (FNA), a procedure that simply extracts some of the cells out of the tumor. At that stage we do not yet know whether the tumor is malignant or benign.
A benign tumor is one that does not spread across the patient's body and is relatively safe; a malignant tumor is cancerous, and we need to intervene to stop its growth. The idea is to teach the machine to classify this data and tell us, without any human intervention, whether a tumor is malignant or benign.
Problem: Classify breast tumors as malignant or benign, without human intervention, from measurements taken on cells extracted by fine needle aspirate.
Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set
Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Import Cancer data from the Sklearn library
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# Check cancer dictionaries
cancer.keys()
# Target
print(cancer['target_names'])
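# Note: target 0 corresponds to 'malignant' and target 1 to 'benign',
# matching the order of target_names printed above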
# Features
print(cancer['feature_names'])
# Create dataframe
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
# Analyze dataset
df_cancer.head()
# check data description
df_cancer.describe()
# Check for more info
df_cancer.info()
# Rearrange the columns so that 'target' comes first
df_cancer = df_cancer[['target','mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error',
'fractal dimension error', 'worst radius', 'worst texture',
'worst perimeter', 'worst area', 'worst smoothness',
'worst compactness', 'worst concavity', 'worst concave points',
'worst symmetry', 'worst fractal dimension']]
# Check the correlations between the features with a heatmap
from heatmap import corrplot
plt.figure(figsize=(8, 8))
corrplot(df_cancer.corr(), size_scale=300)
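# Note: corrplot above comes from the third-party 'heatmapz' package
# (pip install heatmapz). If it is not installed, seaborn's built-in
# heatmap, already imported above, gives a similar picture:
plt.figure(figsize = (12, 10))
sns.heatmap(df_cancer.corr(), cmap = 'coolwarm')
plt.show()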
# Visualize 'mean radius', 'mean texture', 'mean area', 'mean perimeter' and 'mean smoothness' against the target
sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'] )
# Visualize the class counts and the relationship between 'mean area' and 'mean smoothness' by target
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 6))
sns.countplot(x = 'target', data = df_cancer, ax = ax1)
ax1.set_xlabel("Target")
ax1.set_ylabel("Count")
sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', alpha = 0.5, data = df_cancer, ax = ax2)
# Define the independent (x) and dependent (y) variables
x = df_cancer.drop(['target'], axis = 1)
y = df_cancer['target']
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
# Scale the features (fit the scaler on the training set only, to avoid data leakage into the test set)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
# Create Machine Learning models
def models(X_train, y_train):
    # Use Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, y_train)

    # Use KNeighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, y_train)

    # Use SVM (Linear Kernel)
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, y_train)

    # Use SVM (RBF Kernel)
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, y_train)

    # Use GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, y_train)

    # Use Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, y_train)

    # Use RandomForestClassifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(criterion = 'entropy', random_state = 0)
    forest.fit(X_train, y_train)

    # Use XGBoost Classifier
    from xgboost import XGBClassifier
    xgb = XGBClassifier()
    xgb.fit(X_train, y_train)

    # Print the training accuracy for each model
    print('model[0] Logistic Regression Training Accuracy:', log.score(X_train, y_train))
    print('model[1] KNeighbors Training Accuracy:', knn.score(X_train, y_train))
    print('model[2] SVC Linear Training Accuracy:', svc_lin.score(X_train, y_train))
    print('model[3] SVC RBF Training Accuracy:', svc_rbf.score(X_train, y_train))
    print('model[4] GaussianNB Training Accuracy:', gauss.score(X_train, y_train))
    print('model[5] Decision Tree Training Accuracy:', tree.score(X_train, y_train))
    print('model[6] Random Forest Training Accuracy:', forest.score(X_train, y_train))
    print('model[7] XGBoost Training Accuracy:', xgb.score(X_train, y_train))

    return log, knn, svc_lin, svc_rbf, gauss, tree, forest, xgb
# Train all the models and show their training accuracy scores
model = models(X_train, y_train)
# Show the confusion matrix and accuracy for all the models on the test data
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
for i in range(len(model)):
    y_pred = model[i].predict(X_test)
    # Extract TN, FP, FN, TP from the confusion matrix
    TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
    test_score = (TP + TN) / (TN + FP + FN + TP)  # equivalent to accuracy_score(y_test, y_pred)
    print('model[{}] Testing Accuracy: {}'.format(i, test_score))
# Compute the confusion matrix for the best model (XGBoost, model[7])
y_predict = model[7].predict(X_test)
cm = confusion_matrix(y_test, y_predict)
# Show confusion matrix
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()
# Check the precision, recall and F1 scores
from sklearn.metrics import f1_score, precision_score, recall_score
UsedFitter = model[7]
y_pred = UsedFitter.predict(X_test)
print('Precision Score:', precision_score(y_test, y_pred))
print('Recall Score:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))
# Apply 10-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = UsedFitter, X = X_train, y = y_train, cv = 10)
print('Mean Accuracy:', accuracies.mean())
print('Standard Deviation:', accuracies.std())
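# The same 10-fold comparison can also be run for every model; this loop is an
# optional extension of the analysis above, reusing only the API already shown
for i, clf in enumerate(model):
    scores = cross_val_score(estimator = clf, X = X_train, y = y_train, cv = 10)
    print('model[{}] CV Mean Accuracy: {:.3f} (std {:.3f})'.format(i, scores.mean(), scores.std()))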
# Get feature importance
importances = pd.DataFrame({'feature': x.columns, 'importance': np.round(UsedFitter.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
print(importances.head(10))
# Visualize the importance
importances.head(10).plot.bar()
plt.show()
The importance of individual features varies from model to model, but in this case we used XGBoost, a method that attempts to accurately predict the target variable by combining the estimates of a set of simpler, weaker models. In this model we can see that the concave points features have a great influence on whether a tumor is malignant or benign.
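To make the boosting idea concrete, here is a minimal sketch of squared-error boosting with decision stumps on the same scaled data. It only illustrates the principle of weak learners fitted to the ensemble's remaining errors; XGBoost's actual algorithm adds regularization, second-order gradients and many other refinements, and the learning rate and number of rounds below are arbitrary choices.
# Minimal boosting sketch: each shallow tree is fit to the residual errors of the ensemble so far
from sklearn.tree import DecisionTreeRegressor
learning_rate = 0.1
n_rounds = 50
pred = np.zeros(len(y_train))  # the ensemble starts from a zero prediction
stumps = []
for _ in range(n_rounds):
    residual = y_train - pred  # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth = 1, random_state = 0)
    stump.fit(X_train, residual)
    stumps.append(stump)
    pred += learning_rate * stump.predict(X_train)
# Classify by thresholding the combined score at 0.5
print('Boosting sketch training accuracy:', np.mean((pred > 0.5) == y_train))
test_pred = np.zeros(len(y_test))
for stump in stumps:
    test_pred += learning_rate * stump.predict(X_test)
print('Boosting sketch testing accuracy:', np.mean((test_pred > 0.5) == y_test))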
These machine learning techniques were able to classify tumors as malignant or benign with about 96% accuracy. They can rapidly evaluate breast masses and classify them in an automated fashion, and early detection of breast cancer can dramatically save lives, especially in the developing world. The approach could be further improved by combining computer vision with these techniques to classify cancer directly from tissue images.
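As a rough illustration of that direction (not part of the analysis above), a small convolutional network could be trained on labeled tissue-image patches. Everything in this sketch is a placeholder assumption: the 50x50 RGB input shape, the layer sizes, and the hypothetical image_patches and labels arrays.
# Hedged sketch only: a small CNN over tissue-image patches
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
cnn = Sequential([
    layers.Conv2D(32, (3, 3), activation = 'relu', input_shape = (50, 50, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation = 'relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation = 'relu'),
    layers.Dense(1, activation = 'sigmoid'),  # benign (1) vs. malignant (0)
])
cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# cnn.fit(image_patches, labels, ...)  # hypothetical arrays of tissue patches and labels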