
KNN classifier with breast cancer Wisconsin data example

The breast cancer Wisconsin data from the UCI machine learning repository http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 is used here for illustration purposes. The task is to predict whether a case is malignant or benign from the collected features, such as clump thickness, using the KNN classifier:

# KNN Classifier - Breast Cancer 
>>> import numpy as np 
>>> import pandas as pd 
>>> from sklearn.metrics import accuracy_score,classification_report 
>>> breast_cancer = pd.read_csv("Breast_Cancer_Wisconsin.csv")

The first few rows show what the data looks like. The Class column takes the values 2 and 4, which represent the benign and malignant classes, respectively. All the other variables take values between 1 and 10 and are effectively categorical in nature.

Only the Bare_Nuclei variable has missing values, recorded as '?'. The following code replaces them with the most frequent value (category value 1):

>>> breast_cancer['Bare_Nuclei'] = breast_cancer['Bare_Nuclei'].replace('?', np.nan)
>>> breast_cancer['Bare_Nuclei'] = breast_cancer['Bare_Nuclei'].fillna(breast_cancer['Bare_Nuclei'].value_counts().index[0])
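The most-frequent-value imputation above can be tried on a small stand-in column. This is only a sketch: the `toy` Series and its values are made up for illustration and are not taken from the dataset, and the final cast to integers is an extra step that the original code does not perform.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Bare_Nuclei column; the values are invented.
toy = pd.Series(['1', '10', '?', '1', '3', '?', '1'], name='Bare_Nuclei')

# Mark the '?' placeholders as missing.
toy = toy.replace('?', np.nan)

# value_counts() sorts by frequency and skips NaN, so index[0] is the mode.
most_frequent = toy.value_counts().index[0]

# Fill the gaps, then cast to int so the column is numeric (extra step).
toy_filled = toy.fillna(most_frequent).astype(int)
print(toy_filled.tolist())
```

Casting to integers afterwards keeps the column numeric, which makes later steps less dependent on implicit string-to-number conversion.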

Use the following code to convert the classes to a 0/1 indicator for use in the classifier:

>>> breast_cancer['Cancer_Ind'] = 0 
>>> breast_cancer.loc[breast_cancer['Class']==4,'Cancer_Ind'] = 1

The following code drops the variables that add no value to the analysis (the ID number and the class columns) and standardizes the remaining features:

>>> x_vars = breast_cancer.drop(['ID_Number','Class','Cancer_Ind'],axis=1) 
>>> y_var = breast_cancer['Cancer_Ind'] 
>>> from sklearn.preprocessing import StandardScaler 
>>> x_vars_stdscle = StandardScaler().fit_transform(x_vars.values) 
>>> from sklearn.model_selection import train_test_split

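Because KNN is very sensitive to distances, all the columns are standardized before the algorithm is applied. The following code puts the standardized values back into a DataFrame and splits the data 70/30 into train and test sets: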

>>> x_vars_stdscle_df = pd.DataFrame(x_vars_stdscle, index=x_vars.index, columns=x_vars.columns) 
>>> x_train,x_test,y_train,y_test = train_test_split(x_vars_stdscle_df,y_var, train_size = 0.7,random_state=42)
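To see why standardization matters for a distance-based method, the sketch below uses four made-up points with two features on very different scales. The data and the helper names `zscore` and `nearest_to_first` are hypothetical; the standardization is the usual column-wise z-score with the population standard deviation, which is also what StandardScaler computes.

```python
import numpy as np

# Hypothetical data: the second column has a much larger scale than the first.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [10.0, 1100.0],
              [5.0, 5000.0]])

def zscore(a):
    # Column-wise standardization: subtract the mean, divide by the std (ddof=0).
    return (a - a.mean(axis=0)) / a.std(axis=0)

def nearest_to_first(a):
    # Row index of the point closest to row 0 in Euclidean distance.
    d = np.linalg.norm(a[1:] - a[0], axis=1)
    return int(np.argmin(d)) + 1

raw_nn = nearest_to_first(X)          # dominated by the large-scale column
std_nn = nearest_to_first(zscore(X))  # both columns contribute comparably
print(raw_nn, std_nn)
```

On the raw values the nearest neighbor of the first row is decided almost entirely by the second column; after standardization a different row becomes the nearest one.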

The KNN classifier is applied with 3 neighbors. With the Minkowski metric, p=2 selects the 2-norm, also known as Euclidean distance, for finding the nearest neighbors:

>>> from sklearn.neighbors import KNeighborsClassifier 
>>> knn_fit = KNeighborsClassifier(n_neighbors=3,p=2,metric='minkowski') 
>>> knn_fit.fit(x_train,y_train) 
 
>>> print("\nK-Nearest Neighbors - Train Confusion Matrix\n\n", pd.crosstab(y_train, knn_fit.predict(x_train), rownames=["Actual"], colnames=["Predicted"]))
>>> print("\nK-Nearest Neighbors - Train accuracy:", round(accuracy_score(y_train, knn_fit.predict(x_train)), 3))
>>> print("\nK-Nearest Neighbors - Train Classification Report\n", classification_report(y_train, knn_fit.predict(x_train)))

>>> print("\n\nK-Nearest Neighbors - Test Confusion Matrix\n\n", pd.crosstab(y_test, knn_fit.predict(x_test), rownames=["Actual"], colnames=["Predicted"]))
>>> print("\nK-Nearest Neighbors - Test accuracy:", round(accuracy_score(y_test, knn_fit.predict(x_test)), 3))
>>> print("\nK-Nearest Neighbors - Test Classification Report\n", classification_report(y_test, knn_fit.predict(x_test)))
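As a side note on the p=2 setting used in the classifier above, the Minkowski distance of order p reduces to Manhattan distance for p=1 and to Euclidean distance for p=2. The following is a minimal sketch of that formula; the function name `minkowski` and the two example vectors are illustrative choices, not part of the original code.

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p between two 1-D vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

u, v = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski(u, v, p=1)  # |3| + |4|
d_euclidean = minkowski(u, v, p=2)  # sqrt(3^2 + 4^2)
print(d_manhattan, d_euclidean)
```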

The results suggest that KNN classifies the malignant and benign classes very well, obtaining a test accuracy of 97.6 percent with 96 percent recall on the malignant class. The main deficiency of the KNN classifier is that it is computationally intensive during the test phase, because each test observation is compared with all the available observations in the train data. In practice, KNN does not learn anything from the training data; it only stores it. Hence, it is also called a lazy classifier!
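The lazy behaviour can be made concrete with a small brute-force version of the algorithm. This is an illustrative sketch only, not scikit-learn's implementation: the class name `LazyKNN` and the toy points are made up, and ties in the vote are resolved simply by taking the smallest label.

```python
import numpy as np

class LazyKNN:
    """Illustrative brute-force KNN; not scikit-learn's implementation."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; nothing is estimated here.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = []
        for row in X:
            # Every test row is compared with every stored training row.
            dists = np.sqrt(((self.X_ - row) ** 2).sum(axis=1))
            nearest = np.argsort(dists)[:self.k]
            # Majority vote among the labels of the k nearest rows.
            labels, counts = np.unique(self.y_[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5).
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
model = LazyKNN(k=3).fit(X_toy, y_toy)
toy_pred = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(toy_pred)
```

All the work happens in `predict`, and its cost grows with the number of stored training rows, which is the test-phase expense described above.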

The R code for KNN classifier is as follows:

# KNN Classifier 
setwd("D:\\Book writing\\Codes\\Chapter 5") 
breast_cancer = read.csv("Breast_Cancer_Wisconsin.csv") 

# Column Bare_Nuclei has some missing values marked with "?"; we replace them
# with the median value, as Bare_Nuclei is a discrete variable
breast_cancer$Bare_Nuclei = as.character(breast_cancer$Bare_Nuclei)
breast_cancer$Bare_Nuclei[breast_cancer$Bare_Nuclei=="?"] = NA
breast_cancer$Bare_Nuclei = as.integer(breast_cancer$Bare_Nuclei)
# The median is taken on the numeric values; on character data it would sort as text
breast_cancer$Bare_Nuclei[is.na(breast_cancer$Bare_Nuclei)] = as.integer(median(breast_cancer$Bare_Nuclei,na.rm = TRUE))
# Classes are 2 & 4 for benign & malignant respectively; we have converted them
# to a zero-one problem, as it is easier to work with in models
breast_cancer$Cancer_Ind = 0
breast_cancer$Cancer_Ind[breast_cancer$Class==4]=1
breast_cancer$Cancer_Ind = as.factor( breast_cancer$Cancer_Ind) 

# We have removed the unique ID number from modeling, as unique numbers do not provide value in modeling
# In addition, the original class variable is also removed, as it has been replaced with the derived variable

remove_cols = c("ID_Number","Class") 
breast_cancer_new = breast_cancer[,!(names(breast_cancer) %in% remove_cols)] 

# Setting the seed value to produce reproducible results
# A 70-30 split has been made on the data

set.seed(123) 
numrow = nrow(breast_cancer_new) 
trnind = sample(1:numrow,size = as.integer(0.7*numrow)) 
train_data = breast_cancer_new[trnind,] 
test_data = breast_cancer_new[-trnind,] 

# Following is classical code for computing accuracy, precision & recall

frac_trzero = (table(train_data$Cancer_Ind)[[1]])/nrow(train_data)
frac_trone = (table(train_data$Cancer_Ind)[[2]])/nrow(train_data)

frac_tszero = (table(test_data$Cancer_Ind)[[1]])/nrow(test_data)
frac_tsone = (table(test_data$Cancer_Ind)[[2]])/nrow(test_data)

prec_zero <- function(act,pred){ tble = table(act,pred)
return( round( tble[1,1]/(tble[1,1]+tble[2,1]),4) ) } 

prec_one <- function(act,pred){ tble = table(act,pred)
return( round( tble[2,2]/(tble[2,2]+tble[1,2]),4) ) } 

recl_zero <- function(act,pred){tble = table(act,pred) 
return( round( tble[1,1]/(tble[1,1]+tble[1,2]),4) ) } 

recl_one <- function(act,pred){ tble = table(act,pred) 
return( round( tble[2,2]/(tble[2,2]+tble[2,1]),4) ) } 

accrcy <- function(act,pred){ tble = table(act,pred) 
return( round((tble[1,1]+tble[2,2])/sum(tble),4)) } 

# Importing the class package, which provides the knn function
library(class)

# Choosing a sample k-value of 3 & applying it on train & test data respectively

k_value = 3 
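# Use only the feature columns; the target Cancer_Ind must not be passed as a predictor
feat_cols = setdiff(names(train_data),"Cancer_Ind")
tr_y_pred = knn(train_data[,feat_cols],train_data[,feat_cols],train_data$Cancer_Ind,k=k_value)
ts_y_pred = knn(train_data[,feat_cols],test_data[,feat_cols],train_data$Cancer_Ind,k=k_value)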

# Calculating confusion matrix, accuracy, precision & recall on train data

tr_y_act = train_data$Cancer_Ind;ts_y_act = test_data$Cancer_Ind
tr_tble = table(tr_y_act,tr_y_pred) 
print(paste("Train Confusion Matrix")) 
print(tr_tble) 

tr_acc = accrcy(tr_y_act,tr_y_pred) 
trprec_zero = prec_zero(tr_y_act,tr_y_pred); trrecl_zero = recl_zero(tr_y_act,tr_y_pred) 
trprec_one = prec_one(tr_y_act,tr_y_pred); trrecl_one = recl_one(tr_y_act,tr_y_pred) 
trprec_ovll = trprec_zero *frac_trzero + trprec_one*frac_trone
trrecl_ovll = trrecl_zero *frac_trzero + trrecl_one*frac_trone

print(paste("KNN Train accuracy:",tr_acc)) 
print(paste("KNN - Train Classification Report"))
print(paste("Zero_Precision",trprec_zero,"Zero_Recall",trrecl_zero))
print(paste("One_Precision",trprec_one,"One_Recall",trrecl_one))
print(paste("Overall_Precision",round(trprec_ovll,4),"Overall_Recall",round(trrecl_ovll,4))) 

# Calculating confusion matrix, accuracy, precision & recall on test data

ts_tble = table(ts_y_act, ts_y_pred) 
print(paste("Test Confusion Matrix")) 
print(ts_tble) 

ts_acc = accrcy(ts_y_act,ts_y_pred) 
tsprec_zero = prec_zero(ts_y_act,ts_y_pred); tsrecl_zero = recl_zero(ts_y_act,ts_y_pred) 
tsprec_one = prec_one(ts_y_act,ts_y_pred); tsrecl_one = recl_one(ts_y_act,ts_y_pred) 

tsprec_ovll = tsprec_zero *frac_tszero + tsprec_one*frac_tsone
tsrecl_ovll = tsrecl_zero *frac_tszero + tsrecl_one*frac_tsone

print(paste("KNN Test accuracy:",ts_acc)) 
print(paste("KNN - Test Classification Report"))
print(paste("Zero_Precision",tsprec_zero,"Zero_Recall",tsrecl_zero))
print(paste("One_Precision",tsprec_one,"One_Recall",tsrecl_one))
print(paste("Overall_Precision",round(tsprec_ovll,4),"Overall_Recall",round(tsrecl_ovll,4)))
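The overall precision and recall in the R code are the per-class values weighted by each class's share of the actual observations. The sketch below reproduces that arithmetic in Python, to match the earlier listing. The confusion matrix counts are invented for illustration and are not results from the breast cancer data. This weighting is the same idea as the support-weighted average that `classification_report` prints, and the weighted recall always equals the accuracy.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted);
# the counts are invented and are not results from the breast cancer data.
cm = np.array([[90, 10],
               [5, 45]])

support = cm.sum(axis=1)         # actual count per class: [100, 50]
frac = support / support.sum()   # class fractions, like frac_tszero / frac_tsone

precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually in class

overall_precision = float((precision * frac).sum())
overall_recall = float((recall * frac).sum())
accuracy = float(np.diag(cm).sum() / cm.sum())
print(round(overall_precision, 4), round(overall_recall, 4), round(accuracy, 4))
```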