Comparative Analysis of Performance Metrics for Machine Learning Classifiers with a Focus on Alzheimer’s Disease Data

Alzheimer's disease is a brain disease that causes memory loss. It usually affects people over 60 years of age. The literature shows that the disease is difficult to diagnose, so researchers are trying to predict it at an early stage. This paper proposes a framework to classify Alzheimer's patients and to identify the best classification algorithm. The BestFirst and CfsSubsetEval methods are used for feature selection. Multi-class classification is performed using four machine learning algorithms: naïve Bayes, logistic regression, SVM (trained with SMO) and random forest. Their classification accuracies are 67.68%, 84.58%, 87.42% and 88.90% respectively, under 10-fold cross-validation. A confusion matrix is then generated and class-wise performance is analysed to find the best algorithm. The ADNI database is used for the implementation. For comparison, the OASIS dataset is applied to the model with the same algorithms, giving accuracies of 98%, 99%, 99% and 100% respectively. The model construction time is also compared for both datasets. Finally, the proposed work is compared with existing studies to assess its efficiency.


Introduction
Alzheimer's disease (AD) is a type of brain disorder. It erodes memory gradually and eventually leaves the person unable to think. It is caused by the death of memory cells in the brain: a protein is deposited in the brain in the form of amyloid and tangles, memory cells die, and the brain eventually shrinks, resulting in memory loss. The initial stage of brain damage affects the hippocampus. The next stage of memory deficiency is mild cognitive impairment; widespread death of memory cells then leads to the severe stage and, inevitably, death. AD is the sixth leading cause of death in the United States (National Institute on Aging, 2021).

Stages of Alzheimer's disease
Alzheimer's disease has been categorized into four stages (MCI Screen, 2021); they are shown in Figure 1:

• Cognitive normal (CN)
Cognitive normal is the normal cognitive ageing process (MCI Screen, 2021; Han et al., 2020). People in this category experience healthy ageing and do not have any AD symptoms.

• Early mild cognitive impairment (EMCI)
Early mild cognitive impairment is the early stage of AD (MCI Screen, 2021; Guo et al., 2020). In this stage, small changes from normal cognition are observed. Not all EMCI cases progress to AD: some return to the cognitive normal stage. Thus, this stage is considered harmless.

• Late mild cognitive impairment (LMCI)
Late mild cognitive impairment is the stage that follows EMCI (MCI Screen, 2021; Guo et al., 2020). Most patients in this stage progress to AD; few return to the EMCI stage.

• Alzheimer's disease (AD)
This is the final stage of the memory loss disease (MCI Screen, 2021; Han et al., 2020; Guo et al., 2020). This is an incurable stage.

Figure 1. Stages of Alzheimer's disease.
The first stage is CN, which reflects the normal ageing process; in this stage, people can think and react. The next stage is EMCI; it affects the medial temporal lobe in the hippocampus and exhibits symptoms of short-term memory loss. It progresses to the next level, LMCI, which affects the lateral and parietal lobes of the brain. Symptoms of this stage include reading problems, poor object recognition and a poor sense of direction. The next stage is AD, which affects the front temporal lobe and the occipital lobe of the brain. Symptoms of this stage include poor judgment, impulsivity, a short attention span and visual problems. The first level of AD starts in the entorhinal cortex and the hippocampus, and then slowly spreads to other parts of the brain. Researchers have identified that 80% of people with MCI progress to AD within 6 years. The Mayo Clinic found that 15-20% of MCI patients progress to AD each year and the progression rate is 1-2% (MCI Screen, 2021).

Literature review
A pre-training model was introduced by Han et al. (2020) for extraction and transfer of features with a focus on age-related attributes. The result was compared with 8 classification models and showed that this is a competitive model for predicting MCI to AD progression. The future plan was to explore the model performance on multiple neuroimaging attributes. Guo et al. (2020) introduced a cerebral similarity network using the sparse regression algorithm. In the first level, a new dynamic morphological feature was defined, and the network was constructed based on it. SVM classification with leave-one-out cross-validation was performed. The accuracy of this method was 92.31%, and it was considered a more sensitive biomarker for predicting MCI to AD progression. In future work, the validation would be done on a larger dataset.
To find the different stages in AD, the BLS diagnosing model and convolutional variants were used by Gao et al. (2020) and Sivakani and Ansari (2020). The validity of the model was evaluated using MRI data images from the ADNI database. The accuracy of the proposed model was compared with the PCA-SVM and the InceptionNet techniques and proved to be the best model. Feature extraction was also a focus of the paper. Syed et al. (2020) proposed an ensemble classification model using the linear SVM and LR algorithms. Recursive feature elimination (RFE) and L1 regularization method were applied for feature selection. An MCI dataset was used for the implementation and the biomarkers focused on were cystatin, matrix metalloproteinases and the tau protein. The proposed model was deployed as a web-based application for the detection of early AD. You et al. (2020) proposed a cascade neural network for faster and more accurate classification. It was a two-step process: feature extraction and classification. Using a sensor, EEG data were collected from patients. Three-way classification was done and the proposed model gave the best results.
A convolutional auto-encoder was proposed by Oh et al. (2019) for binary classification. An unsupervised learning technique was applied for the AD and NC classification, and a supervised transfer learning technique was applied for the pMCI and sMCI classification. Gradient-based visualization was done. The accuracy for AD classification and pMCI classification was 86.60% and 73.95% respectively.
A novel framework was proposed by Gupta et al. (2019), based on machine learning techniques. The biomarkers used for the classification were a combination of FDG-PET, sMRI, CSF and APOE. The ADNI database was used for the study. A multi-classification was done using the kernel SVM classifier and the grid-search method. The AU-ROC of this model gave better results than other state-of-the-art methods.
Eitel and Ritter (2019) trained a CNN model for checking the robustness of the methods of gradient input, guided back-propagation, layer-wise relevance propagation and occlusion for AD classification. A visual comparison of the methods was made. Li et al. (2018) introduced the MKSCDDL algorithm for AD classification, and the results were better than other state-of-the-art methods.
Martinez-Murcia et al. (2020) proposed a deep convolutional autoencoder (CAE) model for AD diagnosis. Feature extraction and classification were done using the regression method. A classification accuracy of more than 80% was achieved. Maqsood et al. (2019) developed an efficient classification method using a pre-trained network, AlexNet, and CNN, and the performance was evaluated using the OASIS database. The accuracy for multi-class classification was 92.85%. Nozadi et al. (2018) designed a pipeline for group classification with learned features of PET images, and the evaluation was made on the ADNI database with an accuracy of 91.2%. A novel grading biomarker was developed by Tong et al. (2017) using sparse representation techniques with selected features, age and cognitive measurements to provide a more accurate prediction of progression from MCI to AD. The AUC range of this biomarker was 84-92%. Zhang et al. (2019) developed a multi-state Markov model to predict progression from MCI to AD. They focused on a time axis instead of the age attribute. They also addressed the internal censoring problem in longitudinal data. Further effort focused on other data types such as genetic data, PET scans, DTI and fMRI, and on other related research questions. Li et al. (2019) came up with a biomarker using auxiliary data of AD and normal control subjects. In the first step, a projection vector was obtained and then the projection vector was integrated with a self-weight grading to develop the novel biomarker. Finally, the biomarker was developed from multiple morphological features to predict progression from MCI to AD.
A novel EEG-based method was introduced by Mammone et al. (2018) for evaluation of MCI subjects. A dissimilarity matrix was developed using the coupling strength of each pair of EEG signals; then, hierarchical clustering was applied to the related electrodes. Wavelet coherence (WC) and permutation Jaccard distance (PJD) coupling were introduced. Twenty-five MCI patients were involved in the test; after three months, 4 subjects were observed to have progressed to AD and showed connectivity density reduction. The remaining patients did not manifest such behaviour. Leandrou et al. (2018) reviewed the methods employed in studying progression from MCI to AD. They identified that the entorhinal cortex provides a better classification and predictor for progression from MCI to AD. Minhas et al. (2018) and Sivakani and Ansari (2020) developed an autoregressive model with 3 arrangements of longitudinal data and performed a test. They then estimated future biomarker values and performed an SVM classification for predicting progression from MCI to AD. Five-fold cross-validation was done; the AUC values generated were 88.93% and 88.13% for 2 years and 3 years of progression from MCI to AD. The study covered only 3 years, and feature selection and boosting algorithms were planned as future work. Minhas et al. (2017) introduced a supervised non-parametric method for the classification of a longitudinal dataset for progression from MCI to AD. The similarity between the clusters was estimated using the Euclidean space and the feature values were selected using the linear regression method. Leave-one-out cross-validation was done; the accuracy and the precision values obtained were 93.33% and 89.66%. Missing data computation and validation with other biomarkers were identified as future work. Li et al. (2015) developed diagnosing models to find the different stages of AD using BLS and convolutional variants.
MRI images were collected from the ADNI database and validated using the proposed algorithm. The outcome of this new algorithm was compared with a state-of-the-art algorithm for accuracy and training time; the authors concluded that the new proposed model was better than the compared algorithm.
A comparative analysis has been made to find defects in prediction. Logistic regression, naïve Bayes and decision tree classifiers have been used for prediction of cost-sensitive classification (Moser et al., 2008;Tuan et al., 2022;Weakley et al., 2015). Bari Antor et al. (2021) carried out a comparative result analysis with various machine learning techniques for prediction of dementia disease, and concluded that the support vector machine model is the best for dementia prediction for the OASIS dataset. The machine learning models considered were support vector machine, logistic regression, decision tree and random forest. Mahyoub et al. (2018) deployed various machine learning models on the ADNI dataset to rank the risk factors of Alzheimer's disease. Bansal et al. (2018) carried out a comparative study for the ADNI dataset using J48, multi-layer perceptron, Naïve Bayes and random forest algorithms. The conclusion was that the J48 algorithm is best for the detection of dementia. Segovia et al. (2012) compared two machine learning approaches to predicting Alzheimer's disease.
It can be observed from the literature that classifications of Alzheimer's disease have been done using various machine learning algorithms and the performance of the models has been analysed using one or two algorithms, but in our model, the model performance is analysed using four algorithms. The proposed model is compared with an existing model (Bari Antor et al., 2021) and shows better performance. This paper aims to develop a framework for classifying the various stages of Alzheimer's disease and to analyse the performance of the model using various evaluation metrics.
This paper focuses on the following tasks. We perform a classification of Alzheimer's disease using the naïve Bayes, logistic regression, SVM and random forest algorithms. We use the ADNI database for the classification and carry out a 10-fold cross-validation. The class-wise performance is analysed. We compare the performance of the algorithms and the model building time for the ADNI dataset with those for the OASIS dataset. The classification accuracy is compared and the best algorithm is identified using the confusion matrix components. Finally, we compare the model performance with an existing model.

Proposed Model
The proposed model is shown in Figure 2. It has seven steps. First, the ADNI dataset is taken as input, pre-processed and subjected to feature selection. In the pre-processing stage, duplicate data are removed. After feature selection, the attributes selected are MMSE, Education, APOE and Age. The naïve Bayes, logistic regression, SVM and random forest (RF) algorithms are then used for the classification, with 10-fold cross-validation. Finally, the results are compared to find the best classifier.

Methodology
The methodology used in this proposed model is discussed below with a detailed description.

Machine learning algorithm
Machine learning algorithms build a model from the dataset according to the problem at hand (Bari Antor et al., 2021). In this paper, the following algorithms are used for the classification: naïve Bayes, logistic regression, SVM (trained with SMO) and random forest.

Naïve Bayes
The naïve Bayes algorithm classifies the dataset using the Bayes rule. Based on the probabilities observed in the training data, the classification is made using all the features. It is a supervised learning algorithm (Bari Antor et al., 2021). The classification is made based on the probability in Equation (1):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$

where $P(A \mid B)$ is the conditional probability of A given B, $P(B \mid A)$ is the conditional probability of B given A, $P(A)$ is the probability of event A, and $P(B)$ is the probability of event B.
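As a rough illustration of Equation (1), the sketch below applies the Bayes rule directly and then uses it for a toy categorical naïve Bayes prediction. The class priors, the single `mmse_band` feature and its likelihood values are hypothetical placeholders, not values taken from the ADNI data.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


def naive_bayes_predict(priors, likelihoods, features):
    """Return the most probable class and the normalised class posteriors.

    priors:      {class_label: P(class)}
    likelihoods: {class_label: [ {feature_value: P(value | class)}, ... ]}
    features:    observed value of each feature, in the same order
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for j, value in enumerate(features):
            # Features are assumed conditionally independent given the class.
            score *= likelihoods[label][j].get(value, 1e-9)
        scores[label] = score
    evidence = sum(scores.values())  # P(B), summed over all classes
    posteriors = {label: s / evidence for label, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical two-class example with one categorical feature (mmse_band).
priors = {"CN": 0.5, "AD": 0.5}
likelihoods = {
    "CN": [{"high": 0.8, "low": 0.2}],
    "AD": [{"high": 0.3, "low": 0.7}],
}
label, probs = naive_bayes_predict(priors, likelihoods, ["low"])
```

The denominator P(B) is the same for every class, so only the numerators need to be compared to pick the predicted class; normalising is needed only when the posterior probabilities themselves are reported.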

Logistic regression
The logistic regression algorithm is used for the classification task. This algorithm works based on probability. It is a supervised learning algorithm (Bari Antor et al., 2021). The classification is made based on the hypothesis function in Equation (2):

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \qquad (2)$$

where $h_\theta(x)$ is the hypothesis function, $\theta$ is the parameter vector and $x$ is the feature vector.
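A minimal sketch of the hypothesis function follows, assuming the standard binary sigmoid form; the parameter and feature values are hypothetical. A multi-class problem such as the four-stage classification here would need a multinomial or one-vs-rest extension, which is not shown.

```python
import math


def hypothesis(theta, x):
    """Sigmoid of the linear score theta . x, giving a value in (0, 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x, threshold=0.5):
    """Assign the positive class when the estimated probability reaches the threshold."""
    return 1 if hypothesis(theta, x) >= threshold else 0


# Hypothetical parameters: the first input is a constant 1.0 acting as the bias term.
theta = [-1.0, 0.5]
probability = hypothesis(theta, [1.0, 4.0])  # linear score = -1.0 + 0.5 * 4.0 = 1.0
```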

Support vector machines (SVM)
The SVM algorithm is a supervised classification algorithm (Bari Antor et al., 2021) that works by separating the classes. The SVM is trained using sequential minimal optimization (SMO). The classification is made based on Equation (3).
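To make the idea of separating classes concrete, here is a small sketch of the decision rule of an already trained linear SVM for two classes. It assumes a linear kernel and uses hypothetical weights and bias; it does not implement SMO training and is not necessarily the form used in the paper.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def svm_predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical hyperplane x1 - x2 = 0 in a two-feature space.
weights, bias = [1.0, -1.0], 0.0
side_a = svm_predict(weights, bias, [2.0, 1.0])
side_b = svm_predict(weights, bias, [1.0, 2.0])
```

For more than two classes, implementations typically combine several binary classifiers of this kind, for example in a pairwise scheme.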

Random forest
The random forest algorithm is a classification algorithm that classifies the dataset using ensemble learning. Training is based on bagging. It is a supervised learning algorithm (Bari Antor et al., 2021). The classification is made using Equation (4), the mean squared error:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - y_i\right)^2 \qquad (4)$$

where $N$ is the number of instances, $f_i$ is the result returned by the model, and $y_i$ is the actual value for instance $i$.
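The sketch below illustrates three ingredients mentioned for random forest: bootstrap sampling (the basis of bagging), majority voting across trees, and the mean squared error between returned and actual values. The tree-building step is omitted, and the tree votes and numeric values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree trains on one such sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


def mean_squared_error(predicted, actual):
    """Average of (f_i - y_i)^2 over the N instances."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((f - y) ** 2 for f, y in zip(predicted, actual)) / n


rng = random.Random(0)                       # fixed seed for repeatability
sample = bootstrap_sample(list(range(10)), rng)
vote = majority_vote(["AD", "LMCI", "AD"])   # hypothetical votes from three trees
error = mean_squared_error([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
```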

Results and Discussion
The classification is made on the ADNI preprocessed dataset using the machine learning models. The performance is analysed and a comparison is made with the OASIS dataset.

Data pre-processing
Pre-processing is an essential step for processing a dataset (Sivakani & Ansari, 2020). The pre-processing is done on the ADNI dataset. In this dataset, there are 1534 instances. It is identified that some of the instances are repeated, so the duplicated data are removed using the "remove duplicates" function in the Weka tool; then the instances obtained for classification are 1343. The classification is implemented on these 1343 instances. In the OASIS dataset, there are 374 instances and all the instances are considered for evaluation.
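The duplicate-removal step can be sketched as follows. This is a plain Python illustration of the idea, not the Weka filter itself, and the rows shown are hypothetical.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row and preserve the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # rows are compared on all attribute values
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: (age, education_years, mmse, diagnosis)
rows = [
    (74, 16, 29, "CN"),
    (81, 12, 21, "AD"),
    (74, 16, 29, "CN"),  # exact repeat of the first row
]
cleaned = remove_duplicates(rows)
```

Applied to the counts reported above, this kind of filtering would account for the 191 instances removed (1534 minus 1343).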

Feature selection
Feature selection is the process of selecting the input attributes required for processing (Sivakani & Ansari, 2020). The BestFirst and CfsSubsetEval methods are applied and it is found that the PTGENDER and the APOE Genotype attributes are of minimal importance for the classification, so those attributes are removed and the processing is done with the rest of the attributes.
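Correlation-based feature subset evaluation scores a candidate subset by rewarding attributes that correlate with the class and penalising attributes that correlate with each other. A minimal sketch of that merit score is given below, assuming the standard correlation-based feature selection heuristic; the correlation values are hypothetical, and the search strategy that proposes candidate subsets is not shown.

```python
import math


def cfs_merit(feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features:

    k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff))

    feature_class_corr:        correlation of each feature with the class
    mean_feature_feature_corr: average pairwise correlation between features
    """
    k = len(feature_class_corr)
    if k == 0:
        raise ValueError("the subset must contain at least one feature")
    mean_rcf = sum(feature_class_corr) / k
    return k * mean_rcf / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)


# Two equally predictive features: redundancy between them lowers the merit.
independent = cfs_merit([0.5, 0.5], 0.0)
redundant = cfs_merit([0.5, 0.5], 1.0)
```

In this toy case, a fully redundant pair scores no better than a single feature, which is why weakly informative or redundant attributes tend to be dropped.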

Classification using naïve Bayes algorithm
The naïve Bayes algorithm is applied to the dataset and the classification results are displayed in Table 3. It shows that the instances classified correctly and incorrectly are 67.68% and 32.31%. The root mean squared error, the root relative squared error and the kappa statistic values are 0.33, 78.11% and 0.54 respectively. The model for this algorithm was constructed in 0.1 seconds. The validation applied for the classification was 10-fold cross-validation.
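The 10-fold cross-validation procedure can be sketched as below. This is a plain, non-stratified split written for illustration; the seed and the splitting scheme are assumptions, and the tool used in the study may partition the data differently (for example, with stratification).

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1343 is the number of ADNI instances left after duplicate removal.
splits = list(k_fold_indices(1343, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training nine times; the reported accuracy is aggregated over the ten test folds.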

Classification using logistic regression algorithm
The logistic regression algorithm is applied to the dataset and the classification results are displayed in Table 4. It shows that the instances classified correctly and incorrectly are 84.58% and 15.41%. The root mean squared error, the root relative squared error and the kappa statistic values are 0.27, 64.94% and 0.78 respectively. The model for this algorithm was constructed in 1.66 seconds.

Classification using SVM/SMO algorithm
The SVM algorithm is applied to the dataset and the classification results are displayed in Table 5. It shows that the instances classified correctly and incorrectly are 87.42% and 12.58%. The root mean squared error, the root relative squared error and the kappa statistic values are 0.33, 78.77% and 0.54 respectively. The model for this algorithm was constructed in 3.75 seconds.

Classification using random forest algorithm
The random forest algorithm is applied to the dataset and the classification results are displayed in Table 6. It shows that the instances classified correctly and incorrectly are 88.90% and 11.09%. The root mean squared error, the root relative squared error and the kappa statistic values are 0.25, 59.34% and 0.84 respectively. The model for this algorithm was constructed in 0.92 seconds.
Table 7 shows a comparison of the algorithms. The random forest algorithm gives the best results. Figure 3 shows a comparison of the classification accuracy. The graphical representation shows that the random forest algorithm has the highest accuracy.
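The accuracy and kappa statistic reported for each classifier can both be derived from a table of actual versus predicted class counts. The sketch below assumes the standard definition of Cohen's kappa and uses a hypothetical two-class table rather than the study's results.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square count matrix.

    matrix[i][j] is the number of instances of actual class i predicted as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n_classes)
    ) / (total * total)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical counts: rows are actual classes, columns are predicted classes.
example = [
    [40, 10],
    [5, 45],
]
accuracy, kappa = accuracy_and_kappa(example)
```

Kappa is a unitless value of at most 1, where 1 means perfect agreement and 0 means agreement no better than chance, which is why it is reported here without a percent sign.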

Class-wise performance measurement using confusion matrix
The confusion matrix is one of the evaluation tools used to measure the performance of an algorithm (Chicco et al., 2020), and it allows the performance to be evaluated for each class. Its components are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). Correctly predicted positive instances are referred to as TP and correctly predicted negative instances as TN; wrongly predicted positive instances are FP and wrongly predicted negative instances are FN (Chicco et al., 2020; Chicco et al., 2021; Brown, 2018). A good classifier yields TP and TN values greater than its FP and FN values. The ADNI dataset has four classes: CN, LMCI, EMCI and AD. For each class, the TP, TN, FP and FN are calculated.
Table 8 shows the confusion matrix generated for the naïve Bayes algorithm. The positive (correctly classified) and negative (misclassified) instances number 909 and 434 of the 1343 instances, consistent with the 67.68% accuracy reported above. The positive count is greater than the negative count. The coloured part of the table represents the positively predicted values and the rest are the negatively predicted values. The highest and lowest scores for each component are also analysed: the TP value is greatest for the LMCI class and lowest for EMCI; the FP value is greatest for the LMCI class and lowest for CN; the TN value is greatest for the EMCI class and lowest for LMCI; and the FN value is greatest for the CN class and lowest for the AD class.
Table 9 shows the confusion matrix generated for logistic regression. The positive and negative instances predicted are 1136 and 207. The positive count is greater than the negative count. The coloured part of the table represents the positively predicted values and the rest are the negatively predicted values.
Also, the highest score for the metrics is analysed; the TP value is greatest for the LMCI class and lowest for AD; the FP value is greatest for the CN class and lowest for EMCI; the TN value is greatest for the AD class and lowest for LMCI; and the FN value is greatest for the EMCI class and lowest for the CN class.
Also, the highest score for the metrics is analysed; the TP value is greatest for the LMCI class and lowest for AD; the FP value is greatest for the EMCI class and lowest for CN; the TN value is greatest for the AD class and lowest for LMCI; and the FN value is greatest for the LMCI class and lowest for the AD class.
Table 12 shows a comparison of the metrics used to evaluate the best classification algorithm for the ADNI dataset. The evaluation metrics considered are positive (Pos), negative (Neg), receiver operating characteristic (ROC), true positive rate (TPR), false positive rate (FPR), precision (Prec), recall (Rec) and F1 score (F1). When comparing all the metrics for these algorithms, the random forest algorithm gives the best results. Figure 4 shows the positive and negative classification values; the positive value should be high for the best algorithm (Hicks et al., 2022).
To show the efficiency of the proposed model, the OASIS dataset (OASIS, 2021) is applied to the model and the evaluation metrics are tabulated in Table 13, which shows a comparison of the metrics used to evaluate the best classification algorithm for the OASIS dataset. When comparing all the metrics for these algorithms, the random forest algorithm gives the best results.
Figure 5 shows the evaluation metrics of the algorithms for the ADNI dataset. All the evaluation metrics have the highest values for the random forest algorithm. Figure 6 shows the evaluation metrics of the algorithms for the OASIS dataset. Again, all the evaluation metrics have the highest values for the random forest algorithm. For both datasets, the random forest algorithm gives the best results.
The accuracy of each algorithm on both datasets is tabulated in Table 14; the random forest algorithm shows the best results for both datasets. Figure 7 shows the accuracy of the algorithms for the ADNI and OASIS datasets. The random forest algorithm shows the best result among all the algorithms. Figure 8 shows the model construction time of the algorithms for both the ADNI and OASIS datasets. For both datasets, the naïve Bayes algorithm gave the best result. The time (in seconds) taken to construct the model for each algorithm on both datasets is tabulated in Table 15; the naïve Bayes algorithm shows the best results for both datasets. In the classification, the random forest algorithm has the best performance, and when comparing all the evaluation metrics, the random forest algorithm again gives the best result for both datasets. When comparing the time for constructing the model, the naïve Bayes algorithm gives the best results. Considering all the evaluation metrics, the conclusion is that among these four algorithms, the random forest algorithm gives the best results.
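The class-wise components and the derived metrics discussed in this section can be computed from a multi-class confusion matrix in a one-vs-rest fashion, as in the sketch below. The 4x4 counts are hypothetical and are not the values from the paper's tables; rows are taken to be actual classes and columns predicted classes.

```python
def per_class_counts(matrix, labels):
    """Return TP, FP, FN and TN for each class, treating it as 'positive' vs the rest."""
    total = sum(sum(row) for row in matrix)
    counts = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # actual class i, predicted as another class
        fp = sum(row[i] for row in matrix) - tp  # another class, predicted as class i
        tn = total - tp - fn - fp
        counts[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return counts


def safe_div(numerator, denominator):
    """Avoid division by zero for classes that are never present or never predicted."""
    return numerator / denominator if denominator else 0.0


def class_metrics(c):
    """TPR (recall), FPR, precision and F1 from one class's counts."""
    tpr = safe_div(c["TP"], c["TP"] + c["FN"])
    fpr = safe_div(c["FP"], c["FP"] + c["TN"])
    precision = safe_div(c["TP"], c["TP"] + c["FP"])
    f1 = safe_div(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Prec": precision, "Rec": tpr, "F1": f1}


labels = ["CN", "EMCI", "LMCI", "AD"]
matrix = [            # hypothetical counts
    [50, 5, 3, 2],
    [6, 40, 8, 1],
    [4, 7, 60, 9],
    [1, 2, 10, 45],
]
counts = per_class_counts(matrix, labels)
cn_metrics = class_metrics(counts["CN"])
correct = sum(c["TP"] for c in counts.values())  # sum of the diagonal
```

The sum of the per-class TP values equals the number of correctly classified instances, which corresponds to the positive totals quoted for each algorithm in this section.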

Conclusions
In this paper, we performed a classification on the ADNI dataset using four machine learning algorithms: naïve Bayes, logistic regression, SVM and random forest. The dataset had a multi-class label, so multi-class classification was performed; the classes were CN, EMCI, LMCI and AD. The validation applied to the dataset was 10-fold cross-validation. The evaluation metrics were then compared and it was found that the random forest algorithm gives better results. The evaluation metrics considered were positive (Pos), negative (Neg), receiver operating characteristic (ROC), true positive rate (TPR), false positive rate (FPR), precision (Prec), recall (Rec) and F1 score (F1). The correctly classified instances for naïve Bayes, logistic regression, SVM and random forest were 67.68%, 84.58%, 87.42% and 88.90% respectively. From the correctly classified instances, it was found that the random forest algorithm gives the best results.
A confusion matrix was generated and the positive and negative classified values were calculated for the algorithms and also for each class. The evaluation metrics were also compared and it was found that the random forest algorithm gives the best results. The OASIS dataset was applied to the proposed model with the same algorithms and the accuracy rates were 98%, 99%, 99% and 100%. The best results were again achieved by the random forest algorithm. The model construction time was compared for both datasets and the naïve Bayes algorithm gave the best results for this. In the future, we plan to perform classification using other algorithms.