Mild Cognitive Impairment Detection Using Association Rules Mining

Mild cognitive impairment (MCI) is a transitional state between normal cognition and dementia. The typical diagnostic procedure relies on neuropsychological testing, which is insufficiently accurate and does not provide information on patients' clinical profiles. The objective of this paper is to improve the recognition of elderly primary care patients with MCI by using an approach typically applied in market basket analysis: association rules mining. In our case, the association rules represent various combinations of the clinical features or patterns associated with MCI. The analytical process was performed in line with CRISP-DM, a methodology for data mining projects widely used in various research and industry domains. In the data preparation phase, we applied several approaches to improve the data quality, such as the k-Nearest Neighbours, correlation analysis, Chi Merge and K-Means algorithms. The success of the analytical solution was confirmed not only by the novelty and correctness of the new knowledge, but also by a form of visualization that is easily understandable for domain experts. This iterative approach provides a set of rules (patterns) that meet minimum support and reliability. The extracted rules may help medical professionals recognize clinical patterns; however, the final decision depends on the expert. A medical expert has a crucial role in this process by enabling the link between the information contained in the rules and the evidence-based knowledge. It markedly contributes to the interpretability of the results.

likely the progression of memory loss is prevented or delayed (Lazarczyk et al., 2012). However, this concept is not easy to apply in everyday practice. The main concerns are how to recognize patients at high risk of cognitive impairment and how to identify those who will likely progress from MCI to dementia. MCI is a widespread disorder in the elderly population which affects 10-20% of people aged 65 or more (Albert et al., 2011). It can be recognized by using neuropsychological tests, of which the Mini-Mental State Examination (MMSE) is the most widely used in elderly primary care (PC). However, only some of the people who test positive develop clinically overt dementia. Yet, dementia is an emerging diagnosis in modern societies and is increasingly considered a public health problem. This is because it is associated with a rapid loss of individuals' functional independence and the need for long-term nursing home care, which poses a considerable burden on the wealth of their families and a financial burden on healthcare systems (Albert et al., 2011). Moreover, dementia is considered a member of the group of common aging diseases, which also includes hypertension, diabetes, cardiovascular (CV) disease and some cancers, which are known to share a common pathophysiological background and present clinically with overlapping comorbidities (Buchanan, 2006).
Despite the increasing awareness about these issues, the current data modelling procedures are still focused on improving the diagnostic accuracy of the MMSE and other cognitive tests. For this purpose, data from magnetic resonance imaging or other imaging techniques are usually added to the model, while information on patient sociodemographic and clinical features is rarely included (Qiu et al., 2018; Zhang et al., 2014). One of the reasons is that a methodological framework able to capture the complexity of the clinical phenotypes is not yet sufficiently developed.
We aimed to identify clinical features and patterns for patients with MCI, which is assumed to improve the recognition of such patients. This would be of great practical importance for PC providers, who encounter most elderly people in the population and perform the screening procedure for MCI. The MMSE, if performed at population level, is time-consuming and requires high patient engagement. In addition, the ability of the test to accurately distinguish between those with normal cognition and MCI is not sufficiently high (Aevarsson & Skoog, 2000). Furthermore, knowing the clinical features of persons with MCI may indicate the pathogenetic pathways associated with this disorder, which could guide research on the factors and mechanisms of the progression from MCI to dementia. We decided to evaluate experimentally the potential of association rules mining for this task.
The paper is organized as follows. It opens with a short introduction to the disease, our motivation, and the methodology used. In the related work, we present selected existing studies to support our decision to apply association rules mining to this type of task and to eliminate known bottlenecks, such as the initially large number of extracted rules, by using domain knowledge provided by cooperating experts. The performed diagnostic process is described following the CRISP-DM methodology. The conclusion summarizes our results and their usability in primary care.

CRISP-DM methodology
The analytical process was conducted in close cooperation between medical doctors, PC providers, and data analysts. Each of these roles has its own knowledge, experience, and skills, e.g. medical experts provide a relevant interdisciplinary context for heterogeneous data and possible positive diagnoses. Data analytics is an iterative and interactive process that brings new, potentially useful knowledge. It is essential to define a common vocabulary and a framework for cooperation. For this purpose and based on our expertise from different domains, we decided to use the Cross-Industry Standard Process for Data Mining (CRISP-DM), a methodology typically used in the field of data analytics (Chapman et al., 2000; Shearer, 2000). This methodology defines six main phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment (Fig. 1).

Fig. 1. CRISP-DM process model. Source: Authors.
Business understanding deals with the specification of business goals, followed by their transformation into a specific analytical task. Data understanding starts with the collection of the data necessary for the specified task and ends with a detailed description, including some statistical characteristics. Data preparation is usually the most complex and the most time-consuming phase. Generally, it takes 60-75% of the overall time. It contains data aggregation (different data samples), cleaning (missing or incorrect values), reduction (similar attributes, hidden relationships, irrelevant attributes), or transformation (derived attributes, discretization, normalization). Modelling deals with the application of suitable machine learning methods to the pre-processed data. In this step it is also necessary to specify the correct metrics for evaluating the results, e.g. accuracy, ROC, precision, recall, etc. The evaluation phase is oriented towards the evaluation of the generated models and obtained results against the goals specified in business understanding. This phase requires intensive cooperation between data analysts and domain experts. Deployment consists of the use of the created models in real cases, their adaptation and maintenance, and the collection of acquired experience and knowledge.

Association rules mining
The typical representation of an association rule is X => Y, where X and Y are disjoint subsets of the whole set of items I. The quality of each generated rule is evaluated by two metrics: support and confidence. Support indicates how frequently the itemset appears in the dataset. Confidence indicates how often the rule was found to be true, i.e. how often Y appears in the records that contain X. Minimum thresholds for these metrics are set before the Apriori algorithm starts, in order to reduce the search space. The Apriori algorithm was designed by R. Agrawal and R. Srikant in 1994 (Agrawal & Srikant, 1994) to find frequent itemsets in data. Frequent itemsets represent combinations (conjunctions) of attribute categories that meet the minimum support value. The basic approach uses a breadth-first (level-wise) search strategy: candidate itemsets of length k+1 are built only from frequent itemsets of length k, and rules are then derived from the frequent itemsets that meet the minimum confidence (Fig. 2).
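The two metrics and the level-wise search can be illustrated with a minimal Python sketch. It is not the implementation used in this study (which relied on the R package "arules"); the toy transactions, the item labels such as "CRP=high", and the function names are hypothetical and serve only to make the definitions concrete.

```python
from itertools import combinations

# Hypothetical toy data: each patient is a set of attribute=value items.
transactions = [
    {"age=old", "sex=F", "CRP=high", "MCI=yes"},
    {"age=old", "sex=F", "CRP=high", "MCI=yes"},
    {"age=old", "sex=M", "CRP=low", "MCI=no"},
    {"age=young", "sex=F", "CRP=low", "MCI=no"},
    {"age=old", "sex=F", "CRP=low", "MCI=yes"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X and Y together) / support(X) for the rule X => Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)

def apriori(transactions, min_support):
    """Level-wise (breadth-first) search for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        # Join step: unite frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all its k-subsets are frequent
        # and the candidate itself meets the minimum support.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k))
                 and support(c, transactions) >= min_support]
        k += 1
    return frequent

frequent_itemsets = apriori(transactions, min_support=0.6)
```

With these toy data, `confidence({"age=old"}, {"MCI=yes"}, transactions)` divides the support of the combined itemset (3 of 5 records) by the support of the antecedent (4 of 5 records). The prune step is what keeps the search space small: an itemset cannot be frequent if any of its subsets is infrequent.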

Related work
Association rule mining is a popular data mining technique because of its easily interpretable results, and it is used in many research domains. Its typical application is market basket analysis, one of the essential techniques retailers use to identify associations between offered items. We are convinced that this approach can also be successful in the medical diagnosis process. Several existing works confirm this assumption. For example, Sariyer and Tasar (2019) used the Apriori algorithm to extract hidden patterns and relations between diagnosis and diagnostic test requirements in medical data received from an emergency department. The diagnoses were grouped into 21 categories based on the ICD standard. The laboratory tests were grouped into four main categories: hemogram, biochemistry, cardiac enzyme, and urine and human excrement related. An expert evaluated all the extracted rules.
The authors concluded that understanding the association between a patient's diagnosis and diagnostic test requirements can improve decision-making and be used to support physicians. Alwidian et al. (2018) used a weighted classification based on association rules algorithm for predicting breast cancer. The authors also applied a new pruning and prediction technique based on statistical measures to generate more accurate rules. Their model outperformed other associative classification algorithms such as CBA, CMAR, or FACA. Borah and Nath (2018) used association rules mining to generate a new set of rare association rules from updated medical databases to identify the symptoms and risk factors for three adverse diseases: cardiovascular disease, hepatitis, and breast cancer. They addressed the assumption that all necessary data for association rules mining must be available at the beginning: their algorithm can insert or delete a case online without re-executing the entire mining process. Harahap et al. (2018) focused on the importance of an appropriate selection of the required medicine based on the development of the patient's illness. They analyzed patient prescriptions to identify the relationship between the disease and the medicine the physician used in treating the patient's illness. First, they used the k-means algorithm for clustering into 10 diseases and then applied the Apriori algorithm to find association rules based on support, confidence, and lift values. For grouping, they used three variables: age, gender, and disease. The highest number of instances was in cluster 0: unspecified cataract disease. For rules between the top ten diseases (antecedent) and the related medicine (consequent), the minimum support was set to 20% and the minimum confidence to 65%. Lakshmi and Vadivu (2017) extracted association rules from medical health records using multi-criteria decision analysis. They used the lift value to select interesting rules for subsequent clinical validation.
In general, the authors consider this method useful for identifying precise clinical associations between medications, laboratory results, and comorbidities.

Diagnostic Process
The following sections describe the respective phases of CRISP-DM performed by the international research team.

Business understanding
The typical diagnostic procedure for MCI relies on neuropsychological testing, which is insufficiently accurate and does not provide information on patient clinical profiles. Knowing the clinical features of persons with MCI may indicate the pathogenetic pathways associated with this disorder, which could guide research on the factors and mechanisms of progression from MCI to dementia. From the business point of view, it is of interest to have these pathways support the decision process of medical experts. From an analytical point of view, we aimed to provide a set of diagnostic rules in a simple, understandable form to meet this business goal. In parallel, we tested and evaluated the potential of association rules mining for this type of task. This decision also covered the challenge of how to reduce the initially large volume of extracted association rules. For this purpose, we applied the domain knowledge provided by an intensive collaboration framework between medical experts, data analysts, and primary care providers.

Data understanding
The study included 93 participants, 35 M/58 F, 47-89 years old (median 69 years). They were recruited from several general medical practices in the town of Osijek (about 80,000 inhabitants) in eastern Croatia, a region with high rates of CV disease. Only participants who gave their signed, informed consent were included in the study. Data collection for one person did not last longer than six months. The data for this analysis were obtained from a multicomponent dataset specifically established for practicing data mining methods. The data were collected over three years (2007-2010).

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Faculty of Medicine, University of Zagreb (04-76/2006-396).
The data were low-cost, easily available parameters collected to determine the health status of the examined patients (Table 1). A large proportion of these parameters is routinely collected in PC electronic health records (eHRs). Examples include nominal parameters that are known to influence the level of inflammation and circulation properties, such as age and gender, diagnoses of the main groups of chronic diseases, and continuous use of some medications, including statins (hypolipidemic agents), analgesics and anticoagulant/antiaggregant drugs (Antonopoulos et al., 2012). Information on adverse drug reactions was used to indicate inappropriate polypharmacy prescriptions, which is usually the case in multimorbid and frail elderly persons (Tjia et al., 2010). Anthropometric measurements, if not updated in eHRs, were taken at patient encounters (Avila-Funes et al., 2009; Whitmer, 2007). Several laboratory tests were chosen to indicate age-related pathophysiologic changes, including information on the level of inflammation, nutritional status, chronic renal impairment, CV metabolic factors, and thyroid gland hormones, which are all widely cited health-related risk factors for cognitive impairment (Bugnicourt et al., 2013; Hogervorst et al., 2010; Postiglione et al., 2001; Roberts et al., 2009; Umegaki, 2014). The results of some laboratory tests were used for this, as these are part of regular chronic disease surveillance programs. For more specific biochemical and hematological tests, patients were referred to the central laboratory of the Osijek Clinical Hospital for a venipuncture. All laboratory tests were performed according to the standard procedures. The medical experts used creatinine clearance and serum homocysteine as measures of renal function decline. Increased serum homocysteine concentrations (hyperhomocysteinemia) were reported as a CV risk factor (Postiglione et al., 2001).
Insulin measurements in a fasting state were obtained to approximate the level of insulin resistance (Schrijvers et al., 2010). The degree of inflammation was indicated by C-reactive protein (CRP), total leukocyte count, and serum protein electrophoresis fractions (Roberts et al., 2009; Jain et al., 2011). We also used other laboratory parameters to enlarge the scope of possible factors for the association rules, including parameters indicating common chronic latent infections and disturbed age-related immune reactions (Cavagna et al., 2012; Deleidi et al., 2015; Futagami et al., 1998).
The participants were also screened for cognitive impairment using the MMSE, the most widely used screening test for assessing cognitive function, validated in many populations, including the Croatian elderly population (Boban et al., 2012). A score of 24 or less (out of the maximum of 30) indicates cognitive impairment. The test is relatively sensitive in diagnosing overt dementia but is less accurate in distinguishing cognitively healthy individuals from those with MCI.

Data preparation
First, we performed an exploratory data analysis to visually clarify the relationships between some input parameters and MCI diagnosis as the output parameter (Fig. 3). As an example, we provide this figure illustrating the fact that almost all patients with MCI are 61 years or older. The generated graphs were discussed with the domain expert to detect possible outliers or knowledge important for the modelling phase. The dataset contained 1.64% missing values, which were replaced using the k-Nearest Neighbours (k-NN) algorithm (Altman, 1992) to improve data quality and provide the conditions for various machine learning methods. For a given point in a multi-dimensional space, k-NN finds its k closest neighbours. Each missing value was approximated by the values of the points (patients) closest to it, based on the other available input variables. The initial value of the k parameter was experimentally evaluated.
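The imputation idea described above can be sketched as follows. This is one common variant of k-NN imputation and not necessarily the exact procedure used in this study: the distance measure (root mean squared difference over the variables observed in both patients), the averaging of neighbour values, the function name `knn_impute`, and the example values are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance between two rows is the root mean squared difference over the
    columns observed in both rows (an assumption of this sketch). In practice
    the columns should be scaled first so that no variable dominates.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for m, other in enumerate(rows):
                # A donor must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                donors.append((dist, other[j]))
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical patients described by two numeric variables; the last one has a gap.
patients = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.05, None]]
completed = knn_impute(patients, k=2)
```

Here the missing second value of the last patient is filled from the two patients whose first value is closest, so the distant third patient does not influence the estimate.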
Next, we investigated possible confounding relationships between input variables using correlation analysis. Association rules mining, including the Apriori algorithm, requires a discretisation of the numerical attributes. Thus, the body mass index (BMI) parameter was transformed according to the standard categorization of the World Health Organization, cited elsewhere, where BMI < 18.5 indicates underweight, 18.5 ≤ BMI < 25 indicates normal weight, 25 ≤ BMI < 30 indicates overweight and BMI ≥ 30 indicates obesity. The relationships between the input parameter BMI and the target parameter MCI are depicted in Fig. 4. The median values of the parameter MCI differ in obese and overweight persons (above the cut-off score for MCI diagnosis) compared to persons with normal weight (Mann-Whitney-Wilcoxon test, p=0.07).

Fig. 4. Boxplot representing the relationships between the input parameter of BMI (x-axis) and the target parameter of MCI (y-axis). Source: Authors.
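The WHO-based BMI categorization can be written as a small mapping from a numeric value to a category label. The sketch below uses non-overlapping half-open intervals; the handling of values lying exactly on a boundary (18.5, 25, 30) and the function name are assumptions of this illustration.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to a WHO-style category label."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

# Hypothetical values turned into items usable by association rules mining.
items = ["BMI=" + bmi_category(v) for v in (17.0, 22.4, 27.9, 31.2)]
```

Each patient then carries exactly one BMI item, which is the form the Apriori algorithm expects for a discretized attribute.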
We transformed the other numerical variables using two methods (Table 2): unsupervised K-Means clustering and supervised Chi Merge. These methods were used instead of the typical discretization procedure with a fixed-length window, because some relevant components may be lost if cut-offs are incorrectly placed within the pattern.
The Chi Merge algorithm uses the χ² statistic to discretize numeric parameters (Kerber, 1992). It starts from many narrow intervals and repeatedly merges adjacent intervals whose class distributions do not differ significantly. Because the Chi Merge algorithm tends to construct many categories, we used the K-Means clustering algorithm (Lloyd, 1982) as an alternative discretization method. For example, the parameter fasting glucose (FGlu) was divided into three ranges that are very similar to the standard distributions of blood glucose levels used in practice (Umegaki, 2014). In general, the K-Means discretisation method provided results closer to the typical cut-off values described in the literature. The modelling phase was not customized to these discretisation methods; the aim was to provide a broader view of target diagnostics.
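A minimal sketch of K-Means discretisation of a single numeric variable is given below. It is an illustration under stated assumptions, not the procedure used in this study: the deterministic initialisation, the use of midpoints between cluster centres as cut-offs, the function names, and the glucose values are all hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on one numeric variable; returns sorted cluster centres."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start (an assumption): centres spread over the sorted values.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres)

def cut_points(centres):
    """Interval boundaries taken as midpoints between neighbouring centres."""
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

def discretize(value, cuts):
    """Index of the interval (0 .. len(cuts)) that `value` falls into."""
    return sum(value > c for c in cuts)

# Hypothetical fasting glucose values, split into three ranges.
glucose = [4.5, 4.8, 5.0, 5.2, 6.2, 6.5, 6.8, 9.0, 10.5, 11.0]
cuts = cut_points(kmeans_1d(glucose, k=3))
labels = [discretize(v, cuts) for v in glucose]
```

Unlike a fixed-length window, the cut-offs follow the natural grouping of the observed values, which is why the resulting ranges can be compared with clinically used thresholds.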

Modelling
We applied the Apriori implementation available in the R language (package "arules") to several data samples prepared in the pre-processing phase, such as the subset containing only CV risk factors. As CV risk factors, we considered the following parameters: age, sex, Hyper, DM, Fglu, HbA1c, Chol, TG, HDL, statins, CVD, BMI, w/h, Arm cir, skinf, Anticoag, Clear, INS, HOMCIS, CRP.
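To make the modelling step more tangible, the following Python sketch enumerates rules whose consequent is fixed to the target diagnosis and ranks them by lift. It is not the authors' R code and does not use "arules"; it is a brute-force illustration, and the records, the item labels, the thresholds, and the function name `mine_rules` are hypothetical.

```python
from itertools import combinations

def mine_rules(records, target, min_support, min_confidence, max_len=2):
    """Enumerate rules X => target with at most `max_len` items in X."""
    n = len(records)
    target_support = sum(target in r for r in records) / n
    target_attr = target.split("=")[0]
    # The antecedent may use any item except values of the target attribute.
    items = sorted({i for r in records for i in r
                    if i.split("=")[0] != target_attr})
    rules = []
    for size in range(1, max_len + 1):
        for lhs in combinations(items, size):
            lhs_set = frozenset(lhs)
            n_lhs = sum(lhs_set <= r for r in records)
            if n_lhs == 0:
                continue
            n_both = sum(lhs_set <= r and target in r for r in records)
            supp = n_both / n
            conf = n_both / n_lhs
            if supp >= min_support and conf >= min_confidence:
                rules.append({"lhs": lhs, "rhs": target, "support": supp,
                              "confidence": conf,
                              "lift": conf / target_support})
    # Strongest rules first: by lift, then by support.
    return sorted(rules, key=lambda r: (-r["lift"], -r["support"], r["lhs"]))

# Hypothetical discretized records (one set of items per patient).
records = [
    {"age=old", "sex=F", "CRP=high", "MCI=yes"},
    {"age=old", "sex=F", "CRP=high", "MCI=yes"},
    {"age=old", "sex=M", "CRP=low", "MCI=no"},
    {"age=young", "sex=F", "CRP=low", "MCI=no"},
    {"age=old", "sex=F", "CRP=low", "MCI=yes"},
    {"age=young", "sex=M", "CRP=low", "MCI=no"},
]
rules = mine_rules(records, "MCI=yes", min_support=0.3, min_confidence=0.8)
```

Lift compares the confidence of a rule with the overall frequency of the consequent, so a value above 1 indicates that the antecedent is found together with the diagnosis more often than expected by chance. Fixing the consequent is one simple way to keep the rule set small before an expert reviews it.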

Evaluation
From the data mining point of view, the crucial aspect was to effectively reduce the initial set of generated rules and provide a comprehensive overview for the domain expert. The expert provided a deeper analysis of these results, and we present only part of it as an example.
A high proportion of examined patients, 37 out of 93 (39.8%), were found to be diagnosed with MCI. This proportion is higher than is usually reported in epidemiologic studies and can be explained by a high burden of chronic diseases due to recent negative socioeconomic trends in the region. As reported elsewhere for cognitive disorders, women dominated over men (F/M = 25/12), and they mostly belonged to the elderly population group (67-75 years).
The conclusion which arises from these results is that patients with MCI show low-level variation and they all belong to the standard clinical framework. For example, the extracted rules shared the following six parameters: statins, age, TG, INS, CRP, and Clear, indicating CV risk factors (Tables 3 and 4). These parameters indicate the important mechanisms of ageing diseases, such as increased insulin resistance (parameters INS and TG), a low-level chronic inflammation (parameters statins and CRP), and chronic renal impairment (parameter Clear), for which evidence also suggests their involvement in the development of cognitive impairment (Antonopoulos et al., 2012; Bugnicourt et al., 2013; Roberts et al., 2009; Schrijvers et al., 2010).
Chronic conditions of aging share a common background and pathophysiological pathways (Buchanan, 2006). Therefore, if the input dataset is sufficiently large, many of the parameters extracted in the rules would be complementary to each other. In other words, these parameters cluster together, indicating the common pathways. The knowledge needed to understand the complementarity of CV risk factors is exacting and requires the active input of a medical expert.
The clinical phenotype of MCI, indicated by the rules that are presented in Table 3, is characteristic of women aged 70.5 or more (as shown by the parameters: age=70.5 and sex=F). The clinical features of this phenotype may be reconstructed from the interval-values of the parameters that are most frequently present in these rules, including INS, TG, Clear, and skinf (Table 3). This syndrome is characterized by increased inflammation and muscle wasting, the condition called frailty (in the rules indicated by higher interval-values of the parameters skinf and CRP). Also, we identified disturbed glucose-related metabolism, called insulin resistance, which in this syndrome is specifically presented with high serum insulin values and low values of the CV risk factors, fasting serum glucose and serum triglycerides (Walker et al., 2017; Whaley-Connell & Sowers, 2018).
In the rules related to CV risk factors performed by another discretization method (Table 4), the parameter Clear, an indicator of low renal function, is substituted with its complementary parameter HOMCIS. This alternative indicates variations in serum concentrations of the substance homocysteine and is a marker of impaired renal function (Van Guldener, 2006). This difference in the composition of the rules presented in Tables 3 and 4, concerning the alternate presence of the complementary parameters Clear and HOMCIS, may be the additional expression of the same chronic renal impairment syndrome. Another possibility is that these groups of rules indicate two variants of this syndrome, which differ from each other in the levels of serum homocysteine concentrations (Lloyd, 1982).
When we used the whole dataset, the rules (Tables 5 and 6) contained some new parameters, besides those indicating CV risk factors, indicating the even broader clinical context of MCI, and posing new hypotheses. These hypotheses must be confirmed in future studies. Many rules consisted of a limited number of uniform phrases, indicating several independent functional units, or pathways, which operate within the pathophysiology framework of MCI.

Deployment
The cooperation between primary care providers, medical experts and data analysts resulted in a new set of knowledge confirmed in relevant conditions and based on the respective data sample. It represents only the first step in the verification and approval process, but these first steps are essential for generating a diagnostic prototype or, in other words, a Clinical Decision Support System.

Discussion and Conclusions
Association rule mining is a useful method for mapping potentially relevant parameters in multicomponent datasets, in problem-solving tasks that are still associated with uncertainties. This preliminary study should be understood as a proof of concept. The analysis of the rules has been presented, and the reasoning underlying this analysis has been discussed. This method allows a hierarchy to be established in complex pathophysiological networks. Our brief overview of the current situation has shown the main direction of research activities in this domain: how to filter the generated set of rules effectively. One possibility is to apply suitable methods in the preprocessing phase, such as clustering, feature selection, or domain knowledge provided by the expert or stored in a knowledge base. The second possibility is to filter the resulting list by a suitable pruning approach or by relevant metrics such as confidence, support, or lift. In general, this direction should improve the initial results.
In this study, greater emphasis in the development of mild cognitive impairment is placed on a more specific pathophysiological background, including immune system disorders and malnutrition, than on a broader clinical framework, based on chronic renal impairment and long-term hypertension. The extracted rules may help medical professionals recognize clinical patterns, although the method is approximate, and the final decision depends on the expert. The CV risk factors are the best-known risk factors for MCI. The rules that contain only these factors (Tables 3 and 4) therefore show convincingly how the association rule mining method works in solving complex medical problems, such as aging-associated comorbidities. When the results of the analysis of the rules containing only CV risk factors (Tables 3 and 4) and of those containing the whole dataset (Tables 5 and 6) were taken together, it was possible to recognize the two main clinical patterns that are likely to be associated with MCI. The first one is specific for the younger patient group (63-73 years old), and the second one is specific for the older patient group (73 years old and more).
Finally, we want to stress the importance of cooperation between health services and computer science methods. This cooperation can be a win-win situation in which enough quality data will be available, common vocabulary will be defined, analytical results will be presented in a simple, understandable form, the knowledge will be applied into practice, and further integrated into a unique, reliable scenario.
The design of new studies is usually planned within the framework of the existing knowledge. This is precisely the advantage of exploratory studies such as this one, where rule-based methods and large datasets are used to look for exciting new parameters and new concepts which are likely to go beyond the current theories. It contributes to the continuously improved and enhanced body of knowledge used in the medical diagnostics process. Specifically, in the case of mild cognitive impairment, early diagnostics is essential to prevent the serious decline of dementia. The obtained results helped us to design further research questions and experiments. In our future work, we will aim at providing a view on expected diagnostics that is as comprehensive as possible, considering all available aspects from various sides.