Cognitive Function Clinic, Walton Centre for Neurology and Neurosurgery, Liverpool, UK
Abstract
This chapter examines the presentation of the results of diagnostic test accuracy studies. It emphasizes the critical importance of reporting specific details about study participants and test results. Diagnostic test accuracy in terms of specific measures of discrimination and comparison is examined for some of the key investigations currently used in dementia diagnosis, namely cognitive screening instruments (both performance based and informant based) and biomarkers based on functional neuroimaging and cerebrospinal fluid neurochemistry, in order to exemplify the utility of these measures.
Keywords
Dementia · Diagnostic test accuracy studies · Participants · Results · Sensitivity and specificity · Likelihood ratios

The results of diagnostic test accuracy studies indicate whether or not the tests under examination are of clinical utility. Assuming the methodology of the study to have been sound (Chaps. 2 and 3), and biases avoided as far as possible (Sect. 1.4), the presentation of the results in a meaningful fashion is critical if other clinicians are to judge whether or not the test will be useful in their practice, in other words whether the results have external validity and may be generalizable.
4.1 Participants
4.1.1 Study Duration and Setting
Study duration (i.e. when the study began and when it ended) would seem a very simple datum to collect, but in fact this was only partially reported or not reported in many studies examined in a systematic review of biomarker studies for dementia diagnosis (Noel-Storr et al. 2013).
In proof-of-concept studies, the question of study duration may initially be open-ended, since rate of recruitment of patients with specific diagnoses may be uncertain, particularly for uncommon conditions or if there are stringent inclusion/exclusion criteria (Sect. 2.1.1.1). For example, a study based in a secondary care memory clinic which aimed to evaluate patients with mild Alzheimer’s disease (AD) and amnestic mild cognitive impairment (aMCI) with a new patient self-administered cognitive screening instrument took more than 2 years to recruit 100 patients (Brown et al. 2014). With the consecutive patient samples recruited in pragmatic diagnostic accuracy studies, study duration may be less of an issue, and a fixed duration (e.g. 6 months, 12 months) is often used if referral rates are sufficient to generate an adequate study population in the specified time frame (Larner 2014a).
Study setting must be made explicit: whether population-based, in the setting of a specific community (e.g. care or retirement home), or in primary, secondary or tertiary care (e.g. general neurology clinic or dedicated memory or cognitive disorders clinic).
In order to recruit an adequate number of subjects (e.g. with an unusual condition) in a reasonable time frame, some diagnostic test accuracy studies need to be multicentre in nature. For example, a study of the diagnostic utility of functional imaging using 123I-FP-CIT SPECT (DaTSCAN) in identifying cases of dementia with Lewy bodies (DLB) recruited patients from 40 different geographical sites (McKeith et al. 2007), even though some authorities believe DLB to be the second most common cause of neurodegenerative dementia after AD, accounting for 10–15 % of patients seen in dementia clinics. In pragmatic diagnostic accuracy studies, recruitment of consecutive patients may ensure adequate participant numbers from a single site. Multicentre studies are subject to possible intercentre variations, for example in clinical assessment (perhaps minimised by use of widely accepted diagnostic criteria; Sect. 2.2.1) or in biochemical assays (e.g. Hort et al. 2010). In these situations, harmonisation of study protocols and standardisation of sample handling are required to minimise variation.
4.1.2 Demographics
Key demographic characteristics of the study population should be described in order to enhance the external validity of the study, specifically details on patient age and gender (F:M ratio) and the prevalence of the target condition (e.g. dementia, mild cognitive impairment, any cognitive impairment) in the population studied. In addition, it may be necessary for patient ethnicity and educational level to be documented, since many cognitive screening tests have not been shown to be culture-fair and/or are susceptible to the effects of patient educational level.
4.1.2.1 Age
Usually the age range and median age of the study population should be stated. Median age usually differs in populations examined in memory clinics led by old age psychiatrists compared to those led by neurologists. Systematic reviews and meta-analyses may note studies in which such demographic information is missing and omit them from data pooling. For example, in a systematic review of studies of the Addenbrooke’s Cognitive Examination (ACE) and its Revised form (ACE-R), Crawford et al. (2012) justly criticised a preliminary study (Larner 2007a) for failure to include, inter alia, details on participant gender and age, omissions rectified in a later report of this study (Larner 2013a).
Correlation between performance on cognitive screening instruments and patient age is generally observed (Table 4.1), hence the importance of stating the age structure of the population examined.
Table 4.1
Summary of correlation coefficients for patient age and scores on various cognitive screening instruments examined in pragmatic diagnostic test accuracy studies
Test | r | Performance | t | p | Reference |
---|---|---|---|---|---|
MMSE | −0.23 | No | 3.63 | <0.001 | Larner (2012a) |
MMP | −0.26 | No | 4.06 | <0.001 | Larner (2012a) |
ACE-R | −0.32 | Low | 4.47 | <0.001 | Hancock and Larner (2015) |
M-ACE | −0.32 | Low | 3.90 | <0.001 | Larner (2015a) |
6CIT | 0.33 | Low | 5.55 | <0.001 | Abdel-Aziz and Larner (2015) |
MoCA | −0.38 | Low | 4.94 | <0.001 | Larner (2012b) |
TYM | −0.30 | Low | 4.61 | <0.001 | Hancock and Larner (2011) |
H-TYM | −0.37 | Low | 2.37 | <0.02 | Larner (2015b) |
FAB | −0.21 | No | 1.44 | <0.5 | Larner (2013b) |
Poppelreuter figure | −0.13 | No | 1.29 | <0.5 | Sells and Larner (2011) |
SDI | 0.09 | No | 0.66 | >0.5 | Culshaw and Larner (2009) |
AD8 | 0.02 | No | 0.28 | >0.5 | Larner (2015c) |
FCS | −0.02 | No | 0.11 | >0.5 | Larner (2012c) |
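The statistical significance of such correlation coefficients follows directly from r and the sample size. A minimal sketch (in Python, assuming scipy is available; the sample size used here is hypothetical) of the standard t-test for a Pearson correlation coefficient of the kind tabulated above:

```python
import math
from scipy import stats

def correlation_t_test(r: float, n: int):
    """Test the null hypothesis rho = 0 for a Pearson correlation r in a sample of size n."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p value, n - 2 degrees of freedom
    return t, p

# Hypothetical example: r = -0.32 observed in a cohort of 150 patients
t, p = correlation_t_test(-0.32, 150)
print(f"t = {t:.2f}, p = {p:.4f}")
```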
As regards AD biomarkers, there is reported to be a correlation between age and quantitative ratings of amyloid deposition in amyloid PET scans in healthy controls (i.e. the prevalence of abnormal Abeta deposition rises in normal ageing), but not in patients with cognitive impairment (Johnson et al. 2013). No correlations between age and CSF biomarkers (Abeta42, T-tau, P-tau) were found in patients with AD or incipient AD by Mattsson et al. (2009), but age correlated with T-tau and P-tau in controls, and in stable MCI age correlated with all three CSF biomarkers.
4.1.2.2 Gender
The sex ratio of study participants (F:M) should be stated. Population-based studies suggest that dementia is more prevalent in women, driven largely by the prevalence of Alzheimer’s disease (Lobo et al. 2000), and perhaps related to their greater longevity. Hence for purposes of external validity, a female predominance in dementia diagnostic test accuracy studies may be desirable.
Since women are generally more aware of health-related issues and show greater willingness to address them than men, it is not surprising that they constitute the majority of patients seen in general neurology clinics (Larner 2011:27–8,43–4; Fig. 4.1a). However, in the author’s experience, cognitive clinics generally show a slight preponderance of male patients (Fig. 4.1b), and very few diagnostic test accuracy studies of clinical signs and of cognitive screening instruments performed in the author’s cognitive clinic have shown a female preponderance amongst the participants (e.g. Larner 2007b, 2012d). One may speculate on the reason(s) for this reversed ratio: because women are more aware of health-related issues, they may be more able and willing to detect memory problems in their spouses and bring them to medical attention than vice versa.
Fig. 4.1
Referrals by patient gender to (a) general neurology clinic 2001–2010 (Larner 2011) and (b) cognitive neurology clinic 2009–2014
On the other hand, for certain dementia disorders, particularly behavioural variant frontotemporal dementia, there is an acknowledged male predominance, which should be reflected in diagnostic test accuracy studies (e.g. Larner 2013b).
The possibility of a gendered behavioural difference in the performance of certain tests has some empirical support: for example, the “head turning test” as a marker of cognitive impairment is more frequently observed, and may have greater diagnostic utility, in women than in men (Abernethy Holland and Larner 2013; Larner 2014a:61–3).
4.1.2.3 Disease Prevalence
Prevalence of the target condition in the study population should also be stated. Generally prevalence, or pretest probability (from which pretest odds are easily calculated: Sect. 2.1.1.2), increases from community samples, to primary care samples (both low), to secondary care samples (high).
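Pretest odds follow directly from pretest probability (Sect. 2.1.1.2). A minimal worked sketch (the prevalence values are purely illustrative):

```python
def pretest_odds(prevalence: float) -> float:
    """Convert pretest probability (prevalence) to pretest odds."""
    return prevalence / (1 - prevalence)

# Illustrative prevalences: community/primary care (low) vs memory clinic (high)
for prev in (0.05, 0.25, 0.50):
    print(f"prevalence {prev:.2f} -> pretest odds {pretest_odds(prev):.2f}")
```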
Disease prevalence may change over time. In the general population, dementia prevalence is rising as the global population ages (e.g. Prince et al. 2013), and more diagnoses of dementia are being made in England according to figures from the Health and Social Care Information Centre (http://www.hscic.gov.uk/article/4902/Number-of-patients-with-recorded-diagnosis-of-dementia-increases-by-62-per-cent-over-seven-years).
However, these population-based changes may not be mirrored in other settings. For example, dementia prevalence in cohorts sampled in a secondary care memory clinic fell from high (ca. 50 %) to low (<25 %) over the decade 2003–2013 (Menon and Larner 2011; Larner 2014a:14–5, 2014b), as reflected in a cumulative sum (cusum) plot of annual cohorts seen in this clinic (Fig. 4.2). These changes probably reflect changes in referral practice from primary to secondary care, possibly as a consequence of the influence of nationally issued directives on dementia such as the National Dementia Strategy (Larner 2014b).
Fig. 4.2
Cusum plot: dementia diagnoses in CFC referrals, 2009–2014
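The construction of a cusum plot of this kind is straightforward: a running total of the deviations of each annual dementia proportion from a fixed reference proportion is accumulated year by year. A minimal sketch follows, using entirely hypothetical annual cohort figures and an assumed reference proportion of 0.5; the actual method and data underlying Fig. 4.2 may differ in detail.

```python
import numpy as np

# Hypothetical annual cohorts: (dementia diagnoses, total referrals) for successive years
annual_cohorts = [(60, 120), (55, 130), (50, 140), (45, 160), (40, 180), (35, 200)]
reference_proportion = 0.5  # assumed reference (target) dementia proportion

# Cusum: cumulative sum of deviations of each year's proportion from the reference
deviations = [d / n - reference_proportion for d, n in annual_cohorts]
cusum = np.cumsum(deviations)

for year, value in zip(range(2009, 2009 + len(cusum)), cusum):
    print(f"{year}: cusum = {value:+.2f}")
```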
4.1.3 Participant Loss
Participant loss is an inevitable fact of clinical studies. Such losses need to be documented, ideally by means of a flow diagram, so that readers can see where and for what reasons dropouts have occurred.
Participant loss may be due to patient-related factors, including declining the invitation to participate in the study, failing the inclusion/exclusion criteria for the study on initial screening, or withdrawal at some point during the study protocol. For example, in the index study of the Addenbrooke’s Cognitive Examination (ACE), of 210 patients attending the clinic, only 139 (=66 %) fulfilled the study criteria (Mathuranath et al. 2000). Such considerations may affect external validity of diagnostic test accuracy studies.
Participant loss may be due to investigator-related factors, including failure to apply the index test and/or the reference standard, or to apply them correctly according to the operationalization of the study, and loss of patient data (administrative failings).
The reasons why subjects who met study inclusion criteria did not undergo either the index test or the reference standard are poorly reported (Noel-Storr et al. 2013).
4.2 Test Results
Ideally test results should be presented in such a way that they are easily understandable and suitable for inclusion in systematic reviews and meta-analyses. Some journals require a summary of key points in addition to narrative text, and this will include the “headline” results, as will the article abstract (Sect. 1.2).
4.2.1 Interval Between Diagnostic Test and Reference Standard
The time interval between administration of the diagnostic test and the reference standard (Sect. 2.1) would seem a fairly simple datum to collect. Ideally they should occur on the same day. This is eminently feasible for administration of cognitive screening instruments, but more often there is some time lapse, particularly for investigative tests. For example, in a study of the diagnostic utility of functional imaging using 123I-FP-CIT SPECT in identifying cases of DLB, diagnostic test and reference standard were reported to be administered “within a few weeks” of each other (McKeith et al. 2007). In a study of the relationship between AD biomarkers and cognitive screening instruments (Galvin et al. 2010), the time range between clinical assessment and amyloid (PiB-PET) imaging was 0–30 months (mean 5.1 ± 9.7 months) and for CSF studies was 0–22 months (mean 0.8 ± 6.9 months). Concerns about prolonged intervals between diagnostic testing and reference standard relate to the possible effects of disease progression and/or the administration of any treatments in the meantime.
Systematic reviews and meta-analyses may note studies in which information about the interval between diagnostic test and reference standard is missing and omit them from data pooling. For example, in a systematic review of studies of the Addenbrooke’s Cognitive Examination and its Revised form, Crawford et al. (2012) criticised a preliminary study (Larner 2007a) for failure to include details on the interval between administration of diagnostic test and reference standard, an omission rectified in a later report of this study (Larner 2013a).
4.2.2 Distribution of Disease Severity
Documentation of severity of disease is particularly desirable in diagnostic test accuracy studies in dementia because cognitive impairment is a process rather than an event, and hence often of changing severity over time. At the time of recruitment to a study, therefore, participants may show considerable heterogeneity with respect to disease severity, particularly in pragmatic studies recruiting consecutive clinic attenders. This variation in disease severity may have an impact on study results expressed in terms of parameters such as sensitivity and specificity.
There are various staging systems for dementia, of which the Clinical Dementia Rating (CDR; Hughes et al. 1982; Morris 1993) has perhaps gained the widest overall current acceptance and use. CDR is a global staging measure, rather than a purely neuropsychological test instrument. It is based on a combination of patient assessment and caregiver interview, rating memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care. About 40 min is needed to gather the required information, which may make it unsuitable for use in clinical, as opposed to research, environments. Ratings range from 0 to 3. A CDR score of 0.5 (questionable dementia) correlates, although is not necessarily synonymous, with mild cognitive impairment. CDR = 1 is reported to have good sensitivity and specificity in screening for dementia (Juva et al. 1995) and the test is reliably and consistently scored (Schafer et al. 2004). CDR has been used as a reference standard when assessing new diagnostic tests, such as the AD8 informant interview (Galvin et al. 2006, 2010).
Small numbers of participants in diagnostic test accuracy studies may preclude examination of test performance with stage of disease. However, a test with low sensitivity in a pragmatic study may not be anticipated to be useful for identification of early cases of disease (e.g. Sells and Larner 2011:21).
Dementia severity may not be measured in pragmatic diagnostic accuracy studies when the purpose of the study is to examine all-comers irrespective of severity, thus reflecting day-to-day clinical practice (Larner 2012b:395). Alternatively, a simple dichotomisation between dementia and mild cognitive impairment (or cognitive impairment no dementia, or mild cognitive dysfunction) may be made, based on operationalized clinical criteria with or without CDR administration. This distinction may have clinical significance, increasingly so should disease-modifying therapy suitable for application in mild stages of disease become available.
4.2.3 Cross Tabulation and Dropouts
Cross tabulation is the process of cross-classifying the diagnostic test results and the reference standard. If both of these are binary, as in the simplest case, then a standard 2 × 2 data table is produced (Fig. 3.1) with every case classified as a true positive (TP), a false positive (FP), a true negative (TN), or a false negative (FN). If there are missing or indeterminate results from either process, a more complex table may be required (e.g. 2 × 3, 3 × 3, 2 × 4; Sect. 3.2).
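A minimal sketch (with hypothetical binary data) of cross-classifying index test results against the reference standard to obtain the four cells of the 2 × 2 table:

```python
# Hypothetical paired binary results: 1 = positive, 0 = negative
index_test         = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
reference_standard = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(index_test, reference_standard))
TP = pairs.count((1, 1))  # true positives
FP = pairs.count((1, 0))  # false positives
FN = pairs.count((0, 1))  # false negatives
TN = pairs.count((0, 0))  # true negatives

print(f"TP={TP}, FP={FP}, FN={FN}, TN={TN}")
```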
Some explanation of the reasons for and handling of missing data should be provided. For example, more dropouts may occur with patient self-administered tests, such as the Test Your Memory (TYM) test, than with clinician-administered tests (4.5 % patient dropout rate in the study of Hancock and Larner 2011). Handling of missing or indeterminate data in diagnostic test accuracy studies is poorly reported (Noel-Storr et al. 2013).
4.2.4 Adverse Effects of Testing
Some information about the acceptability of the index test to patients and/or carers should be incorporated into any diagnostic test accuracy study report. The presence or absence of adverse events in diagnostic test accuracy studies is poorly reported (Noel-Storr et al. 2013).
For clinic-based tests such as cognitive screening instruments or questionnaires, adverse effects may include factors such as the time taken to administer the test and any effects on the running of the clinic (increased consultation time, slowing of the smooth running of the clinic). For investigations such as imaging and EEG, factors such as tolerability and safety become more pertinent, and for invasive tests such as CSF analysis or brain biopsy, morbidity (such as pain, haematoma, low pressure headache, infection) and even mortality may need to be mentioned. Some neuroimaging modalities may be uncomfortable and hence difficult to tolerate (e.g. MRI: noise, claustrophobia), and some require injection of radioactive tracers (e.g. FDG- and amyloid-PET).
Application of the reference standard may also be associated with adverse effects, and comparison of these with other tests may be important in considering whether new tests are likely to be widely adopted, irrespective of comparisons of measures of discrimination.
4.3 Estimates of Diagnostic Accuracy
Some examples of the use of the various estimates of diagnostic accuracy, or measures of discrimination (Sect. 3.3) and comparison (Sect. 3.4), are considered here.
For illustrative purposes, the focus is on key investigations currently used in dementia diagnosis: cognitive screening instruments (both performance based and informant based); functional and structural neuroimaging; and cerebrospinal fluid (CSF) neurochemistry. Whilst most of these are cross-sectional assessments, and hence assess inter-individual change, informant tests have the capacity to be longitudinal (or “ambispective”: Sect. 2.1.4) and hence may assess intra-individual change. As previously mentioned (Chap. 1), screening tests and diagnostic tests serve different purposes, but both categories may be submitted to test accuracy studies and both are in common clinical usage; hence both are included here.
Recently published (IWG-2) diagnostic criteria for Alzheimer’s disease (Dubois et al. 2014) imply that this diagnosis cannot be made (at least in a research setting) without access either to cerebrospinal fluid (CSF) biomarkers (Abeta, total-tau, and/or phospho-tau, or ratios thereof) or to some form of amyloid positron emission tomography (PET) imaging (e.g. with the Pittsburgh compound B, PiB (Zhang et al. 2014), or florbetapir, also known as AV-45 or Amyvid), since these form part of the pathophysiological signature of the disease. Diagnostic test accuracy studies of these investigational modalities will therefore be particularly referred to, as will studies of 123I-FP-CIT SPECT (DaTSCAN or ioflupane), since the revised International Consensus Criteria for dementia with Lewy bodies (McKeith et al. 2005) recommended that low basal ganglia dopamine transporter (DAT) uptake seen on such imaging be used as a suggestive feature for diagnosis, and the IWG-2 criteria require abnormal dopamine transporter imaging in addition to the AD criteria for evidence of mixed (AD + DLB) dementia (Dubois et al. 2014). Other biomarkers may be incorporated into diagnostic criteria in the future, for example blood-based protein biomarkers (Kiddle et al. 2014; Sattlecker et al. 2014) and serum metabolomics (Trushina et al. 2013; Trushina and Mielke 2014), which are attracting increasing interest, but in the current context these are not discussed further.
Downstream, topographical, or progression markers of dementia (Dubois et al. 2014) include regional brain structural and metabolic changes. These may be examined by investigational modalities such as:
structural magnetic resonance imaging (MRI), including use of support vector machines applying statistical learning theory for the automatic classification of scans (Kloppel et al. 2008);
functional magnetic resonance imaging (fMRI);
proton magnetic resonance spectroscopy (1H-MRS);
fluoro-deoxyglucose positron emission tomography (FDG-PET); and
single photon emission computed tomography (SPECT) using various ligands.
For example, AD is characterised by medial temporal lobe (particularly hippocampal) atrophy on MRI, with temporoparietal hypoperfusion and hypometabolism on SPECT and FDG-PET respectively. However, because these changes have low pathological specificity they do not form part of the IWG-2 diagnostic criteria for AD (Dubois et al. 2014), although they may be used to measure disease progression, and hence may potentially be used as outcome measures in treatment trials.
Inevitably, more sophisticated investigational modalities are largely if not exclusively confined to research settings and some tertiary care institutions. This lack of universal availability will limit their use, and hence the use of diagnostic criteria dependent upon them. In contrast, cognitive and non-cognitive screening instruments may potentially be used in a broader range of settings (community, primary and secondary care) since they require no more than pen and paper and/or a laptop for administration and scoring, with occasional tests being entirely verbal (and hence even suitable for administration by telephone). It is therefore likely that these tests will remain in widespread use, despite their screening rather than (clinicobiological) diagnostic role (Chap. 1), as reflected in their lower specificity, but they may have high sensitivity for identifying patients requiring additional, more sophisticated diagnostic investigations. Many such screening instruments are available (Burns et al. 2004; Kelly and Newton-Howes 2004; Hatfield et al. 2009; Ismail et al. 2010; Tate 2010; Lischka et al. 2012; Larner 2013c; Moyer 2014), of which only a small selection is mentioned here, focussing particularly on those in frequent use (Shulman et al. 2006; Ismail et al. 2013).
4.3.1 Significance Tests: Null Hypothesis Testing
One possible way to express the outcome of diagnostic test accuracy studies is to use significance testing based on the null hypothesis (Sect. 3.1). In studies addressing phase I/II questions, this approach may be used to compare aggregate data from a group with the target disease and a control group. In studies addressing phase III questions, this approach may be used to compare aggregate data from groups with dementia/no dementia, cognitive impairment/no cognitive impairment, dementia/mild cognitive impairment, mild cognitive impairment/no cognitive impairment, or to compare groups with specific diagnoses (e.g. Alzheimer’s disease/frontotemporal lobar degeneration). Some examples of such significance testing for a variety of cognitive and non-cognitive screening instruments examined in pragmatic diagnostic test accuracy studies are shown in Table 4.2.
Table 4.2
Significance testing based on null hypothesis for test scores from pragmatic diagnostic test accuracy studies examining various cognitive and non-cognitive screening instruments, in patient groups with: (a) dementia/no dementia; (b) any cognitive impairment/no cognitive impairment; (c) dementia/mild cognitive impairment; (d) mild cognitive impairment/no cognitive impairment; (e) Alzheimer’s disease/frontotemporal dementia; (f) behavioural variant frontotemporal dementia (bvFTD)/non bvFTD
(a) Dementia/no dementia | |||||
Test (score range) | Mean score: dementia | Mean score: no dementia | t | p | Reference |
MMSE (0–30) | 19.7 ± 4.8 | 27.6 ± 2.8 | 15.0 | <0.001 | Hancock and Larner (2011) |
MMP (0–32) | 17.1 ± 6.4 | 26.5 ± 4.3 | 11.7 | <0.001 | Larner (2012a) |
ACE-R (0–100) | 60.5 ± 11.3 | 87.6 ± 8.2 | 15.6 | <0.001 | Hancock and Larner (2011) |
M-ACE (0–30) | 13.6 ± 5.2 | 21.8 ± 5.5 | 6.66 | <0.001 | Larner (2015a) |
TYM (0–50) | 23.2 ± 12.3 | 40.2 ± 8.2 | 44.1 | <0.001 | Hancock and Larner (2011) |
Poppelreuter figure (0–4) | 3.32 ± 1.09 | 3.85 ± 0.36 | 3.67 | <0.001 | Sells and Larner (2011) |
Global CBI (0–324) | 99.3 ± 54.0 | 59.1 ± 34.8 | 3.48 | <0.001 | Hancock and Larner (2008) |
IQCODE (1–5) | 4.10 ± 0.43 | 3.76 ± 0.44 | 4.52 | <0.001 | Hancock and Larner (2009a) |
PHQ-9 (0–27) | 4.1 ± 5.4 | 7.8 ± 7.9 | 2.80 | <0.01 | Hancock and Larner (2009b) |
Global PSQI (0–20) | 5.1 ± 4.2 | 7.6 ± 5.1 | 4.64 | <0.001 | Hancock and Larner (2009c) |
(b) Any cognitive impairment/no cognitive impairment | |||||
Test/Score range | Mean score: any cognitive impairment | Mean score: no cognitive impairment | t | p | Reference |
MMSE (0–30) | 23.6 ± 3.8 | 27.7 ± 2.1 | 6.62 | <0.001 | Larner (2012b) |
MoCA (0–30) | 18.3 ± 4.5 | 25.2 ± 3.2 | 12.0 | <0.001 | Larner (2012b) |
(c) Dementia/mild cognitive impairment | |||||
Test/Score range | Mean score: dementia | Mean score: mild cognitive impairment | t | p | Reference |
MMSE (0–30) | 22.2 ± 3.9 | 25.3 ± 3.1 | 2.02 | <0.05 | Larner (2012b) |
MMP (0–32) | 17.1 ± 6.4 | 24.0 ± 3.7 | 5.2 | <0.001 | Larner (2012a) |
M-ACE (0–30) | 13.6 ± 5.2 | 17.1 ± 5.3 | 2.56 | <0.02 | Larner (2015a) |
MoCA (0–30) | 16.6 ± 4.4 | 20.4 ± 3.8 | 3.19 | <0.01 | Larner (2012b) |
TYM (0–50) | 23.2 ± 12.3 | 37.5 ± 6.2 | 6.9 | <0.001 | Hancock and Larner (2011) |
(d) Mild cognitive impairment/no cognitive impairment | |||||
Test/Score range | Mean score: mild cognitive impairment | Mean score: no cognitive impairment | t | p | Reference |
MMSE (0–30) | 24.9 ± 3.2 | 27.1 ± 3.2 | 3.3 | <0.01 | Larner (2012a) |
MMP (0–32) | 24.0 ± 3.7 | 27.1 ± 4.2 | 3.6 | <0.001 | Larner (2012a) |
M-ACE (0–30) | 17.1 ± 5.3 | 24.4 ± 3.7 | 8.48 | <0.001 | Larner (2015a) |
TYM (0–50) | 37.5 ± 6.2 | 41.1 ± 8.6 | 2.4 | <0.01 | Hancock and Larner (2011) |
(e) Alzheimer’s disease/frontotemporal dementia | |||||
Test/Score range | Mean score: Alzheimer’s disease | Mean score: frontotemporal dementia | t | p | Reference |
IADL Scale (0–14) | 9.7 ± 3.4 | 10.5 ± 4.4 | 0.65 | >0.5 | Larner and Hancock (2008) |
Global CBI (0–324) | 93.6 ± 53.1 | 101.2 ± 56.3 | 0.44 | >0.5 | Hancock and Larner (2008) |
IQCODE (1–5) | 3.94 ± 0.39 | (bvFTD) 4.34 ± 0.31 | 3.25 | <0.01 | Larner (2010) |
(f) bvFTD/non bvFTD | |||||
Test/Score range | Mean score: bvFTD | Mean score: non bvFTD | t | p | Reference |
FAB (0–18) | 9.06 ± 3.34 | 11.66 ± 3.84 | 2.27 | <0.05 | Larner (2013b) |
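A minimal sketch of the two-sample comparison underlying the values in Table 4.2, using simulated scores: the group means and standard deviations are taken from the MMSE row of Table 4.2a, but the group sizes and individual scores are hypothetical, and Welch’s unequal-variance variant of the t-test is used here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated screening test scores for two diagnostic groups (hypothetical group sizes)
scores_dementia    = rng.normal(loc=19.7, scale=4.8, size=80)
scores_no_dementia = rng.normal(loc=27.6, scale=2.8, size=120)

# Two-sample t-test of the null hypothesis that the group means are equal
t, p = stats.ttest_ind(scores_dementia, scores_no_dementia, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2g}")
```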
There are a number of problems with this approach, particularly in the pragmatic setting. Test scores or results may not be normally distributed, complicating the analysis. Such results from aggregate data may not be particularly helpful for the diagnosis of individual patients.
However, a possible advantage may be to answer pragmatic questions. For example, the differential diagnosis of depression and dementia is one which not infrequently arises in clinical practice. Patients attending a cognitive disorders clinic were administered a depression rating scale, PHQ-9 (Kroenke et al. 2001); the null hypothesis that the proportion of patients with at least moderate depression did not differ significantly between patients diagnosed by reference standard with dementia (6/49 = 0.12) and without dementia (26/64 = 0.41) was examined, and rejected (χ2 = 11.3, df = 1, p < 0.01). Hence PHQ-9 scores may help clinicians decide which patients presenting to cognitive disorders clinics merit a trial of antidepressant medication (Hancock and Larner 2009b), as may scores on the Cornell Scale for Depression in Dementia (Hancock and Larner 2015). Similarly, the Pittsburgh Sleep Quality Index (PSQI) may indicate which patients have significant sleep disturbance as a potential contributor to memory complaints which may be amenable to treatment (Hancock and Larner 2009c).
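A minimal sketch reproducing (approximately, and without continuity correction) the χ² test from the counts given above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# From the text: 6/49 patients with dementia and 26/64 without dementia
# had at least moderate depression on the PHQ-9
table = np.array([[6, 49 - 6],      # dementia: depressed, not depressed
                  [26, 64 - 26]])   # no dementia: depressed, not depressed

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")  # chi2 ≈ 11, df = 1, p < 0.01
```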
4.3.2 Measures of Discrimination
4.3.2.1 Accuracy; Net Reclassification Improvement (NRI)
Although the term “accuracy” may feature in the titles of papers examining diagnostic tests, this is often a shorthand (or misnomer) for “sensitivity and specificity”, with overall accuracy or correct classification accuracy not actually reported.
In other studies, accuracy may be a secondary outcome measure, with sensitivity and specificity the primary outcome measures. For example in a multicentre study of 123I-FP-CIT SPECT imaging in patients fulfilling DSM-IV criteria for dementia as well as criteria for at least one dementia subtype (DLB, AD, or vascular dementia), the mean accuracy of three readers was stated to be 0.857 (McKeith et al. 2007).
Accuracy of cognitive screening instruments for dementia has been examined in a series of pragmatic diagnostic test accuracy studies, with values ranging from 0.78 (DemTect) to 0.89 (Addenbrooke’s Cognitive Examination-Revised; Table 4.3a).
Table 4.3
Overall maximal accuracy of (a) various cognitive screening instruments and (b) non-cognitive screening instruments examined in pragmatic diagnostic test accuracy studies for dementia diagnosis; (a) includes surrogate measures of administration time
(a) Cognitive screening instruments | ||||
Test | Accuracy for dementia diagnosis (95 % CI) | Total score | Number of items/questions | Reference |
MMSE | 0.86 (0.81–0.90) | 30 | 21 | Larner (2012a) |
MMP | 0.86 (0.81–0.91) | 32 | 23 | Larner (2012a) |
ACE | 0.84 (0.80–0.88) | 100 | 52 | Larner (2007c) |
ACE-R | 0.89 (0.85–0.93) | 100 | 66 | Larner (2013a) |
M-ACE | 0.84 (0.78–0.91) | 30 | 10 | Larner (2015a) |
6CIT | 0.80 (0.75–0.85) | 28 | 7 | Abdel-Aziz and Larner (2015) |
DemTect | 0.78 (0.71–0.86) | 18 | 13 | Larner (2007b) |
MoCA | 0.81 (0.75–0.88) | 30 | 22 | Larner (2012b) |
TYM | 0.83 (0.78–0.88) | 50 | 25 | Hancock and Larner (2011) |
(b) Non–cognitive screening instruments | ||||
Test | Accuracy for dementia diagnosis (95 % CI) | Reference | |
IADL Scale | 0.69 (0.64–0.75) | Hancock and Larner (2007) | ||
CBI | 0.62 (0.54–0.69) | Hancock and Larner (2008) | ||
PHQ-9 | 0.62 (0.53–0.71) | Hancock and Larner (2009b) | ||
CSDD | 0.59 (0.52–0.64) | Hancock and Larner (2015) | ||
PSQI | 0.63 (0.58–0.69) | Hancock and Larner (2009c) | ||
IQCODE | 0.67 (0.59–0.74) | Hancock and Larner (2009a) | ||
Accuracy for any cognitive impairment | ||||
AD8 | 0.67 (0.60–0.73) | Larner (2015c) |
When the relationship between overall test accuracy for the diagnosis of dementia versus no dementia and surrogate measures of test administration time (total test score, total number of test items/questions) was examined, positive correlations were found between accuracy and these measures of administration time. This suggests that there may be a trade-off between time and accuracy in cognitive assessment using such instruments (Larner 2015d).
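This correlation can be checked against the values tabulated in Table 4.3a; a minimal sketch using a Pearson correlation (the published analysis (Larner 2015d) may have used a different correlation method):

```python
from scipy import stats

# Accuracy and number of items/questions for the nine instruments in Table 4.3a
accuracy = [0.86, 0.86, 0.84, 0.89, 0.84, 0.80, 0.78, 0.81, 0.83]
n_items  = [21, 23, 52, 66, 10, 7, 13, 22, 25]

r, p = stats.pearsonr(accuracy, n_items)
print(f"r = {r:.2f}, p = {p:.2f}")
```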
In a meta-analysis of studies of the cognitive screening instruments ACE, ACE-R, and MMSE (Larner and Mitchell 2014), in high prevalence settings such as memory clinics where the prevalence of dementia may be 50 % or higher, overall accuracy favoured ACE-R (0.916) over ACE (0.872) and MMSE (0.895).
Accuracy of non-cognitive screening instruments, including some informant scales, for dementia has been examined in a series of pragmatic diagnostic test accuracy studies, with values ranging from 0.59 (Cornell Scale for Depression in Dementia) to 0.69 (Instrumental Activities of Daily Living Scale), hence poorer than the accuracy of cognitive screening instruments (Table 4.3b; compare Table 4.3a).
Accuracy of CSF biomarkers for incipient Alzheimer’s disease in MCI patients may be calculated from published data (Mattsson et al. 2009:389; Table 4.4, row 1). None appears particularly accurate at the cutoffs used. Different methods for determining the cutoff (as examined for this dataset by Bartlett et al. 2012) might of course produce different results.
Table 4.4
Diagnostic parameters for CSF biomarkers for diagnosis of incipient AD in MCI patients
CSF Assay | Abeta42 ≤482 ng/L | Phospho-tau ≥52 ng/L | Total-tau ≥320 ng/L |
---|---|---|---|
Accuracy | 0.71 (0.68–0.75) | 0.60 (0.57–0.64) | 0.67 (0.63–0.70) |
Sensitivity | 0.79 (0.75–0.84) | 0.84 (0.80–0.88) | 0.86 (0.81–0.90) |
Specificity | 0.67 (0.63–0.71) | 0.47 (0.43–0.51) | 0.56 (0.52–0.60) |
Youden index | 0.46 | 0.31 | 0.42 |
Positive Predictive Value | 0.58 (0.53–0.63) | 0.47 (0.43–0.52) | 0.52 (0.48–0.57) |
Negative Predictive Value | 0.85 (0.82–0.89) | 0.84 (0.80–0.88) | 0.87 (0.84–0.91) |
Predictive Summary Index | 0.43 | 0.31 | 0.39 |
Positive Likelihood Ratio | 2.41 (2.20–2.62) | 1.59 (1.44–1.75) | 1.94 (1.77–2.13) |
Negative Likelihood Ratio | 0.31 (0.28–0.34) | 0.34 (0.31–0.37) | 0.26 (0.23–0.28) |
Diagnostic Odds Ratio | 7.80 (7.15–8.51) | 4.68 (4.24–5.16) | 7.56 (6.88–8.29) |
CUI+ | 0.46 (poor) | 0.40 (poor) | 0.45 (poor) |
CUI− | 0.57 (adequate) | 0.39 (poor) | 0.488 (poor) |
AUC ROC | 0.78 (0.75–0.82) | 0.76 (0.72–0.80) | 0.79 (0.76–0.83) |
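All of the parameters listed in Table 4.4 (and Table 4.5), other than the AUC ROC, can be derived from the same 2 × 2 cell counts. A minimal sketch with hypothetical counts, assuming the usual definitions (e.g. clinical utility indices as CUI+ = sensitivity × PPV and CUI− = specificity × NPV); confidence intervals are omitted for brevity.

```python
def diagnostic_parameters(TP: int, FP: int, FN: int, TN: int) -> dict:
    """Standard measures of discrimination derived from a 2 x 2 table."""
    sens = TP / (TP + FN)
    spec = TN / (TN + FP)
    ppv  = TP / (TP + FP)
    npv  = TN / (TN + FN)
    return {
        "Accuracy": (TP + TN) / (TP + FP + FN + TN),
        "Sensitivity": sens,
        "Specificity": spec,
        "Youden index": sens + spec - 1,
        "Positive Predictive Value": ppv,
        "Negative Predictive Value": npv,
        "Predictive Summary Index": ppv + npv - 1,
        "Positive Likelihood Ratio": sens / (1 - spec),
        "Negative Likelihood Ratio": (1 - sens) / spec,
        "Diagnostic Odds Ratio": (sens / (1 - spec)) / ((1 - sens) / spec),
        "CUI+": sens * ppv,
        "CUI-": spec * npv,
    }

# Hypothetical cell counts
for name, value in diagnostic_parameters(TP=80, FP=40, FN=20, TN=100).items():
    print(f"{name}: {value:.2f}")
```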
A preliminary study of amyloid PET imaging using florbetapir in a small series of patients who subsequently underwent post-mortem brain examination suggested a very high accuracy of positive PET scans for pathologically confirmed AD (28/29 correctly classified, hence accuracy = 0.97; Clark et al. 2011). In a subsequent larger (“phase 2”) study (n = 184), Johnson et al. (2013) looked at visual and quantitative ratings of florbetapir amyloid PET scans in patients with AD dementia, MCI, and healthy controls. The accuracy results are shown in Table 4.5 (a and b, row 1).
Table 4.5
Diagnostic parameters for florbetapir amyloid PET imaging visual (vAbeta+/−) and quantitative (qAbeta+/−) ratings for diagnosis of (a) AD dementia/no dementia; (b) any cognitive impairment/no cognitive impairment
(a) AD dementia/no dementia (MCI + healthy controls) | ||
Florbetapir amyloid PET | AD dementia – vAbeta+ | AD dementia – qAbeta+ |
Accuracy | 0.76 (0.72–0.79) | 0.73 (0.66–0.79) |
Sensitivity | 0.76 (0.68–0.83) | 0.84 (0.74–0.95) |
Specificity | 0.76 (0.71–0.80) | 0.69 (0.61–0.77) |
Youden index | 0.52 | 0.53 |
Positive Predictive Value | 0.50 (0.38–0.62) | 0.47 (0.36–0.58) |
Negative Predictive Value | 0.91 (0.85–0.96) | 0.93 (0.88–0.98) |
Predictive Summary Index | 0.41 | 0.40 |
Positive Likelihood Ratio | 3.09 (2.21–4.32) | 2.73 (2.07–3.61) |
Negative Likelihood Ratio | 0.32 (0.23–0.45) | 0.23 (0.17–0.30) |
Diagnostic Odds Ratio | 9.55 (6.82–13.4) | 12.1 (9.18–16.0) |
CUI+ | 0.38 (poor) | 0.40 (poor) |
CUI− | 0.68 (good) | 0.64 (good) |
(b) Any cognitive impairment (AD dementia + MCI)/no cognitive impairment (healthy controls) | ||
Florbetapir amyloid PET | Cognitive impairment – vAbeta+ | Cognitive impairment – qAbeta+ |
Accuracy | 0.68 (0.61–0.75) | 0.67 (0.61–0.74) |
Sensitivity | 0.54 (0.48–0.64) | 0.60 (0.51–0.69) |
Specificity | 0.86 (0.78–0.94) | 0.77 (0.68–0.86) |
Youden index | 0.40 | 0.37 |
Positive Predictive Value | 0.84 (0.75–0.93) | 0.78 (0.69–0.87) |
Negative Predictive Value | 0.59 (0.50–0.68) | 0.59 (0.50–0.69) |
Predictive Summary Index | 0.43 | 0.37 |
Positive Likelihood Ratio | 3.90 (2.19–6.93) | 2.63 (1.70–4.07) |
Negative Likelihood Ratio | 0.53 (0.30–0.94) | 0.52 (0.34–0.80) |
Diagnostic Odds Ratio | 7.34 (4.13–13.1) | 5.08 (2.63–7.85) |
CUI+ | 0.46 (poor) | 0.47 (poor) |
CUI− | 0.50 (adequate) | 0.46 (poor) |
A study of neuropsychological, neuroimaging (FDG-PET), and CSF neurochemistry in patients with MCI followed up for about 2 years compared each test for prediction of conversion to dementia and of cognitive decline (Landau et al. 2010). Overall, the accuracy of the various tests in this delayed verification study ranged from 0.74 to 0.90, with a test of episodic memory (Auditory Verbal Learning Test) proving more accurate (=0.90) than CSF chemistry (CSF total tau:Abeta1-42 ratio = 0.81) and FDG-PET (=0.76; this study did not include amyloid PET).
Richard et al. (2013) used a net reclassification improvement (NRI) methodology (Sect. 3.3.1) to quantify the incremental value of MRI of entorhinal cortex volume and CSF phospho tau:Abeta ratio after cognitive testing with Rey’s Auditory Verbal Learning Memory Test in a group of patients with MCI followed up for over 3 years to predict progression to AD. In isolation, all tests improved diagnostic classification, but using the NRI it was shown that after the memory testing (0.21) the MRI parameter hardly affected diagnostic accuracy (0.22) and the CSF biomarker actually decreased diagnostic accuracy (0.19). It would be interesting to repeat such an NRI analysis using clinicobiological markers such as amyloid PET imaging and other CSF parameters such as total tau:Abeta1-42 ratio. Using the method of Richard et al. (2013), NRI for various cognitive screening instruments examined in pragmatic diagnostic test accuracy studies is shown in Table 4.6.
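In the form applied here (and tabulated in Table 4.6), the improvement reduces to the difference between the posterior probability of a correct diagnosis (overall accuracy) and the prior probability (prevalence); a minimal sketch:

```python
def nri(prevalence: float, accuracy: float) -> float:
    """Net reclassification improvement as tabulated in Table 4.6:
    posterior probability (accuracy) minus prior probability (prevalence)."""
    return accuracy - prevalence

# Example using the MMP row of Table 4.6a: prevalence 0.23, accuracy 0.86
print(f"NRI = {nri(0.23, 0.86):.2f}")  # 0.63
```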
Table 4.6
Prior (pretest) probability (prevalence), posterior probability (accuracy), and net reclassification improvement (NRI; after method of Richard et al. 2013) for various cognitive screening instruments examined in pragmatic diagnostic accuracy studies for diagnosis of (a) dementia/no dementia; (b) any cognitive impairment/no cognitive impairment; (c) mild cognitive impairment/no cognitive impairment
(a) Dementia/no dementia | ||||
Test | Prior (pretest) probability (= prevalence) | Posterior probability (= Accuracy) | NRI | Reference |
MMSE | 0.35 | 0.82 | 0.47 | Larner (2013a) |
MMP | 0.23 | 0.86 | 0.63 | Larner (2012a) |
ACE | 0.49 | 0.84 | 0.35 | Larner (2007c) |
ACE-R | 0.35 | 0.89 | 0.54 | Larner (2013a) |
M-ACE | 0.18 | 0.84 | 0.68 | Larner (2015a) |
6CIT | 0.20 | 0.80 | 0.60 | Abdel-Aziz and Larner (2015) |
DemTect | 0.52 | 0.78 | 0.26 | Larner (2007b) |
TYM | 0.35 | 0.83 | 0.48 | Hancock and Larner (2011) |
Poppelreuter figure | 0.28 | 0.72 | 0.44 | Sells and Larner (2011) |
(b) Any cognitive impairment/no cognitive impairment | ||||
Test | Pretest probability (= prevalence) | Posterior probability (= Accuracy) | NRI | Reference |
MMSE | 0.43 | 0.79 | 0.36 | Larner (2012b) |
MoCA | 0.43 | 0.81 | 0.38 | Larner (2012b) |
(c) Mild cognitive impairment/no cognitive impairment | ||||
Pretest probability