Cognitive Function Clinic, Walton Centre for Neurology and Neurosurgery, Liverpool, UK
Abstract
This chapter examines the discussion section of papers describing diagnostic test accuracy studies. It emphasizes the need to summarise key study results, discuss the clinical applicability of the results, and highlight any potential shortcomings and limitations of the study. A brief consideration of the vicissitudes of the publication process is also offered.
Keywords
Dementia · Diagnostic test accuracy studies · Applicability · Limitations · Publication

5.1 Summary of Key Results
Ideally the Discussion section of a diagnostic test accuracy study should open with a brief summary of the key findings of the study, perhaps not dissimilar to that in the abstract (Sect. 1.2).
This summary may be accompanied by a comparison with the results of any previous similar diagnostic test accuracy studies, in order to contextualise the interpretation of the study results. If findings are concordant, this may help to validate the test under study by indicating the reproducibility of results (Sect. 4.3.6). If findings are discordant, then some attempt to explain why this might be so is called for. This might relate to differences in study setting, casemix, operationalization of index test and/or reference standard, definition of test cutoffs, interpretation of test results, or any combinations thereof.
It is not unreasonable, in addition to these objective outcomes, for the authors to express a subjective impression of the study outcome (e.g. encouraging, unexpected, disappointing, startling). Any features of the study deemed to be unique (“the first diagnostic test accuracy study to examine …”) or particularly notable (“the largest/most comprehensive such study undertaken hitherto …”) may be stated, but some qualification is advisable (e.g. “to the authors’ knowledge”) as a partial defence against any accusation of hubris or contradiction.
5.2 Clinical Applicability
With any diagnostic test accuracy study there will be questions as to how far the study findings can be generalized. In other words, how relevant is your study to other practitioners in the field? Both STARD (Bossuyt et al. 2003) and STARDdem (Noel-Storr et al. 2014) recommend discussion of the clinical applicability of study findings.
STARDdem guidelines recommend that authors identify the stage of development of the test, for example proof-of-concept, or “defining accuracy in a typical spectrum of patients”, definitions which would seem to correspond with phase I/II and phase III/pragmatic studies (Sackett and Haynes 2002) respectively.
Study setting is a key determinant of clinical applicability. For example, can the results of diagnostic test accuracy studies undertaken in secondary care settings be applied in primary care or community settings? Almost certainly not, owing to differences in disease prevalence between the settings. Meta-analyses may treat separately studies of the same cognitive screening instrument performed in different settings (e.g. MMSE: Mitchell 2013). Few cognitive and behavioural screening instruments are validated in more than one setting (van der Linde et al. 2014).
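The influence of prevalence on predictive values can be made concrete with a short worked example. The following sketch applies Bayes' theorem; the sensitivity, specificity, and prevalence figures are hypothetical and are not drawn from any of the studies cited above.

```python
# Illustrative sketch (hypothetical figures): how predictive values shift with
# disease prevalence at fixed sensitivity (0.85) and specificity (0.90).

def predictive_values(sens, spec, prev):
    """Positive and negative predictive values via Bayes' theorem."""
    ppv = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
    npv = (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Dementia prevalence is far higher in a memory clinic than in the community.
for setting, prev in [("community", 0.05), ("primary care", 0.15), ("memory clinic", 0.50)]:
    ppv, npv = predictive_values(0.85, 0.90, prev)
    print(f"{setting:>13}: prevalence={prev:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

At a community prevalence of 5% the positive predictive value falls below one third despite respectable sensitivity and specificity, underlining why results obtained in secondary care cannot simply be transferred to primary care or community settings.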
Even in similar settings, there may be differences in patient age and comorbidity between the study population and patients typically seen in clinical practice, perhaps related to participant inclusion/exclusion criteria established at the outset of the study (Sects. 2.1.1.1 and 5.3.1). Pragmatic studies will minimise any such differences, and hence produce outcomes which may be more reflective of day-to-day clinical practice.
The nature of the diagnostic test is also relevant. For some tests (neuroimaging, laboratory analyses) there may be only limited availability outside of secondary care settings, and testing for biomarkers of Alzheimer’s disease and other specific dementias may only be available in research settings and/or dedicated cognitive disorders clinics or memory centres. In contrast, tests applicable by clinicians within the clinic room, so called “bedside” tests or “near patient testing” (i.e. results available without reference to a laboratory and rapidly enough to affect immediate patient management; Delaney et al. 1999:824), include cognitive screening instruments, some of which may be applicable in primary care (Brodaty et al. 2006; Cordell et al. 2013) and the community as well as in secondary care settings.
STARDdem also recommends discussion of whether reported data demonstrate "added" or "incremental" value of the index test over and above other routine diagnostic tests. Depending on study design, this may require comparison with data from other studies, although it is more desirable to undertake a head-to-head comparison in the same patient cohort. For example, since the Addenbrooke's Cognitive Examination (ACE) and its Revised form (ACE-R) both incorporate the Mini-Mental State Examination (MMSE), each instrument could be studied simultaneously with the MMSE in proof-of-concept (Mathuranath et al. 2000; Mioshi et al. 2006) and pragmatic studies (Larner 2005, 2013a). Studies gathering data on cognitive testing, neuroimaging, and CSF neurochemistry may facilitate comparison of the utility of these various diagnostic tests (e.g. Galvin et al. 2010; Landau et al. 2010; Gomar et al. 2011).
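Where paired scores on two instruments and a common reference standard are available for the same cohort, the head-to-head comparison itself is straightforward. The sketch below is purely illustrative: the scores and diagnoses are invented, and the cutoffs only loosely echo commonly cited thresholds; it is not a reconstruction of any study cited above.

```python
# Hypothetical head-to-head comparison in one cohort: because the ACE-R embeds
# the MMSE, a single administration yields both scores, so both instruments can
# be evaluated against the same reference standard. All data below are invented.

def sens_spec(scores, diagnoses, cutoff):
    """Sensitivity/specificity where a score <= cutoff counts as test-positive."""
    tp = sum(s <= cutoff and d for s, d in zip(scores, diagnoses))
    fn = sum(s > cutoff and d for s, d in zip(scores, diagnoses))
    tn = sum(s > cutoff and not d for s, d in zip(scores, diagnoses))
    fp = sum(s <= cutoff and not d for s, d in zip(scores, diagnoses))
    return tp / (tp + fn), tn / (tn + fp)

diagnoses = [True, True, True, False, False, False]  # reference standard: dementia?
acer      = [60, 70, 80, 90, 95, 99]                 # ACE-R scores (max 100)
mmse      = [18, 24, 26, 27, 29, 30]                 # MMSE scores (max 30)

for name, scores, cutoff in [("ACE-R", acer, 82), ("MMSE", mmse, 24)]:
    se, sp = sens_spec(scores, diagnoses, cutoff)
    print(f"{name}: sensitivity={se:.2f}, specificity={sp:.2f}")
```

In this toy cohort the ACE-R detects a dementia case that the embedded MMSE misses, which is the kind of incremental value a head-to-head design can demonstrate directly.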
5.3 Shortcomings/Limitations
No diagnostic test accuracy study is without its limitations, inherent in the biases accepted at the time of study planning (Sect. 1.4). This is acknowledged by tools such as QUADAS and its subsequent iteration, QUADAS-2, which may be used retrospectively to assess the methodological rigour of diagnostic test accuracy studies (Whiting et al. 2004, 2011). Other checklists to rate the methodological quality of diagnostic accuracy studies may be used, such as the one based on the Scottish Intercollegiate Guidelines Network (SIGN 2007) and used by Crawford et al. (2012).
QUADAS-2 anchoring statements specific to diagnostic test accuracy studies in dementia have also been proposed to address the risk of study bias (Davis et al. 2013; Box 5.1).
Box 5.1: QUADAS-2 Anchoring Statements Specific to Diagnostic Test Accuracy Studies in Dementia (Adapted from Davis et al. (2013))
Selection:
Was a case-control or similar design avoided?
Was the sampling method appropriate?
Are exclusion criteria described and appropriate?
Index test:
Was index test assessment performed without knowledge of clinical dementia diagnosis?
Were index test thresholds pre-specified?
Were sufficient data on index test application given for the test to be repeated in an independent study?
Reference standard:
Was the assessment used for clinical diagnosis of dementia acceptable?
Was the clinical assessment for dementia performed without knowledge of the index test?
Were sufficient data on dementia assessment method given for the assessment to be repeated in an independent study?
Flow:
Was there an appropriate interval between index test and clinical dementia assessment?
Did all participants get the same assessment for dementia regardless of index test result?
Were all participants who received the index test included in the final analysis?
Were missing index test results or uninterpretable results reported?
Applicability:
Were those included representative of the general population of interest?
Was the index test performed consistently and in a manner similar to its use in clinical practice?
Was clinical diagnosis of dementia or other reference standard made in a manner similar to current clinical practice?
In reporting a diagnostic test accuracy study, it is best to highlight any shortcomings or limitations in the context of the Discussion, prior to having them pointed out by a referee or reviewer with potentially detrimental effects on the chances of the paper being accepted for publication.
Sources of potential bias (Sect. 1.4) should be briefly discussed, since no diagnostic test accuracy study can be without bias. It is paramount to address carefully the many areas identified as being poorly reported in diagnostic test accuracy studies (Noel-Storr et al. 2013; see Box 5.2).
Box 5.2: Elements of Diagnostic Test Accuracy Studies Found to Be Poorly Reported (Adapted from Noel-Storr et al. (2013))
Participant sampling
Training and expertise in administering index test and reference standard
Blinding to results of index test and reference standard
Methods for calculating test reproducibility for index test and reference standard
Study dates
Reasons for subjects meeting inclusion criteria who did not undergo index test and reference standard
Presence or absence of adverse events
Handling of missing or indeterminate data
Variability of diagnostic accuracy between subgroups, participants, readers, or centres
Estimates of reproducibility of index test and reference standard
5.3.1 Participants
The chosen cohort of study participants will inevitably affect the study findings. This is partly related to study setting (community, primary or secondary care; Sect. 5.2) but also to participant inclusion/exclusion criteria established at study outset. Factors which may be controlled for, such as patient age, ethnicity, presence of comorbidities, and stage of disease, may all differentiate the study population from patients typically seen in clinical practice. This will relate in part to the nature of the study, be it proof-of-concept/phase I/II or pragmatic/phase III, with less stringent inclusion/exclusion criteria in the latter, which ideally aims to recruit consecutive patients.
The inclusion of a control group, as occurs in proof-of-concept/phase I/II studies, will also need to be discussed, since this is not something which occurs in clinical practice. However, significant numbers of individuals with neither dementia nor mild cognitive impairment are referred to memory clinics; their exact problem often remains unclear but is probably heterogeneous, as reflected in the various diagnostic labels which have been applied, including "memory complainers", "worried well", subjective memory impairment, mild cognitive dysfunction, and functional memory disorder (Schmidtke et al. 2008). Some may in fact be in a prodromal phase of a dementia disorder (Mitchell et al. 2014).
It is generally recognised that patients with dementia included in clinical research studies are systematically younger than patients from the general population (Schoenmaker and Van Gool 2004). Patient gender ratio may also be an issue (Sect. 4.1.2.2).
To ensure homogeneity of patient cohorts, multiple testing centres may be required, with a consequent need for standardization of index test and reference standard application across centres.
5.3.2 Test Results
The chosen study design will also impact on test results. Cross-sectional studies, the idiom of clinical practice, will inevitably incur some diagnostic error, even in the most skilled clinical hands. The consequent misclassification of patients, and hence of their study results, may inflate or dilute measures of discrimination. Reassessment of participants after a period of time, a longitudinal study, makes excellent clinical sense in order to review and, if necessary, revise diagnoses, but such a policy of delayed verification increases study duration and complexity, and also introduces the problem of dropouts and of how to handle missing data.
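The dilution of apparent accuracy by reference standard misclassification can be illustrated with a small simulation. All parameters below are hypothetical; the point is simply that even a modest rate of diagnostic error in a cross-sectional reference standard pulls the apparent sensitivity and specificity of the index test below their true values.

```python
# Minimal simulation (hypothetical parameters): the effect of diagnostic error
# in a cross-sectional reference standard on apparent index test accuracy.
import random

random.seed(0)
N, PREV = 10_000, 0.5               # cohort size, dementia prevalence
TRUE_SENS, TRUE_SPEC = 0.90, 0.90   # "true" accuracy of the index test
MISCLASS = 0.10                     # fraction of reference diagnoses that are wrong

truth = [random.random() < PREV for _ in range(N)]
index = [(random.random() < TRUE_SENS) if d else (random.random() >= TRUE_SPEC)
         for d in truth]
# Cross-sectional clinical diagnosis: mostly right, occasionally wrong.
reference = [not d if random.random() < MISCLASS else d for d in truth]

tp = sum(i and r for i, r in zip(index, reference))
fn = sum((not i) and r for i, r in zip(index, reference))
tn = sum((not i) and (not r) for i, r in zip(index, reference))
fp = sum(i and (not r) for i, r in zip(index, reference))
print(f"apparent sensitivity={tp/(tp+fn):.2f}, apparent specificity={tn/(tn+fp):.2f}")
# Both apparent values fall noticeably below the true 0.90.
```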
Consistent application of index test and reference standard, with operationalization of their administration if necessary, is also critical to the applicability of test results. Inevitably different clinicians will apply tests in slightly different ways (Larner 2014a), although this is less of a problem with laboratory and imaging technologies which undergo quality control and for which standardized protocols may be developed.
Because of case selection and the use of a control group (i.e. the ideal circumstances of a case-control type study), proof-of-concept/phase I/II studies inevitably have better outcomes, in terms of sensitivity, specificity, and other measures of discrimination, than pragmatic/phase III studies, in which discrimination is more difficult, as in clinical practice.
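A simple simulation of this spectrum effect, using invented score distributions, shows why: healthy controls recruited for a phase I/II study score far from the cutoff, so specificity looks excellent, whereas the mixed non-dementia referrals of a pragmatic study crowd the diagnostic boundary.

```python
# Hedged illustration of spectrum bias: the same index test and cutoff yield
# better specificity against healthy controls than against a pragmatic mixed
# clinic population. All score distributions below are invented.
import random

random.seed(1)
CUTOFF = 24  # hypothetical screening cutoff; score <= cutoff is "positive"
dementia      = [random.gauss(18, 4) for _ in range(500)]
healthy       = [random.gauss(29, 1) for _ in range(500)]  # phase I/II controls
clinic_nondem = [random.gauss(26, 3) for _ in range(500)]  # pragmatic non-dementia referrals

sens      = sum(s <= CUTOFF for s in dementia) / len(dementia)
spec_cc   = sum(s > CUTOFF for s in healthy) / len(healthy)
spec_prag = sum(s > CUTOFF for s in clinic_nondem) / len(clinic_nondem)
print(f"sensitivity={sens:.2f}; specificity: case-control={spec_cc:.2f}, pragmatic={spec_prag:.2f}")
```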
A results-related issue which may need some consideration concerns the apparently better outcomes of performance-based cognitive screening instruments used in cross-sectional studies, as compared with informant-based scales and with biomarker studies. Informant-based scales might be anticipated to have "added value" because of their longitudinal/ambispective aspect, and biomarker studies both address disease biology (clinicobiological measures) and have a longitudinal aspect (see for example Sect. 4.3.2; compare accuracy data in Tables 4.3, 4.4, and 4.5; Landau et al. 2010; Richard et al. 2013). In particular, delayed memory recall paradigms appear to have especially good diagnostic accuracy for Alzheimer's disease (e.g. Duff et al. 2008; Gavett et al. 2012). Pragmatically this may be a blessing, in that "bedside" tests of memory may be as good as currently available sophisticated technological investigations, but with greater ease of use and availability and lesser cost. This may relate in part to the nature of the estimates of diagnostic accuracy used.
5.3.3 Estimates of Diagnostic Accuracy
Comparing measures of discrimination for performance-based and informant-based cognitive screening instruments and for AD biomarkers, and accepting that this may be comparing different things and in many cases not head-to-head, performance-based tests seem to do as well as, if not better than, the others, despite their failure to address disease biology or to reflect fully the longitudinal disease process rather than a cross-sectional assessment.
Since most diagnostic test accuracy studies cite test sensitivity and specificity as one of, if not the key, outcome measures, as per the recommendations of both STARD (Bossuyt et al. 2003) and STARDdem (Noel-Storr et al. 2014) guidelines, something needs to be said about the selection of test cutoff(s), since this decision critically influences sensitivity and specificity and a number of other measures of discrimination (Sect. 2.2.3). Indeed, the choice of cutoff or threshold determines most of the measures of test discrimination (Sect. 3.3), namely sensitivity, specificity, predictive values, likelihood ratios, diagnostic odds ratio, clinical utility indexes and, by extension, weighted comparison. Other measures, such as the area under the receiver operating characteristic curve (AUC ROC), the Q* index, and Cohen's d, are independent of the chosen cutpoint. The cutoff determined by maximal Youden index was of equal or greater sensitivity than the cutoff determined by maximal test accuracy for all cognitive screening instruments examined in the author's clinic (Larner 2015; Table 4.7), but further studies on this point are required to confirm or refute the generality of this observation.
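The dependence of the chosen cutoff on the optimisation rule can be shown with a short sketch. The scores and diagnoses below are invented; with unequal group sizes, maximising the Youden index (J = sensitivity + specificity - 1) and maximising overall accuracy select different cutoffs, and here the Youden-optimal cutoff is the more sensitive, consistent with the observation reported above.

```python
# Toy comparison (invented data) of two cutoff-selection rules:
# maximal Youden index J versus maximal overall test accuracy.
scores   = [16, 18, 20, 24, 26,                            # dementia cases
            22, 24, 25, 26, 27, 28, 29, 29, 30, 30, 30]    # non-dementia
dementia = [True] * 5 + [False] * 11                       # reference standard

def measures(cutoff):
    """Sensitivity, specificity, Youden J, and accuracy (score <= cutoff is positive)."""
    tp = sum(s <= cutoff and d for s, d in zip(scores, dementia))
    fn = sum(s > cutoff and d for s, d in zip(scores, dementia))
    tn = sum(s > cutoff and not d for s, d in zip(scores, dementia))
    fp = sum(s <= cutoff and not d for s, d in zip(scores, dementia))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return sens, spec, sens + spec - 1, (tp + tn) / len(scores)

candidates    = sorted(set(scores))
best_youden   = max(candidates, key=lambda c: measures(c)[2])
best_accuracy = max(candidates, key=lambda c: measures(c)[3])
for label, c in [("max Youden J", best_youden), ("max accuracy", best_accuracy)]:
    sens, spec, j, acc = measures(c)
    print(f"{label}: cutoff={c}, sens={sens:.2f}, spec={spec:.2f}, J={j:.2f}, acc={acc:.2f}")
```

In this example the Youden-optimal cutoff (26) achieves sensitivity 1.0 while the accuracy-optimal cutoff (20) achieves only 0.6, at the price of some specificity; which trade-off is preferable depends on the intended clinical use of the test.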