Cognitive Assessment





Graham E. Powell



Principles of assessment

Assessment, testing, or measurement is the evaluation of the individual in numerical or categorical terms, adhering to a range of statistical and psychometric principles. Examples of measurement are assigning people or behaviour to categories, using scales to obtain self-ratings or self-reports, using tests of ability and performance, or collecting psychophysiological readings. Even diagnosis is a form of measurement and should have various psychometric properties such as satisfactory reliability and validity. In this chapter we concentrate on cognitive or neuropsychological assessment, which typically employs standardized psychometric tests, but it is axiomatic that the basic principles are applicable to all forms of measurement without exception. For example, stating that a patient does or does not have a symptom is potentially just as much a measurement as stating his or her IQ. It should be noted that this account is of English-language tests; readers elsewhere should note the principles but ask local psychologists which tests they use.

Psychometric tests aim to measure a real quantity—the degree to which an individual possesses or does not possess some feature or trait, such as social anxiety or spelling ability or spatial memory. This real quantity is known in classical test theory as the true score t, and the score that is actually obtained on the given test is the observed score x. It is assumed that the observed score is a function of two values, the true score plus a certain amount of error e, because no test is perfect. Therefore we have the most basic equation in psychometrics: x = t + e. The statistical aim of psychometric measurement is to keep the error term to an absolute minimum so that the observed score is equal to the true score, which happens when the error term is reduced to zero. Of course, this is never achieved, but the error term can be reduced to the minimum by making the test as reliable as possible, where reliability is simply the notion that the test gives the same answer twice.
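The equation x = t + e can be made concrete with a short simulation: one hypothetical person with a true score of 100 is "retested" many times with normally distributed error. This is purely an illustrative sketch; the figures are invented and do not come from any real test.

```python
import random
import statistics

# Sketch of classical test theory, x = t + e: one hypothetical person
# (true score 100) retested many times with random error (SD = 5).
random.seed(42)
true_score = 100
error_sd = 5
observed = [true_score + random.gauss(0, error_sd) for _ in range(10_000)]

# Observed scores cluster around the true score: their mean approaches
# the true score, and their SD approaches the SD of the error term.
mean_observed = statistics.mean(observed)
sd_observed = statistics.stdev(observed)
print(f"mean = {mean_observed:.1f}, SD = {sd_observed:.1f}")
```

The SD of these simulated observed scores is exactly the quantity the next paragraph defines as the standard error of measurement.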

In practice, of course, if a test were repeated many times, each occasion would give a slightly different result, depending on how the person felt, the precise way questions were asked, the details of how answers were scored, or whether there has been any lucky guessing. In other words, observed scores would cluster around the true score. Like the distribution of any variable, the distribution of observed scores would have a mean and a standard deviation. The mean is obviously the true score, and this standard deviation is called the standard error of measurement (SEM). The aim of a good test is to keep the SEM as near as possible to zero, and test manuals should state the actual SEM.

There is a relationship between SEM and the reliability of the test:

SEM = SD√(1 − r11)

where SD is the standard deviation of the test and r11 is the test-retest reliability of the test (expressed as a correlation coefficient ranging from −1 to +1). If the reliability of the test is perfect (+1), as can be seen the SEM will be zero:

SEM = SD√(1 − 1) = SD√0 = 0.

Thus a test should be as reliable as possible because then the observed score will be the true score and the standard error of measurement will be zero.
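As a worked sketch of the formula (the function name is ours, and the SD of 15 is chosen to match an IQ-style scale):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1 - reliability)

# On an IQ-style scale (SD = 15):
print(sem(15, 1.00))  # perfect reliability -> SEM = 0.0
print(sem(15, 0.98))  # highly reliable test -> SEM of about 2.1
print(sem(15, 0.80))  # reliability at the dubious threshold -> SEM of about 6.7
```

Note how quickly the SEM grows as reliability falls: at a test-retest reliability of 0.8, an observed IQ-style score carries an error band of several points either side.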

An unreliable test is always useless, but if reliability can be achieved then it is worth considering the test score and, more specifically, what it measures. The degree to which a test measures what it is supposed to measure is known as validity. There may be various threats to validity. For example, a test of numeracy may be so stressful that scores are highly dependent upon the patient’s anxiety level rather than on his or her ability, or a test of social comprehension may have questions which are culturally biased and so scores may depend in part upon the person’s ethnic background.

In practice, there are various types of reliability and validity, and these are summarized in Tables 1.8.3.1 and 1.8.3.2. Further discussion can be found in Kline.(1)

Having used a reliable and valid test, the next issue is how the numbers are analysed and expressed. It has to be noted first that there are three types of scale of measurement. A nominal scale is when numbers are assigned to various categories simply to label the categories in a manner suitable for entry onto a computer database—the categories actually bear no logical numerical relationship to each other. Examples would be marital status or ethnic background or whether one’s parents were divorced or not. Nominal scales are used to split people into groups, and all statistics are based on the frequency of people in each group. The relationship or association between groups can be examined using χ2 statistics, for example to test whether there is a relationship between being divorced and having parents who divorced. Next there is an ordinal scale, in which larger numbers indicate greater possession of the property in question. Rather like the order of winning a race, no assumptions are made about the magnitude of the difference between any two scale points; it does not matter whether the race is won by an inch or a mile. Ordinal scales allow people to be rank ordered, and such scales can be subjected to non-parametric statistical analysis (the branch of statistics that makes minimal assumptions about intervals and distributions), including the comparison of medians and distributions and the computation of certain correlation coefficients. Finally comes the interval scale, in which each scale point is a fixed interval from the previous one, like height or speed. The types of test described in this chapter for the most part aspire to be interval scales, allowing use of the full range of parametric statistics (which assume equal intervals and normally distributed variables).
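For the nominal-scale case, the χ2 statistic can be computed directly from a contingency table. The counts below are invented purely for illustration of the divorce example; this is a sketch, not real data.

```python
# Invented 2x2 contingency table: rows = divorced yes/no,
# columns = parents divorced yes/no.
observed = [[30, 20],
            [25, 75]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

print(f"chi-square = {chi2:.2f}")
# With 1 degree of freedom the 5 per cent critical value is 3.84, so a value
# this large would indicate an association between the two variables.
```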








Table 1.8.3.1 Types of reliability

Scorer or rater reliability: The probability that two judges will (i) give the same score to a given answer, (ii) rate a given behaviour in the same way, or (iii) add up the score properly. Scorer reliability should be near perfect.

Test-retest reliability: The degree to which a test will give the same result on two different occasions separated in time, normally expressed as a correlation coefficient. A reliability of less than 0.8 is dubious.

Parallel-form reliability: The degree to which two equivalent versions of a test give the same result (usually used when a test cannot be exactly repeated because, say, of large practice effects).

Split-half reliability: If a test cannot be repeated and there are no parallel forms, a test can be notionally split in two and the two halves correlated with each other (e.g. odd items versus even items). There is also a mathematical formula for computing the mean of all possible split halves (the Kuder-Richardson method).

Internal consistency: The degree to which one test item correlates with all other test items, i.e. an ‘intraclass correlation’ such as the α coefficient, which should not drop below 0.7.

Table 1.8.3.2 Types of validity

Face validity: Whether a test seems sensible to the person completing it, i.e. does it appear to measure what it is meant to be measuring? This is in fact not a statistical concept, but without reasonable face validity a patient may see little point in co-operating with a test that seems stupid.

Content validity: The degree to which the test measures all the aspects of the quality that is being assessed. Again, this is not a statistical concept but more a question of expert judgement.

Concurrent validity: Whether scores on a test discriminate between people who are differentiated on some criterion (e.g. are scores on a test of neuroticism higher in those people with a neurotic disorder than in those without such a disorder?). Also, whether scores on a test correlate with scores on a test known to measure the same or similar quality.

Predictive validity: The degree to which a test predicts whether some criterion is achieved in the future (e.g. whether a child’s IQ test predicts adult occupational success; whether a test of psychological coping predicts later psychiatric breakdown). For obvious reasons, these last two types of validity are often jointly referred to as criterion-related validity.

Construct validity: Whether a test measures some specified hypothetical construct, i.e. the ‘meaning’ of test scores. For example, if a test is measuring one construct, there should not be clusters of items that seem to measure different things; the test should correlate with other measures of the construct (convergent validity); it should not correlate with measures that are irrelevant to the construct (divergent validity).

Factorial validity: If a test breaks down into various subfactors, then the number and nature of these factors should remain stable across time and different subject populations.

Incremental validity: Whether the test result improves decision-making (e.g. whether knowledge of neuropsychological test results improves the detection of brain injury).


Having obtained a test score for someone, that score then has to be interpreted in the light of how the general population or various patient groups generally perform on that test. There are two general characteristics of a scale that have to be remembered. The first is the measure of central tendency. Typically one would consider the mean (the arithmetic average), but it is also sometimes useful to consider the median (the middle score) and the mode (the most frequently obtained score). This will be the first hint as to whether the score is normal or whether it is more typical of one group than another. However, in order to gauge precisely how typical a given score is, it is necessary to take into account the standard deviation (SD) of the test (other measures relating to the dispersion of test scores, such as the range or skew, can be considered but are not of such immediate relevance).

As long as the mean and SD of the test are known, it is possible to work out exactly what percentage of people obtain up to the observed score x. This is done by converting the observed score into a standard score z and converting the z-score to a percentile. A standard score is simply the number of SDs away from the mean m, and it will have both negative and positive values (because an observed score can be either below or above the mean, respectively). In other words, z = (x − m)/SD. For reference, Table 1.8.3.3 gives some of the main values of z and what percentage of people score up to those values. It is this percentage that is known as the percentile and it is obtained from statistical tables. For example, a score at the 25th percentile means that 25 per cent of people score lower than that specific score. Obviously, the 50th percentile is the mean of the test. For illustration, the equivalent IQ scores (IQ scores have a mean of 100 and SD of 15) and broad verbal descriptors are also given in Table 1.8.3.3.
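The conversion from observed score to z-score to percentile can be sketched with Python’s standard library, assuming (as Table 1.8.3.3 does) a normal distribution of scores:

```python
from statistics import NormalDist

def z_score(x: float, mean: float, sd: float) -> float:
    """Standard score: number of SDs the observed score lies from the mean."""
    return (x - mean) / sd

def percentile(z: float) -> float:
    """Percentage of people scoring below z under a normal distribution."""
    return NormalDist().cdf(z) * 100

# IQ scale: mean 100, SD 15. An IQ of 85 is one SD below the mean.
z = z_score(85, 100, 15)
print(f"z = {z:.2f}, percentile = {percentile(z):.0f}")  # z = -1.00, ~16th
```

The printed values match the corresponding row of Table 1.8.3.3 (z = −1.00, 16th percentile, IQ 85).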

A knowledge of percentile scores can help to decide to which category a patient may belong. For example, if a patient completes a token test of dysphasia and scores at the 5th percentile for normal controls and the 63rd percentile for a group of dysphasics, the score is clearly more typical of the dysphasic group.

However, in clinical practice it is often not just a comparison with others that is needed, but a comparison between two of the patient’s own scores. For example, verbal IQ might seem depressed
in comparison with spatial IQ, or the patient’s memory quotient might seem too low for his or her IQ. These are known as difference scores, and their analysis is a crucial part of the statistical analysis of a patient’s profile. There are two key concepts: the reliability of difference scores and the abnormality of difference scores. Failure to distinguish between these two leads to all manner of erroneous conclusions. In brief, a reliable difference is one that is unlikely to be due to chance factors, so that if the person were to be retested then the difference would again be found. If the test is very reliable (see the previous discussion of reliability), even a small difference score may be reliable. As a concrete example, the manual of the Wechsler Adult Intelligence Scale—Third Edition (WAIS-III)(1) indicates that a difference of about nine points between verbal IQ and performance IQ is statistically reliable at the 95 per cent level of certainty.








Table 1.8.3.3 z-scores, percentiles, IQ scores, and descriptions

z-score   Percentile   IQ     Description
-2.00     2.5th        70     Scores below the 2.5th percentile are deficient or in the mentally retarded range
-1.67     5th          75
-1.33     10th         80     Scores between the 2.5th and 10th percentile are borderline
-1.00     16th         85
-0.67     25th         90     Scores between the 10th and 25th percentile are low average
-0.33     37th         95
0.00      50th         100    The mean score
+0.33     63rd         105
+0.67     75th         110    Scores between the 25th and 75th percentile are in the average range
+1.00     84th         115
+1.33     90th         120    Scores between the 75th and 90th percentile are high average
                       120+   Scores over the 90th percentile are superior


However, although a difference of this size would be reliable, this does not necessarily mean that it is abnormal and therefore indicative of pathology. The abnormality of a difference score is the percentage of the general population that has a difference score of this size or greater. Published tables(2) show that 18 per cent of adults have a discrepancy of at least 10 points between verbal and performance IQ, so a difference of 10 points is not at all unusual. In fact, to obtain an abnormal difference between verbal and performance IQ the discrepancy has to be of the order of 22 points for adults and 26 points for children (i.e. less than 5 per cent of adults or children have discrepancy scores of this size).
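The reliable-difference side of this distinction follows directly from the SEMs of the two scales being compared. The sketch below uses illustrative SEM values chosen to land near the nine-point figure quoted from the WAIS-III manual; note that the abnormality of a difference cannot be computed this way and must come from population tables.

```python
import math

def reliable_difference(sem_a: float, sem_b: float, z_crit: float = 1.96) -> float:
    """Smallest difference between two scores unlikely (p < .05) to be chance.

    The standard error of a difference between two independent scores is
    sqrt(SEM_a^2 + SEM_b^2); multiplying by 1.96 gives the 95 per cent
    critical value.
    """
    return z_crit * math.sqrt(sem_a ** 2 + sem_b ** 2)

# Illustrative SEMs in the region reported for Wechsler IQ scales:
print(round(reliable_difference(3.0, 3.5), 1))  # about 9 points
```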

Having introduced the basic concepts of psychometric assessment, this is an appropriate point, prior to the description of specific tests, at which to summarize the information that can (or should) be found in a typical test manual, and this is set out in Table 1.8.3.4.


Tests of cognitive and neuropsychological functioning


General ability and intelligence

A very useful broad screening test, especially when it is suspected that mental functions are severely compromised, is the Mini-Mental State Examination.(3,4) It is brief, to the point, and can be repeated over time to gauge change. It measures general orientation in time and place, basic naming, language and memory functions, and basic non-verbal skills, and has good norms, with appropriate adjustment for age, for the middle and older age ranges, especially the elderly. The maximum score is 30, and a score of 24 or less raises the possibility of dementia in older persons, especially if they have had nine or more years of education (a score of 24 is at about the 10th percentile for people aged 65 and older).

However, the Mini-Mental State Examination is only a screening test, and the presence or nature of cognitive impairment cannot be diagnosed on the basis of this test alone. A detailed cognitive assessment is provided by the Wechsler scales, i.e. the Wechsler Adult Intelligence Scale—Third Edition UK Version (WAIS-IIIUK),(1) the Wechsler Intelligence Scale for Children—IV UK Version (WISC-IVUK),(5) or the Wechsler Preschool and Primary Scale of Intelligence—Third Edition (WPPSI-III).(6) Outlines of the WAIS-IIIUK and WISC-IVUK are given in Table 1.8.3.5.

IQ scores themselves are very broad measures, drawing upon a wide range of functions. This means not only that the scores are very stable (reliable), but also that the IQ score is relatively insensitive to anything except quite gross brain damage. Rather, a careful analysis of subtest scores is needed, always bearing in mind the concepts of reliability and abnormality of difference scores. For example, it takes a subtest range of 11 to 12 points to be considered abnormal (i.e. found in less than 5 per cent of people) on the WAIS-IIIUK and the WISC-IVUK.








Table 1.8.3.4 What to expect in a good test manual

Theory: The history of the development of the concept and earlier versions of the test; the nature of the construct and the purpose of measuring it.

Standardization: Characteristics of the standardization sample, how the sampling was carried out, and how well these characteristics match those of the general population; similar data on any criterion groups; similar data for each age range if the test is for children.

Administration: How to administer the test in a standard fashion so as to minimize variability of administration as a factor in the error term.

Scoring: How to score the test, and criteria for awarding different scores, so as to minimize scorer error.

Statistical properties: Means and standard deviations of all groups; reliability coefficients and how they were obtained; validity measures and how they were derived; standard error of measurement; reliability of difference scores; abnormality of difference scores; other data on the scatter of subtest scores; scores of criterion groups.

Special considerations: Groups for whom the test is not suitable or less suitable, i.e. the range of convenience of the test; ceiling effects (at what point does the test begin to fail to discriminate between high scorers?); floor effects (at what point does the test begin to fail to discriminate between low scorers?).


Sometimes the patient may have a language disorder or English may not be his or her first language. In such circumstances Raven’s Progressive Matrices Test,(7) which is a non-verbal test of inductive reasoning (non-verbal in the sense that it requires no verbal instructions and no verbal or written answers), can be used. The present author avoids the new norms because they were not collected in the normal fashion (i.e. not in a formal test session under the direct supervision of a psychologist), but the old norms are good. The Matrices Test has the additional advantage of having an advanced version for people in the highest range of ability.(8) No non-English versions of the WAIS-IIIUK or the WISC-IVUK are available, but the non-verbal scores can be used with caution as there may be unexpected cross-cultural effects.


Speed of processing

Reasoning is not just about solving difficult problems, but also about solving them quickly; the difference between power and speed. IQ tests as above do have timed subtests sensitive to speed, but it can be useful to administer specific tests that are not quite so confounded with intellectual ability.

One example, particularly sensitive to even quite mild concussion, is the Paced Auditory Serial Addition Test (PASAT).(9,10) Here, the client is read a list of numbers, and as each one is read out it has to be added to the previous number and the answer spoken aloud (Table 1.8.3.6). This has to be done quickly or the next number will come along. There are several trials in which the numbers are delivered at a faster and faster pace, from one number every 2.4 s down to one every 1.2 s. It sounds easy but is in fact very demanding; even at the slowest speed the average score is only about 70 per cent correct, and this falls away to only about 40 per cent at the fastest speed. Indeed, if a patient has any significant mental slowing, they often cannot do the test at all. Obviously the test cannot be used if the patient has a stammer, or is dysarthric or innumerate.
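The PASAT response rule described above (each number is added to the number immediately before it, not to the previous answer) can be sketched as follows. The scoring helper is a simplification for illustration, not the published scoring procedure, and the stimulus digits are invented.

```python
def pasat_answers(digits: list[int]) -> list[int]:
    """Correct responses: each number added to the one immediately before it."""
    return [a + b for a, b in zip(digits, digits[1:])]

def pasat_score(digits: list[int], responses: list[int]) -> float:
    """Percentage of correct responses (simplified scoring sketch)."""
    answers = pasat_answers(digits)
    correct = sum(r == a for r, a in zip(responses, answers))
    return 100 * correct / len(answers)

stimuli = [3, 7, 2, 5, 8]
print(pasat_answers(stimuli))                # [10, 9, 7, 13]
print(pasat_score(stimuli, [10, 9, 8, 13]))  # 75.0: three of four correct
```

Note that a common error, adding each new number to one’s previous answer rather than to the previous stimulus, would produce a running total instead of the pairwise sums above.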








Table 1.8.3.5 Outline of the WAIS-IIIUK and WISC-IVUK

WAIS-IIIUK
Age range: 16-89 years
Verbal subtests: Vocabulary, Similarities, Arithmetic, Digit span, Information, Comprehension, Letter-number sequencing
Non-verbal or spatial subtests: Picture completion, Digit symbol, Block design, Matrix reasoning, Picture arrangement, Symbol search, Object assembly
IQ scores: Verbal IQ (VIQ), Performance IQ (PIQ), Full scale IQ (FSIQ)
Index scores: Verbal comprehension, Perceptual organization, Working memory, Processing speed
Mean IQ or index scores: 100 (SD of 15)
Mean subtest scores: 10 (SD of 3)
Test-retest reliability of IQ: 0.98 for Full scale IQ
Standard error of measurement of FSIQ: about 2.5, so all scores are about ±5 points(a)
Reliable differences (p < .05): about 9 points between VIQ and PIQ
Abnormal differences (p < .05): about 22 points between VIQ and PIQ
Validity: highly related to other tests of ability and to criteria related to ability

WISC-IVUK
Age range: 6.0-16.11 years
Verbal subtests: Similarities, Digit span, Vocabulary, Letter-number sequencing, Comprehension, Information, Arithmetic, Word reasoning
Non-verbal or spatial subtests: Block design, Picture concepts, Coding, Matrix reasoning, Symbol search, Picture completion, Cancellation
IQ score: Full scale IQ (FSIQ)
Index scores: Verbal comprehension, Perceptual reasoning, Freedom from distractibility, Processing speed
Mean IQ or index scores: 100 (SD of 15)
Mean subtest scores: 10 (SD of 3)
Test-retest reliability of IQ: 0.97 for Full scale IQ
Standard error of measurement of FSIQ: about 2.68, so all scores are about ±5 points
Reliable differences (p < .05): about 11 points between VCI and PRI
Abnormal differences (p < .05): about 26 points between VCI and PRI
Validity: highly related to other tests of ability and to criterion groups

(a) 95% of the time, true scores are the observed score ±1.96 SEM. In other words, the likely true score is within the range defined by about 2 SEMs either side of the score obtained.
