Principles of assessment
Assessment, testing, or measurement is the evaluation of the individual in numerical or categorical terms, adhering to a range of statistical and psychometric principles. Examples of measurement are assigning people or behaviour to categories, using scales to obtain self-ratings or self-reports, using tests of ability and performance, and collecting psychophysiological readings. Even diagnosis is a form of measurement and should have various psychometric properties, such as satisfactory reliability and validity. In this chapter we concentrate on cognitive or neuropsychological assessment, which typically employs standardized psychometric tests, but it is axiomatic that the basic principles are applicable to all forms of measurement without exception. For example, stating that a patient does or does not have a symptom is potentially just as much of a measurement as stating his or her IQ. It should be noted that this account is of English-language tests; readers elsewhere should note the principles but ask local psychologists which tests they use.
Psychometric tests aim to measure a real quantity—the degree to which an individual possesses or does not possess some feature or trait, such as social anxiety, spelling ability, or spatial memory. This real quantity is known in classical test theory as the true score t, and the score that is actually obtained on the given test is the observed score x. The observed score is assumed to be a function of two values, the true score plus a certain amount of error e, because no test is perfect. Therefore we have the most basic equation in psychometrics: x = t + e. The statistical aim of psychometric measurement is to keep the error term to a minimum, so that the observed score comes as close as possible to the true score; the two would be equal only if the error term were zero. Of course, this is never achieved in practice, but the error term can be minimized by making the test as reliable as possible, where reliability is simply the notion that the test gives the same answer twice.
In practice, of course, if a test were repeated many times, each occasion would give a slightly different result, depending on how the person felt, the precise way questions were asked, the details of how answers were scored, or whether there has been any lucky guessing. In other words, observed scores would cluster around the true score. Like the distribution of any variable, the distribution of observed scores would have a mean and a standard deviation. The mean is obviously the true score, and this standard deviation is called the standard error of measurement (SEM). The aim of a good test is to keep the SEM as near as possible to zero, and test manuals should state the actual SEM.
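To make this concrete, the short simulation below (in Python) draws many hypothetical retests of a single person from the classical model x = t + e. The true score of 50, the SEM of 3, and the number of retests are invented purely for illustration; the point is simply that the observed scores cluster around the true score with a spread equal to the SEM.

```python
# A minimal simulation of the classical model x = t + e. The true score
# (50), the SEM (3), and the number of retests are invented values for
# illustration only.
import numpy as np

rng = np.random.default_rng(seed=42)

true_score = 50        # t: the quantity the test is trying to measure
sem = 3                # standard error of measurement
n_retests = 10_000     # imagine administering the test many times

# Each administration adds a random error term e to the true score.
observed = true_score + rng.normal(loc=0.0, scale=sem, size=n_retests)

print(f"mean of observed scores: {observed.mean():.2f}")  # close to 50, the true score
print(f"SD of observed scores:   {observed.std():.2f}")   # close to 3, the SEM
```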
There is a relationship between SEM and the reliability of the test:
SEM = SD√(1 − r11)
where SD is the standard deviation of the test and r11 is the test-retest reliability of the test (expressed as a correlation coefficient ranging from −1 to +1). If the reliability of the test is perfect (+1), the SEM will be zero, as can be seen:
SEM = SD√(1 − 1) = SD√0 = 0.
Thus a test should be as reliable as possible because then the observed score will be the true score and the standard error of measurement will be zero.
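As a worked illustration of this formula, the sketch below computes the SEM for a hypothetical IQ-style scale with SD = 15; the reliability of 0.90 is an assumed value for demonstration, not a figure quoted from any particular test manual.

```python
# Worked example of SEM = SD * sqrt(1 - r11). The SD of 15 follows the IQ
# convention; the reliability of 0.90 is an assumed value for illustration.
from math import sqrt

def standard_error_of_measurement(sd: float, r11: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - r11)."""
    return sd * sqrt(1.0 - r11)

print(standard_error_of_measurement(sd=15, r11=0.90))  # about 4.74
print(standard_error_of_measurement(sd=15, r11=1.00))  # 0.0 for a perfectly reliable test
```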
An unreliable test is always useless, but once adequate reliability has been achieved it is worth considering what the test score actually measures. The degree to which a test measures what it is supposed to measure is known as validity. There may be various threats to validity. For example, a test of numeracy may be so stressful that scores depend heavily on the patient’s anxiety level rather than on his or her ability, or a test of social comprehension may have questions which are culturally biased, so that scores depend in part upon the person’s ethnic background.
In practice, there are various types of reliability and validity, and these are summarized in Tables 1.8.3.1 and 1.8.3.2. Further discussion can be found in Kline.(1)
Having used a reliable and valid test, the next issue is how the numbers are analysed and expressed. It has to be noted first that there are three types of scale of measurement. A nominal scale is one in which numbers are assigned to various categories simply to label those categories in a manner suitable for entry onto a computer database; the categories bear no logical numerical relationship to each other. Examples would be marital status, ethnic background, or whether or not one’s parents were divorced. Nominal scales are used to split people into groups, and all statistics are based on the frequency of people in each group. The relationship or association between groups can be examined using χ² statistics, for example to test whether there is a relationship between being divorced and having parents who divorced (illustrated in the sketch following this paragraph). Next there is an ordinal scale, in which larger numbers indicate greater possession of the property in question. Rather like the order of winning a race, no assumptions are made about the magnitude of the difference between any two scale points; it does not matter whether the race is won by an inch or a mile. Ordinal scales allow people to be rank ordered, and such scales can be subjected to non-parametric statistical analysis (the branch of statistics that makes minimal assumptions about intervals and distributions), including the comparison of medians and distributions and the computation of certain correlation coefficients. Finally comes the interval scale, in which each scale point is a fixed interval from the previous one, like height or speed. The types of test described in this chapter for the most part aspire to be interval scales, allowing use of the full range of parametric statistics (which assume equal intervals and normally distributed variables).
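Returning to the nominal-scale example above, the following sketch runs a χ² test of association between being divorced and having divorced parents. The 2 × 2 table of frequencies is invented for the purpose of illustration.

```python
# A chi-squared test of association on nominal data: is being divorced
# related to having divorced parents? The 2 x 2 frequency table is
# invented for this sketch.
from scipy.stats import chi2_contingency

#           parents divorced   parents not divorced
observed = [[40,                60],     # respondent divorced
            [40,                160]]    # respondent not divorced

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")  # p < 0.05 suggests an association
```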
Having obtained a test score for someone, that score then has to be interpreted in the light of how the general population or various patient groups generally perform on that test. There are two general characteristics of a scale that have to be remembered. The first is the measure of central tendency. Typically one would consider the mean (the arithmetic average), but it is also sometimes useful to consider the median (the middle score) and the mode (the most frequently obtained score). This will be the first hint as to whether the score is normal or whether it is more typical of one group than another. However, in order to gauge precisely how typical a given score is, it is necessary to take into account the standard deviation (SD) of the test (other measures relating to the dispersion of test scores, such as the range or skew, can be considered but are not of such immediate relevance).
As long as the mean and SD of the test are known, it is possible to work out exactly what percentage of people obtain up to the observed score x. This is done by converting the observed score into a standard score z and converting the z-score to a percentile. A standard score is simply the number of SDs away from the mean m, and it can take both negative and positive values (because an observed score can be either below or above the mean, respectively). In other words, z = (x − m)/SD. For reference, Table 1.8.3.3 gives some of the main values of z and the percentage of people who score up to each value. It is this percentage that is known as the percentile, and it is obtained from statistical tables. For example, a score at the 25th percentile means that 25 per cent of people score lower than that specific score. The 50th percentile corresponds to the mean of the test (assuming scores are normally distributed). For illustration, the equivalent IQ scores (IQ scores have a mean of 100 and SD of 15) and broad verbal descriptors are also given in Table 1.8.3.3.
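The conversion from observed score to z-score to percentile can be done in a few lines, as sketched below. It uses the IQ convention quoted above (mean 100, SD 15) and a hypothetical patient score of 85, and assumes normally distributed scores.

```python
# Converting an observed score to a z-score and then to a percentile,
# assuming normally distributed scores. Mean 100 and SD 15 follow the IQ
# convention quoted above; the patient score of 85 is hypothetical.
from scipy.stats import norm

mean, sd = 100, 15
score = 85

z = (score - mean) / sd          # z = (x - m)/SD
percentile = 100 * norm.cdf(z)   # percentage of people scoring up to x

print(f"z = {z:.2f}, percentile = {percentile:.1f}")  # z = -1.00, roughly the 16th percentile
```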
A knowledge of percentile scores can help to decide to which category a patient may belong. For example, if a patient completes a token test of dysphasia and scores at the 5th percentile for normal controls and the 63rd percentile for a group of dysphasics, the score is clearly more typical of the dysphasic group.
However, in clinical practice it is often not just a comparison with others that is needed, but a comparison between two of the patient’s own scores. For example, verbal IQ might seem depressed in comparison with spatial IQ, or the patient’s memory quotient might seem too low for his or her IQ. These are known as difference scores, and their analysis is a crucial part of the statistical analysis of a patient’s profile. There are two key concepts: the reliability of difference scores and the abnormality of difference scores. Failure to distinguish between these two leads to all manner of erroneous conclusions. In brief, a reliable difference is one that is unlikely to be due to chance factors, so that if the person were to be retested then the difference would again be found. If the test is very reliable (see the previous discussion of reliability), even a small difference score may be reliable. As a concrete example, the manual of the Wechsler Adult Intelligence Scale—Third Edition (WAIS-III)(1) indicates that a difference of about nine points between verbal IQ and performance IQ is statistically reliable at the 95 per cent level of certainty.
However, although a difference of this size would be reliable, this does not necessarily mean that it is abnormal and therefore indicative of pathology. The abnormality of a difference score is the percentage of the general population that has a difference score of this size or greater. Published tables(2) show that 18 per cent of adults have a discrepancy of at least 10 points between verbal and performance IQ, so a difference of 10 points is not at all unusual. In fact, to obtain an abnormal difference between verbal and performance IQ the discrepancy has to be of the order of 22 points for adults and 26 points for children (i.e. less than 5 per cent of adults or children have discrepancy scores of this size).
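The two concepts can be made explicit with standard classical-test-theory formulas, assuming both scales share the same SD: the reliability of a difference depends on the reliabilities of the two scales, whereas its abnormality depends on the correlation between them. In the sketch below, the reliabilities (0.97 and 0.94) and the inter-test correlation (0.75) are illustrative assumptions, not figures taken from the WAIS-III manual, although they yield results of the same order as those quoted above.

```python
# Reliability versus abnormality of a difference score, using standard
# classical-test-theory formulas and assuming both scales have SD = 15.
# The reliabilities (0.97, 0.94) and the inter-test correlation (0.75)
# are illustrative assumptions, not figures from the WAIS-III manual.
from math import sqrt
from scipy.stats import norm

sd = 15                      # SD shared by both IQ scales
r_xx, r_yy = 0.97, 0.94      # assumed reliabilities of the two scales
r_xy = 0.75                  # assumed correlation between the two scales

# Reliability: how big must a difference be before it is unlikely to be
# mere measurement error? (95 per cent level)
se_diff = sd * sqrt(2 - r_xx - r_yy)
print(f"reliable difference: > {1.96 * se_diff:.1f} points")   # ~8.8, cf. the nine quoted above

# Abnormality: what percentage of the population shows a discrepancy of
# at least 22 points (in either direction)?
sd_diff = sd * sqrt(2 * (1 - r_xy))
pct = 2 * (1 - norm.cdf(22 / sd_diff)) * 100
print(f"adults with a 22+ point discrepancy: {pct:.1f}%")      # ~3.8, i.e. under 5 per cent
```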
Having introduced the basic concepts of psychometric assessment, this is an appropriate point, prior to the description of specific tests, at which to summarize the information that can (or should) be found in a typical test manual, and this is set out in Table 1.8.3.4.