CASE ILLUSTRATION
In compliance with the recommendation for yearly alcohol screening for adults in primary care, a family medicine residency director wants to develop a curriculum to teach her residents how to use a validated alcohol-screening tool and provide brief alcohol counseling. Given competing educational demands, she wants this curriculum to be both efficient and effective. She plans to evaluate her residents and the curriculum.
INTRODUCTION
Medical training programs are increasingly focused on assessing clinical competencies that demonstrate a learner’s readiness for (or progression toward) independent practice. Training programs must track a multitude of competencies or “entrustable professional activities” (EPAs) that often include both concrete, observable procedures and more nuanced but essential doctoring skills, such as empathy and cultural sensitivity. Although the general principles of learner and curricular assessment hold equally true for the behavioral and social sciences (BSS), these content areas can be more difficult to operationalize and evaluate. Moreover, high-quality BSS teaching and assessment tools often require explicit, robust institutional support, which in turn relies on careful programmatic evaluations comparing institutional goals with curricular performance. Although evaluations assess a learner and/or program, the results should also iteratively drive curricular and institutional evolution.
This chapter provides guidance on designing evaluation strategies for assessing both medical learners and curricular programs, with an eye toward continuous improvement. In the section “Assessment and Evaluation Planning,” we outline how to plan assessments, including the processes, methods, and tools used in evaluation design. In the section “Learner Assessment and Evaluation,” we describe assessment methods and instruments that can help ensure learners have the skills needed to address behavioral and social issues that influence health. In the section “Program Evaluation,” we address how measurements can be used to assess curricular and program performance. Lastly, in the section “Educational Research and Scholarship,” we underscore the importance of educational scholarship and research as a means of moving the science of educational evaluation forward.
Several fundamental principles have shaped the content of each section. First, evaluation is as important as the training curriculum itself; it should not be an afterthought planned only once the curriculum has already been delivered. Second, just as we urge our learners to practice evidence-based medicine, we urge educators to practice evidence-guided teaching. Although educational science is young, there is a growing body of research to guide the selection of teaching interventions, curricular timing and “dose,” and the choice of valid and reliable assessment tools. Educational tradition and convenience are simply not sufficient to guide efficient and effective use of precious training hours. Finally, we assert that rigorously developed, delivered, and evaluated curricula should be published and shared as scholarly peer-reviewed works to advance both educators’ skills and the field.
ASSESSMENT AND EVALUATION PLANNING
Evaluation designs are critical for determining the impact of a curriculum or teaching approach on learners’ knowledge, attitudes/values, and skills. Evaluation design selection might also consider cost effectiveness, systems change, or impact on patient or clinical outcomes. Implementing a design to test established or new curricular strategies takes careful planning and attention to measurement. The best approach is to develop the assessment(s) as the curriculum or educational program is being developed, rather than afterward. Too often, evaluations are planned at the end, when opportunities to revise the curriculum for meaningful assessment of learner and program outcomes have passed. Early evaluation planning will ensure that programmatic goals have been articulated, learning objectives are measurable, the best approaches for evaluation have been identified, and the processes for evaluation are timed for best results. Although the planning process will vary based on resources, expertise, and other contextual factors, the following four questions highlight common issues to consider during evaluation planning.
What is the goal of the evaluation? All educators assume their training program “works,” but the specific purpose of the education may vary. For example, does the intervention target changes in attitudes, knowledge, or skills? Does the program intend to demonstrate minimally sufficient competence or extraordinary talent and achievement? Will the results be formative or summative? Will the data be used to influence institutional policy or to identify strengths or weaknesses of the program?
Which theoretical models will best guide evaluation designs and specific curricular content? For example, an educator interested in determining “competence” could use the Dreyfus model of skill acquisition, which starts at novice and progresses through advanced beginner, competent, and proficient to expert. In this case, assessment tools that target a determined threshold for “competence,” along with benchmarked steps leading to that threshold, would be needed. Other models that guide evaluation development include Miller’s pyramid (knows, knows how, shows how, does), Kirkpatrick’s hierarchy of evaluation (which assesses the curriculum’s level of impact, ranging from learner satisfaction to changes in clinical outcomes), and Bloom’s taxonomy. Models that guide educational content might include the Transtheoretical (Stages of Change) Model and motivational interviewing, which matches behavioral counseling to patients’ readiness to change (see Chapter 19). In the opening case, a demonstration of “competence” would combine the general Dreyfus model with specific assessment items relevant to motivational interviewing.
What are the available evaluation resources? Resources include time (for faculty and/or learners), money, faculty/evaluator skills, and buy-in from program leaders and learners. Assessments may range from time-, cost-, and labor-intensive standardized patient (SP) assessments to quicker and cheaper peer-to-peer observations. Careful attention must be given to the test characteristics of the assessment tools chosen and whether feasibility compromises the utility of the results.
What are the key methodological considerations? Even with clear goals and sufficient resources, critical decisions must be made that can affect the value of evaluations. As discussed below, psychometric properties such as reliability, validity, and fidelity of instruments must be considered. Standard setting, benchmarking, scoring, and remediation should also be determined. All four key considerations are applied to the opening Case Illustration in Table 43-1.
Table 43-1. Evaluation planning applied to the opening case illustration.

1. Goals of the Curriculum | 2. Theoretical Model(s) | 3. Resource Considerations | 4. Methods: Tools | 4. Methods: Timing |
---|---|---|---|---|
To produce lasting changes in resident knowledge and skills regarding alcohol screening and brief intervention. | – Miller’s pyramid – National Institute on Alcohol Abuse and Alcoholism (NIAAA) clinician’s guide to patients who drink to excess – Stages of change and motivational interviewing | – 3 hours of large-group didactic time – 4 hours of small-group discussion or role play – Use of three SP cases about alcohol use | – SP examinations – Written test – Survey – Focus groups | – Pre-curriculum knowledge test – Post-curriculum SP examination – Post-curriculum written examination – Post-curriculum survey and focus group |
Concepts of reliability and validity are central to designing and implementing high-stakes assessments. The higher the stakes associated with the assessment (e.g., passing or failing a course/clerkship), the greater the need for the assessment to be valid and reliable. In very simple terms, validity refers to whether the assessment measures what it intends to measure. A common misconception is that assessment instruments can be considered universally valid, when, in fact, validity can only be shown for a given population and context. Validating an assessment consists of evaluating potential threats to validity to ensure that the assessment actually represents the intended construct as applied in the intended setting. Common threats to validity include construct underrepresentation (e.g., too few items or cases) and construct-irrelevant variance (e.g., flawed rating scales, poorly trained SPs).
Reliability in educational assessment is a measure of reproducibility. Statistical methods of measuring reliability estimate the difference between the “true score” and the “observed score,” that is, the amount of error in the observed score. The greater the measurement error, the less reproducible the assessment. Three different questions can be asked about the reproducibility of a test if it were given to the same group of students more than once: (1) Would the same students pass or fail? (2) Would the rank order from best to worst score be the same? (3) Would all the students receive the same scores? Three statistical frameworks can be used to assess reliability: classical test theory (using Cronbach’s alpha), generalizability theory, and item response theory. A psychometrics consultation can often answer these and many other key evaluation development questions.
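To make the classical test theory approach concrete, the sketch below (not drawn from the chapter; the data, function name, and scores are hypothetical) shows how Cronbach’s alpha could be computed from a small matrix of item scores, using the standard formula: alpha = (k/(k−1)) × (1 − sum of item variances / variance of total scores).

```python
# A minimal illustrative sketch (hypothetical data): Cronbach's alpha for a
# small set of item scores, one row per student and one column per item.

def cronbach_alpha(item_scores):
    """item_scores: one list per student, one score per item."""
    k = len(item_scores[0])                      # number of items
    def variance(values):                        # population variance
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    item_variances = [variance([student[i] for student in item_scores])
                      for i in range(k)]
    total_variance = variance([sum(student) for student in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Five students rated on four checklist items (e.g., an SP communication checklist)
scores = [
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
    [4, 4, 4, 4],
]
print(round(cronbach_alpha(scores), 2))  # closer to 1.0 = less measurement error
```

Values closer to 1.0 indicate less measurement error; a common rule of thumb looks for alpha of roughly 0.7 or higher, although the appropriate threshold depends on the stakes and context of the assessment.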
Standard setting refers to the process of determining a cut score or passing score. For traditional assessments, such as multiple-choice examinations, standard setting techniques are well developed. For any performance-based assessments to be effectively used for summative decisions, setting defensible standards is necessary. Although standard setting may require attention to psychometrics, ultimately it incorporates consideration of institutional culture, policy, and resources.
Determining the passing score means determining how many students are going to fail. The “stakes” or ramifications of failing an assessment (e.g., repeating a course, need for remediation, ability to graduate) will play the largest role in determining how to set the passing score. Two broad categories of standard setting are normative (grading on a curve) and absolute or criterion-referenced. Normative grading involves determining how a cohort of learners performed and then setting a cut-point at a multiple of standard deviations below the mean. Criterion-referenced standard setting is more work intensive but usually favored for performance-based assessment, as it is based on how each learner performs against a determined competency standard. Grading can also be compensatory, allowing a learner to offset low performance in one domain or case with higher performance in another, or noncompensatory; noncompensatory grading will result in a greater number of students failing the assessment. For example, if students need only pass a certain number of cases to pass an Objective Structured Clinical Examination (OSCE), they can compensate for poor history taking with outstanding interpersonal skills. If, instead, students must pass both the history-taking and interpersonal components of each case, passing will be more difficult.
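As a concrete illustration of that distinction, the short sketch below (hypothetical scores, cut score, and case-count rule, not drawn from any particular examination) contrasts compensatory and noncompensatory pass decisions for a three-case OSCE in which each case yields a history-taking and an interpersonal-skills score.

```python
# Hypothetical OSCE results: (history-taking, interpersonal skills) per case.
cases = [(85, 60), (75, 90), (65, 95)]

PASS_MARK = 70      # assumed criterion-referenced cut score
CASES_NEEDED = 2    # assumed number of cases that must be passed overall

# Compensatory: a strong score in one domain can offset a weak score in the
# other, so a case is passed if its average clears the cut score.
cases_passed = sum(1 for hx, ips in cases if (hx + ips) / 2 >= PASS_MARK)
passes_compensatory = cases_passed >= CASES_NEEDED

# Noncompensatory: every domain of every case must clear the cut score.
passes_noncompensatory = all(hx >= PASS_MARK and ips >= PASS_MARK
                             for hx, ips in cases)

print(passes_compensatory)     # True  (all three case averages clear 70)
print(passes_noncompensatory)  # False (interpersonal on case 1 and history on case 3 fall short)
```

The same scores thus pass under a compensatory rule and fail under a noncompensatory one, which is why noncompensatory grading generally produces more failures and demands particularly defensible standard setting.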
LEARNER ASSESSMENT AND EVALUATION
Ideally, learners (and programs) will be assessed at multiple points, with performance data used to iteratively drive improvements. In a competency-based model, a threshold is set that all learners must achieve; however, benchmarks or milestones on the path toward competence can be used to track learner progression toward more advanced skills over time. Programs should articulate skill development pathways that repeatedly assess learners and offer remediation as needed.
A number of validated assessment tools are available for the BSS (see Suggested Readings and resource links). Table 43-2 provides examples of specific tools that assess core BSS areas such as social attitudes and behavior change counseling. Evaluators may select validated tools or develop their own based on their specific assessment needs. Remember that assessment instruments are not universally valid or reliable. Although it is helpful to review the psychometric properties of tools as they have been previously studied, it will still be necessary to evaluate any tool’s performance when used with learners in the context of your assessment. Assessment goals (e.g., changing attitudes, knowledge, or skills), available resources, and preferred methodologies all inform the selection of measurement tools.
Table 43-2. Examples of validated assessment tools for core BSS content areas.

Assessment Tool | Description | Where to Find It |
---|---|---|
1. Attitudes Toward Social Issues in Medicine | A 63-item Likert-type attitude survey with seven subscales including social factors, interprofessionalism, and prevention. | Parlow & Rothman. J Med Educ 1974;49:385–387. |
2. Video Assessment of Simulated Encounters (VASE-R) |