Chapter 9

Acoustic voice analysis was first applied to neurologic disease early in the previous century.1 Over the last few decades, however, the proliferation of microcomputer-based technology has led to widespread applications of acoustical analysis methods to neurologically disordered voices. As represented below, the majority of these studies have employed measures of vocal fundamental frequency (f0; the primary physical basis for perceived pitch), sound pressure (the primary basis for perceived loudness), and measures of cycle-to-cycle perturbations in frequency and amplitude in efforts to diagnose and assess neurologic voice disorders, and to track treatment efficacy. A smaller number of reports have also used similar measures to seek early identification of the disorders and to track disease progression. These types of acoustic research are contributing significantly toward the current development of evidence-based practice. In addition, acoustic methods contribute to intervention in the form of biofeedback signals that are used as integral components of treatment delivery.

It nonetheless seems that vocal acoustic analysis techniques, as nonintrusive yet highly sensitive signs of neurologic status, are on the verge of far greater ranges of utility and acceptance. The tools for such techniques are relatively inexpensive, requiring at minimum only a microphone, microcomputer, and software (some of which is even free). Several factors may still limit the range of applications: (1) Training and expertise in the appropriate and valid processing of acoustic measures remain limited, even among speech-language pathologists, and reasoning from acoustic measure to neurophysiology is indeed fraught with many complicated inferential possibilities. 
(2) Too often it is assumed that acoustic measures are only confirmatory of the diagnostician’s perceptions; hence, they are considered secondary or even redundant, even though acoustic analysis can offer far greater potential for discovering and quantifying phenomena that may be difficult to perceive, that are preclinical, or that are even subliminal. (3) Analysis algorithms have traditionally relied on the automatic detection of fundamental frequency as the “normative” basis from which deviations are assessed, but such detection is highly problematic in abnormal voices. (4) Most current acoustic techniques fail to target vocal phenomena that can be clearly associated with neurophysiologic status. (5) Most such techniques are applicable to samples, such as sustained vowel phonation, that may or may not represent the in situ impairments as they affect spontaneous connected speech. This chapter does not attempt to address all of these limitations, but rather introduces the neurologist, otolaryngologist, or speech-language pathologist to some resources, techniques, and ideas that should lead to improved understanding of laryngeal status via acoustic analysis. Any measurement strategy is only as good as the user’s theoretical understanding of the object of measurement. This basic epistemologic problem is compounded many-fold in the use of acoustic measures of voice to infer neurologic status because a long chain of inference is required. 
For example, in the computerized analysis of the class of cycle-to-cycle period perturbations classified under the rubric of “jitter,” before direct contemplation of neurologic status, the analyst should consider (1) the nature of the algorithm used to calculate jitter, (2) any limitations imposed by the digital representation of the signal, (3) the quality of the recording equipment and environment, (4) the task performed and its elicitation, (5) the acoustic effects of vocal tract resonances that will have transformed the original laryngeal output, (6) the aerodynamic principles by which glottal flow yields an acoustic product, (7) the interaction of vocal fold movement with aerodynamic forces, (8) nonmuscular tissue characteristics of the vocal folds, and (9) the variety and complex interactions among muscle activations affecting vocal fold configuration. Each of these considerations is vital to a thorough application of acoustic analysis; although this chapter cannot provide detailed information on each of these aspects, the reader can consult a wealth of resources for expanded information.2–14 The following section provides some practical advice on acquisition and analysis of acoustic voice signals, with special attention to considerations such as those listed above. Figure 9.1 depicts some elements of a desirable recording configuration for acoustic voice evaluation, with an emphasis on requirements for meaningful representation of the sound pressure level (SPL) of vocal output. 
This acoustic parameter is especially important in neurologic disorders because (1) it can be of central importance in relation to many other performance variables in an assessment protocol,15,16 and (2) other acoustic parameters may systematically vary in association with SPL.17–19 As detailed in the following discussion, the figure represents a relatively simple and inexpensive protocol that approximates more elaborate procedures described elsewhere.20 For samples to be readily compared, it is important that the microphone be maintained at a consistent distance from the speaker’s mouth, but far enough away from oral and nasal airstreams so as not to be perturbed by aerodynamic effects. Head-worn microphones are especially well suited for these reasons, but they should be of high quality with a broad and flat frequency response.21 The inverse square law, under which sound intensity falls with the square of the distance from the source (about 6 dB of SPL per doubling of distance in a free field), predicts an especially acute sensitivity to variations in microphone-to-mouth distance when the microphone is close to the mouth. Other factors, such as gain levels on recording devices, inevitably vary from session to session and patient to patient, and it is advisable to optimize gain settings during each recording to accommodate louder or softer voices. It is therefore not only impossible but also undesirable to maintain consistent settings for all the factors that affect the actual amplitude of a recorded signal, necessitating the use of a recorded signal of known intensity against which to calibrate the recorded signals. Figure 9.1 depicts the use of a calibration tone-generator with an output speaker that is ideally positioned in close proximity to the patient’s mouth. 
Just prior to the patient’s vocal productions, the elicitor should take a reading of this calibration tone from an SPL meter, again taking care to position this meter at a fixed and consistent distance from the output speaker for all samples to be compared (i.e., across sessions and/or patients). Community standards have come to consider relatively inexpensive SPL meters to be acceptable for clinical purposes. The waveforms and accompanying text labels in Fig. 9.1 indicate how the three resources produced by this protocol are assembled to allow the actual recorded tasks (which, in their raw computer-stored form, will be on a linear volts scale) to be placed on a standard decibel (dB) SPL metric: (1) the integrated (root-mean-square [RMS]) voltage of the calibration tone is measured from its recording, (2) the RMS quantities of interest are measured from the recorded voice samples and placed on a dB scale relative to the calibration tone (20 × log10(RMSvoice/RMScal)), and (3) the dB value observed on the SPL meter during the calibration tone presentation is added to the dB value or values produced in step 2. A completely different source of variation that may interfere with valid comparisons between sessions comes from patients and their varying construals or performances of the desired task. Indeed, general issues of task selection and performance merit far more consideration than can be included in this chapter, affecting not just the value and meaning of acoustic parameters but the general interpretive goals of phonatory assessments. These issues are all too often overlooked and should invoke careful consideration by otolaryngologists and speech-language pathologists. The most important consideration in general acoustic terms is the fact that most acoustic measures will co-vary to some degree with the basic parameters of f0 and SPL. 
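The three calibration steps described above lend themselves to a short computational sketch. The Python fragment below is illustrative only: the function names (`rms`, `voice_db_spl`, `spl_change_db`) and the sample values are assumptions introduced here, not part of any published protocol or software package, and the distance helper assumes idealized free-field (inverse square) behavior that a real recording room will only approximate.

```python
import math


def rms(samples):
    """Root-mean-square of a sequence of linear (volts-scale) samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def voice_db_spl(voice_samples, cal_samples, cal_meter_db):
    """Place a recorded voice sample on a dB SPL scale.

    cal_meter_db is the SPL meter reading taken while the calibration
    tone was playing (hypothetical parameter name).
    """
    rms_cal = rms(cal_samples)                             # step 1
    rms_voice = rms(voice_samples)                         # step 2
    relative_db = 20.0 * math.log10(rms_voice / rms_cal)   # step 2, in dB
    return relative_db + cal_meter_db                      # step 3


def spl_change_db(ref_distance, new_distance):
    """Idealized free-field SPL change when the microphone moves from
    ref_distance to new_distance (same units); positive means louder."""
    return 20.0 * math.log10(ref_distance / new_distance)


if __name__ == "__main__":
    # Synthetic stand-ins for recordings: a 1 kHz tone at 44.1 kHz,
    # and a "voice" with exactly half the calibration amplitude.
    cal = [0.1 * math.sin(2 * math.pi * 1000 * n / 44100) for n in range(4410)]
    voice = [0.5 * x for x in cal]
    print(round(voice_db_spl(voice, cal, cal_meter_db=80.0), 1))  # 74.0
    print(round(spl_change_db(0.08, 0.04), 1))                    # 6.0
```

Because the synthetic voice sample has half the RMS amplitude of the calibration tone, it falls about 6 dB below the meter reading. Halving the microphone distance would shift the detected level by a similar amount under the free-field assumption, which illustrates why a fixed microphone-to-mouth distance matters.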
The voice range profile (or “phonetogram”), by which the client’s abilities to vary intensity across the entire range of physiologically available fundamental frequencies can be mapped, can be a very useful procedure,22 but for many purposes it is simply impractical. The procedure is very time-consuming and potentially too fatiguing and laborious for the patients, especially if they suffer from neurologic deficits. The different f0/SPL combinations produced by a given individual do in fact yield varying and potentially informative acoustic qualities,18,23 but the amount of measurement and resulting data may also be too laborious for the analyst (not to mention the patient), and overwhelming for the interpreter. Figure 9.2 outlines a procedure that can be followed as an abbreviated voice range profile to assay basic effects of f0 and SPL changes on vocal qualities such as jitter, shimmer, and tremors. It is especially simple in conjunction with a real-time pitch and energy program (such as KayPENTAX’s [Lincoln, NJ] Real-Time Pitch™ software24), as the protocol begins with a quick sampling of the patient’s preferred pitch and loudness settings (as obtained from repetitions of a brief sentence, for example). The real-time visual program can then be used to help the patient produce samples that deviate systematically from this central setting. The values depicted in Fig. 9.2 (one octave up, 10 dB up, three semitones down, and a “soft but high pitch”) are reasonable targets for speakers with unimpaired laryngeal control (so failure to achieve them may itself be diagnostic), and the tasks therefore allow a quick assessment of both the control limitations experienced by an impaired individual and the variations in quality that these tasks generally elicit. 
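The pitch and loudness targets named above are simple offsets from the habitual setting, so they can be computed as soon as the preferred f0 and SPL have been sampled. The following Python sketch is offered only as an illustration. The function and key names are hypothetical, the habitual values in the example are invented, and where the text gives no number (the SPL to hold during pitch shifts, the f0 during the loudness increase, and the "soft but high pitch" task) the sketch records `None` rather than guessing.

```python
import math


def shift_semitones(f0_hz, semitones):
    """Return f0 shifted by a number of semitones (12 semitones = 1 octave)."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def semitone_distance(f_hz, ref_hz):
    """Signed distance in semitones from ref_hz to f_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def abbreviated_vrp_targets(habitual_f0_hz, habitual_spl_db):
    """Targets for an abbreviated voice range profile.

    None marks a value the protocol description leaves unspecified.
    """
    return {
        "habitual": {"f0_hz": habitual_f0_hz, "spl_db": habitual_spl_db},
        "octave_up": {"f0_hz": shift_semitones(habitual_f0_hz, 12), "spl_db": None},
        "louder_10_db": {"f0_hz": None, "spl_db": habitual_spl_db + 10.0},
        "three_semitones_down": {
            "f0_hz": shift_semitones(habitual_f0_hz, -3),
            "spl_db": None,
        },
    }


if __name__ == "__main__":
    # Invented habitual values, for illustration only.
    for task, target in abbreviated_vrp_targets(200.0, 70.0).items():
        print(task, target)
```

`semitone_distance` can then be applied to the f0 a patient actually achieves, to express any shortfall from a target in semitones.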
Although these tasks may also be accompanied by other standard tasks, such as a self-selected maximum sustained vowel production, the protocol should suppress the tendencies for such vowel productions to deviate from habitual spoken levels, to be “sung,” or to vary excessively across sessions and patients. The target of high f0/low SPL is especially valuable as a representation of vocal fold vibratory status at phonation threshold in a loft register. Productions under such “borderline” conditions can reveal laryngeal control difficulties that may not occur at the patient’s self-selected levels. Among the most widely used tools for acoustic voice analysis is the KayPENTAX Multi-Dimensional Voice Program™.25 Numerous normative and pathologic databases have been collected using this program and collated along with application notes relevant to neurologic disorders of the larynx.26 Along with the program manual itself and related statements issued by the National Center for Voice and Speech (NCVS),12 users of this program now have many resources for optimal application of the instrument. Nonetheless, other practical experiences in our laboratories and elsewhere are not consistently reflected in those resources, and so we provide a list of pointers: (1) To implement NCVS standards, one must select the “MDVP Advanced” version of the program that is automatically installed along with the standard version. (2) Clarity in task instructions is critical, especially in neurologic assessments; for example, it seems difficult even for unaffected individuals to produce steady vocal amplitude (to the normative threshold implemented in the program) unless specifically concentrating on this aspect of vocal performance. 
(3) It should be noted that default normative samples built into the program database were collected with an (unspecified) target SPL that is audibly high, often causing an apparently abnormal “soft phonation index” report when patients self-select low-amplitude phonation (and probably also lowering the normative amplitude modulation criterion13). (4) Following NCVS standards, the analyst should obtain and average multiple samples (at least five), continuously filling the 3-second waveform panel whenever physiologically possible, and scrutinizing each sample to exclude those with variations that can be ascribed to anomalous or nonrepresentative productions. (5) The default compact “radial” display, albeit alluring, produces distracting visual elements, obscuring logical groupings of parameters and also the visual salience of values that are significantly lower than, rather than higher than, accepted norms, and also bypassing the explicit step of selecting appropriate gender norms. Analysts are therefore strongly advised to select the “normative bars” graphic display type under the “Options, MDVP” menu. (6) It is critical to inspect the “View B” display of f0 and peak-to-peak amplitude analysis and verify optimal analysis. Under some circumstances a repeat analysis implementing optional upper and lower pitch analysis range limits is required to avoid erroneous f0 determinations. 
(7) Community standards seem unclear regarding whether the cycle-to-cycle pitch and amplitude variations, generally known respectively as jitter and shimmer, should incorporate the deterministic cycle-to-cycle variations caused by subharmonics or focus exclusively on the random variations that are probably more clearly associated with neurologic sources,27–29 but it is clear that inclusion of subharmonics generally overwhelms typically lower-level random variations in the final jitter and shimmer metrics.30 Although laborious, it may therefore be advisable to excise visible episodes of subharmonic vibratory patterns from waveforms in order to obtain more focused jitter and shimmer samples. (8) Not all MDVP parameters are uniquely informative. The normative bars display (see point 5, above) is arranged in logical groupings. As supported by factor analyses of MDVP output,31 analysts should observe these groupings and recognize that many items within the groups vary primarily only in technical aspects of the underlying algorithm and that they may not therefore target distinct aspects of laryngeal function. For example, general standards (as discussed in the program documentation itself) favor relative average perturbation (RAP) among the various jitter metrics and amplitude perturbation quotient (APQ) among the various shimmer metrics provided by the program. (9) Of special interest in many neurologic disorders, the frequencies and modulation depths of both f0 and SPL tremors may be presented in MDVP output (if they exceed detection thresholds), but analysts may not always avail themselves of the program controls (specifically in “View D” displaying “f0 and Amplitude Modulation Components”) that can be used to override automatic selections in favor of the modulation frequencies of greatest interest. 
For example, the dominant modulation frequencies seen in untreated essential tremor may not always be the dominant frequencies posttreatment, though it may nonetheless be of greatest clinical value to focus on identical frequencies when sampling across such sessions.32

Acoustical studies of neurologically disordered voices may be divided into three major categories: (1) assessment and diagnosis, (2) onset and long-term course, and (3) treatment outcome. Selected studies representing these categories are provided in Tables 9.1 to 9.3, in which they are characterized in terms of both acoustic measures employed and the neurodiagnostic samples represented. Studies of assessment and diagnosis (Table 9.1) are most numerous. They focus mainly on characterizing the acoustic voice characteristics of specific neurologic disorders such as Parkinson disease or cerebellar ataxia. Some also quantify the severity of phonatory dysfunction within a specific disorder. A few studies have attempted to diagnostically differentiate between neurodiagnostic subgroups or to differentiate neurodiagnostic subgroups from other nonneurologic etiologies, based on acoustic voice parameters, with varying degrees of success.
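Returning briefly to the perturbation metrics named in the analysis pointers (RAP for jitter, APQ for shimmer): both belong to one family of smoothed perturbation quotients, which helps explain why many related parameters differ mainly in technical detail. The Python sketch below is a generic textbook formulation and is not a reproduction of the MDVP algorithm. The function name is hypothetical, the 3-point window for RAP follows the common definition, the 11-point window for APQ is an assumption about the usual default, and the input values are invented.

```python
def perturbation_quotient(values, window=3):
    """Smoothed perturbation quotient, in percent.

    For each cycle that has a full centered window, take the absolute
    difference between its value and the mean of the `window` cycles
    centered on it; average those differences and divide by the mean
    of all values.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    n = len(values)
    if n < window:
        raise ValueError("need at least `window` cycles")
    half = window // 2
    total = 0.0
    for i in range(half, n - half):
        local_mean = sum(values[i - half:i + half + 1]) / window
        total += abs(values[i] - local_mean)
    mean_deviation = total / (n - window + 1)
    return 100.0 * mean_deviation / (sum(values) / n)


if __name__ == "__main__":
    # Invented cycle periods (ms) and peak amplitudes (arbitrary units).
    periods_ms = [5.00, 5.02, 4.99, 5.01, 5.03, 4.98, 5.00, 5.02,
                  4.99, 5.01, 5.00, 5.02, 4.98]
    amplitudes = [1.00, 0.98, 1.01, 0.99, 1.02, 1.00, 0.97, 1.01,
                  1.00, 0.99, 1.02, 0.98, 1.00]
    print("RAP-style jitter (%):", perturbation_quotient(periods_ms, window=3))
    print("APQ-style shimmer (%):", perturbation_quotient(amplitudes, window=11))
```

A strictly alternating long-short period pattern, of the kind a subharmonic produces, yields a large quotient from this calculation even though nothing in it is random, which is consistent with the caution in pointer 7 about subharmonic episodes dominating jitter and shimmer values.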
Acoustic Assessment of Vocal Function
Considerations When Evaluating Acoustic Measures of Voice
Acquisition Strategies
Analysis Strategies
Literature Review
Reference | Measure | Disorder |
Rahn, Chou, Jiang, Zhang (2007)53 | jitter, shimmer, NLD | PD |
Zhang, Jiang, Biazzo, Jorgenson (2005)54 | jitter, shimmer, NLD | VFP |
Feijó, Parente, Behlau, Haussen, De Veccino, Martignago (2004)55 | spectrography, f0 var, shimmer, jitter | MS |
Lundy, Roy, Xue, Casiano, Jassir (2004)56 | f0 mean/var, f0/SPL modulations | ALS, SD, ET |
Dromey (2003)57 | f0 mean/var, SPL, jitter, shimmer, HNR, LTAS | Hypokinet. Dys. |
Kelchner, Lee, Stemple (2003)58 | f0 mean, SPL, NHR | VFP |
Heman-Ackah, Michael, Goding (2002)59 | jitter, shimmer, NHR, CPP | Dysphonia |
Hanson, Stevens, Kuo, Chen, Slifka (2001)7 | harmonic amplitudes | Spasticity, Ataxic, Athet. Dys. |
Morsomme, Jamart, Wéry, Giovanni, Remacle (2001)60 | f0/SPL mean/var, shimmer, and jitter | VFP |
Jiang, Lin, Hanson (2000)61 | SPL modulations | PD, Cerebell., UMND, ET |
Kent, Kent, Duffy, Thomas, Wiesmer, Stuntebeck (2000)62 | f0 mean/var, SPL var, jitter, shimmer | Atax. Dys. |
Murry, Sapienza, Walton (2000)63 | dynamics | SD, MTD |
Sherrard, Marquardt, Cannito (2000)64 | jitter, shimmer, HNR, f0 var, breaks | SD, PBD |
Kent, Vorperian, Duffy (1999)65 | MDVP | CVA, PD, etc. |
Sapienza, Walton, Murry (1999)66 | dynamics | SD |
Robert, Pouget, Giovanni, Azulay, Triglia (1999)67 | f0 mean/var, SPL, jitter, shimmer | ALS |
Eckley, Sataloff, Hawkshaw, Speigel, Mandel (1998)68 | f0 var | VFP |
Le Dorze, Ryalls, Brassard, Boulanger, Ratte (1998)69 | f0 mean/var | PD, Friedreich’s |
Gamboa, Jiménez-Jiménez, Nieto, et al (1998)70 | f0 mean/var, jitter, shimmer, HNR | ET |
Gamboa, Jiménez-Jiménez, Nieto, et al (1997)71 | f0 mean/var, jitter, shimmer, HNR | PD |
Hertrich, Lutzenberger, Spieker, Ackermann (1997)72 | fractal dimension | PD, cerebell. |
Walker (1997)73 | f0 var | MG |
Doyle, Raade, St. Pierre, Desai (1995)74 | f0 mean/var | Hypokinet. Dys. |
Hertrich, Ackermann (1995)75 | f0 mean, jitter, shimmer, HNR | PD, HD |
Murry, Brown, Morris (1995)76 | f0 mean/var | VFP |
Ackermann, Ziegler (1994)77 | f0 mean/var, jitter | Atax. dys., dysphonia |
Strand, Buder, Yorkston, Ramig (1994)78 | f0 mean/var, jitter, shimmer, SNR |