Chapter 16 – An Overview of Machine Learning Applications in Mood Disorders











Natasha Topolski, Su Hyun Jeong, and Benson Mwangi



16.1 Machine Learning: An Answer to Historic Challenges in Psychiatry?


Advances in our understanding of the human body and of technology have revolutionized modern medicine, allowing us to treat many conditions that were once considered a death sentence. Improved understanding of biological processes and the development of disease biomarkers have driven the growth of “precision medicine,” which enables more objective diagnoses and individualized treatments that are more efficient and effective. Integrating precision medicine into the diagnosis and treatment of disease is now commonplace and growing in many areas of medicine, notably the use of genomics in oncology. However, diagnosis and treatment in psychiatry remain largely dependent on observable, subjective symptoms, without objective biomarkers (1, 2). In addition, individual variability among patients contributes to wide variation in responses to psychiatric treatment; for example, after initial treatment, over 50% of patients with major depressive disorder do not reach remission (3–5). Psychiatric research studies have suggested that there are biologically defined “subgroups” or “biotypes” of mental disorders, an observation that has pushed for a shift toward classifying psychiatric conditions as “brain disorders” (2). To elucidate these subgroups, the National Institute of Mental Health developed the Research Domain Criteria (RDoC), which aim to determine the mechanisms that result in dysfunction through basic science rather than symptomatology. The RDoC framework calls for research that integrates behavioral, biological, and environmental factors to facilitate the development of objective measures of psychopathology (6). Most notably, though, such an undertaking requires massive data collection and data analysis methods that go beyond the abilities of traditional statistical approaches.
Consequently, machine learning (ML) techniques have provided a promising avenue to analyze large datasets acquired in psychiatric research and support new discoveries. Briefly, ML is a branch of computer science and artificial intelligence that involves developing and validating algorithms that can learn from patterns gleaned from large datasets and subsequently allow predictions on previously “unseen” observations (7). Therefore, due to their ability to handle high-dimensional and large datasets, ML techniques and algorithms are well suited to be a key player in the redefinition of clinical tools used in the diagnosis and treatment of mood disorders (7). In this chapter, we will briefly discuss key concepts used in ML and explore how such concepts and ensuing tools are used in the study and treatment of mood disorders.



16.2 Machine Learning Techniques


ML techniques can be classified into three broad categories, namely supervised ML, unsupervised ML, and reinforcement learning. In this section, we briefly explore these broad categorizations and introduce specific use cases for such methods in the context of research in mood disorders.



16.2.1 Supervised ML


In supervised learning, a ML algorithm is developed and “trained” using a set of observations with corresponding labels. For example, in the context of a mood disorders study, the observations may be neuroimaging scan data from healthy controls and patients with bipolar disorder (BD), coupled with corresponding labels (BD +1, healthy controls −1) (Figure 16.1). These observations are used to “train” an algorithm to recognize the characteristics in the data (here, the neuroimaging scans) that differentiate the target groups (e.g., healthy controls vs. BD patients). The resulting “trained” algorithm is then evaluated on a subset of “novel” labeled observations not included in the “training” process (8, 9). The most commonly used supervised ML techniques in the mood disorders domain include support vector machines (SVMs), relevance vector machines (RVMs), Elastic Net, and the Least Absolute Shrinkage and Selection Operator (LASSO), among others, as highlighted in Table 16.1. Typical clinical and research applications of supervised ML currently include disease predictive classification (e.g., healthy vs. bipolar disorder (10)) and decoding of continuous clinical scales (e.g., the Beck Depression Inventory (11)) using biological data such as neuroimaging scans.
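As an illustration of this train-then-validate workflow, the following sketch uses synthetic features standing in for neuroimaging data and the +1/−1 labels from Figure 16.1. The data, class separation, and scikit-learn model choice are illustrative assumptions, not taken from any study cited in this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for neuroimaging features: 200 subjects x 50 features.
# Patients (+1) and healthy controls (-1) differ slightly in mean signal
# (an invented effect size, purely for demonstration).
rng = np.random.default_rng(0)
controls = rng.normal(0.0, 1.0, size=(100, 50))
patients = rng.normal(0.5, 1.0, size=(100, 50))
X = np.vstack([controls, patients])
y = np.array([-1] * 100 + [+1] * 100)

# Hold out a validation set the algorithm never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# "Train" a linear SVM to separate the two groups, then evaluate it on
# the held-out "novel" observations
clf = SVC(kernel="linear").fit(X_train, y_train)
print(f"validation accuracy: {clf.score(X_test, y_test):.2f}")
```

In a real study, the synthetic arrays would be replaced by extracted imaging features, and accuracy alone would be complemented by the metrics discussed later in Table 16.1.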





Figure 16.1 A supervised machine learning training protocol in mood disorders where a ML model/algorithm is “trained” to separate patients with mood disorders from healthy controls.




Table 16.1 Common methods used in machine learning pipelines

Methods | Model details and categorization (e.g., supervised or unsupervised)
Linear regression models


  • Regression analysis is a branch of classical statistics in which a model is developed that characterizes the relationship between a set of independent variables (predictors) and dependent variables (outcomes), often visualized as a line of best fit. In machine learning, regression is one of the most basic forms of supervised learning: models fitted to training data are used to predict outcomes for new input data



  • Common models: General linear regression, regularized regression (e.g., least absolute shrinkage and selection operator (LASSO) or Elastic Net)



  • For more information: Statistical learning with sparsity: the LASSO and generalizations (16)



  • Categorization: Supervised

Linear and nonlinear kernel-based models


  • Kernel-learning ML algorithms use a “kernel function” to convert the selected input predictors or features into a similarity matrix known as a “kernel” that is used to develop classification rules. Kernel functions can vary depending on the data and may include both linear and nonlinear functions (e.g., polynomial or Gaussian)



  • Common models: Support Vector Machine (SVM), Relevance Vector Machine (RVM)



  • For more information: An introduction to support vector machines and other kernel-based learning methods (17)



  • Categorization: Supervised

Decision trees


  • Decision trees are models that learn through simple heuristics or decision rules. Deeper decision trees encode more complex decision rules, which can improve fit to the training data but also increases the risk of overfitting



  • Common models: Random Forest, AdaBoost



  • For more information: The Elements of Statistical Learning (18)



  • Categorization: Supervised

Additive models


  • Additive models are flexible statistical models often used to characterize nonlinear data. In additive models, multiple functions are added together to create a smoother model that fits the data better than any of the individual functions. Each function in the model retains its form, allowing for relatively simple interpretability (18)



  • Common models: Generalized Additive Model



  • For more information: The Elements of Statistical Learning (18)



  • Categorization: Supervised

Artificial neural networks and deep learning


  • Artificial neural networks (ANNs) are designed to recognize nonlinear patterns in a dataset and make appropriate predictions (e.g., disease vs. healthy control classification) by loosely mimicking how the human brain processes information. Artificial neurons are arranged into multiple layers: the layer that receives input data is referred to as the input layer, while the output layer returns predicted results. It is common practice to have additional layers between the input and output layers, referred to as hidden layers. The most recent category of ANNs, deep learning neural networks, utilizes many hidden layers, with networks often containing millions of trainable parameters



  • Common models: Feedforward Neural Networks, Convolutional Neural Networks



  • For more information: Deep learning for neuroimaging: a validation study (19)



  • Categorization: Supervised

Multivariate data dimensionality reduction


  • There are many multivariate data dimensionality reduction techniques used in mood disorders and in particular neuroimaging, such as principal component analysis (PCA (20)), independent component analysis (ICA (21)), multidimensional scaling (MDS (22)), local linear embedding (LLE (23)), nonnegative matrix factorization (NNMF (24)), and t-distributed stochastic neighbor embedding (t-SNE (25)). We briefly explore the two most common in neuroimaging and mood disorders research (i.e., ICA and PCA). ICA is a multivariate, data-driven dimensionality reduction technique belonging to the broader category of blind-source separation methods (21, 26), which are used to separate data into underlying independent information components. ICA separates a set of “mixed signals” (e.g., raw data from an fMRI scan) into a set of independent and relevant features (e.g., behavioral paradigm-related signals in fMRI). PCA, on the other hand, transforms correlated variables into a smaller subset of uncorrelated variables referred to as principal components. The resulting principal components capture most of the variance in the data and are linear combinations of the original or raw data (27). In summary, these techniques are commonly used to separate relevant signal from noise (i.e., denoising) as well as to overcome the “curse-of-dimensionality” or “small-n-large-p” problems discussed elsewhere in this chapter



  • Common models: Principal Component Analysis, Independent Component Analysis



  • For more information: A review of feature reduction techniques in neuroimaging (15)



  • Categorization: Unsupervised

Multidimensional data clustering


  • Multidimensional data clustering is a form of unsupervised ML that entails grouping observations that are “similar” in a high-dimensional space (e.g., >3 dimensions) into clusters or groups. The characteristics that determine group similarity may include distance measures, such as the Euclidean distances among observations, or statistical distributions. There are several data clustering algorithms, such as K-means (28), mean shift (29), and hierarchical clustering (30); K-means is by far the most commonly used in this field. It is common practice to perform data dimensionality reduction using PCA, ICA, t-SNE, or other techniques before implementing data clustering



  • Common models: K-Means, Mean-Shift, Hierarchical Clustering



  • For more information: Phenomapping: Methods and measures for deconstructing diagnosis in psychiatry (31)



  • Categorization: Unsupervised

Model evaluation metrics


  • Machine learning algorithms are evaluated using multiple metrics, largely depending on the use case. In a supervised predictive classification application (e.g., distinguishing MDD patients from healthy controls), it is common practice to use prediction accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), receiver-operating characteristic (ROC) curves, and the area under the ROC curve (AUROC). These evaluation metrics are also commonly used in biostatistics and diagnostic medicine, as described elsewhere (32). Supervised predictive regression applications, which predict or decode continuous variables or outcomes (e.g., Beck Depression Inventory scores), use classical statistical measures such as the Pearson correlation coefficient, coefficient of determination, mean absolute error (MAE), and root mean square error (RMSE) (28). In a nutshell, these metrics quantify the statistical relationship between the “actual” continuous variables and those predicted by the supervised ML model. Evaluation metrics for unsupervised ML differ because there is no “ground truth” (e.g., no comparison between actual and predicted variables). In unsupervised data clustering, the silhouette index value (SIV) (33) is by far the most popular metric used to select the number of clusters in a dataset and to assess cluster validity. Briefly, the SIV quantifies the similarity of a data point to other points within its own cluster as compared to data points in other clusters. Recent unsupervised ML studies in mood disorders have largely used this metric to evaluate model outcomes (34–37). Other data clustering metrics include Dunn’s cluster validity index (38), the Davies–Bouldin index (38, 39), the gap statistic (40), and the C-index (41). For a review of unsupervised ML evaluation metrics, the reader is pointed to (42)



  • Common metrics: Prediction accuracy, Specificity, Sensitivity, ROC, AUROC, Silhouette Index Value



  • For more information: Pattern recognition and machine learning (43)
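As a hedged illustration of the supervised classification metrics listed in Table 16.1, the following sketch computes sensitivity, specificity, and AUROC from made-up predictions; the labels and scores are invented for demonstration and are not data from any study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy example: 1 = patient, 0 = healthy control (hypothetical labels),
# with a classifier score for each subject and a 0.5 decision threshold
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

# Unpack the 2x2 confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
auroc = roc_auc_score(y_true, y_score)  # threshold-free ranking metric
print(sensitivity, specificity, auroc)
```

Note that sensitivity and specificity depend on the chosen threshold, while the AUROC summarizes performance across all thresholds.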



16.2.2 Unsupervised Learning


Unlike supervised ML, where the input data are labeled (e.g., disease +1 vs. healthy −1), in unsupervised ML, the input data are not labeled, and the main goal is to find hidden patterns within a dataset. Therefore, unsupervised ML techniques largely utilize data dimensionality reduction methods (e.g., principal component analysis) coupled with data clustering techniques (e.g., K-means) to identify hidden patterns and clusters within a dataset. Unsupervised ML techniques have recently been used to identify unique biological groupings or clusters in mood disorders – also known as “biotypes” (12).
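A minimal sketch of this two-step unsupervised workflow (dimensionality reduction, then clustering), assuming synthetic unlabeled data with a hidden three-group structure; the group structure and parameters are illustrative, not drawn from any biotype study.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Unlabeled synthetic data: 300 observations in 50 dimensions containing
# three hidden groups (the true labels are discarded, as in unsupervised ML)
X, _ = make_blobs(n_samples=300, centers=3, n_features=50, random_state=0)

# Step 1: reduce dimensionality before clustering
X_low = PCA(n_components=5).fit_transform(X)

# Step 2: cluster, choosing the number of clusters k by the mean
# silhouette value across candidate values of k
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_low)
    scores[k] = silhouette_score(X_low, labels)

best_k = max(scores, key=scores.get)
print("clusters suggested by silhouette:", best_k)
```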



16.2.3 Reinforcement Learning


Reinforcement learning entails “training” an algorithm to take specific actions that maximize a cumulative reward. Notably, these algorithms mimic the human decision-making process, in which there is often an arbitrary number of actions to choose from, and learn from positive outcomes (i.e., reward) or negative outcomes (i.e., punishment). Typical examples of reinforcement learning applications include mapping positive and negative prediction errors to the firing of dopaminergic neurons in mood and affective disorders (13, 14). Most recently, reinforcement learning algorithms are increasingly being used to select optimal treatments (e.g., antidepressants), as they mimic the trial-and-error process used in selecting treatments during clinical practice.
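As a toy illustration of this trial-and-error idea, the following epsilon-greedy “bandit” sketch chooses among three hypothetical treatments with invented response probabilities; it is a didactic sketch, not a clinical algorithm from the literature.

```python
import random

random.seed(0)

# Hypothetical toy problem: three treatments with unknown response
# probabilities; the agent learns which to prefer by trial and error.
true_response = [0.35, 0.60, 0.50]   # assumed values, not from the chapter
counts = [0, 0, 0]                   # times each treatment was tried
values = [0.0, 0.0, 0.0]             # running estimate of each reward
epsilon = 0.1                        # fraction of trials spent exploring

for trial in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(3)            # explore a random treatment
    else:
        arm = values.index(max(values))      # exploit the best estimate
    reward = 1 if random.random() < true_response[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best = values.index(max(values))
print(f"preferred treatment: {best}, estimated response: {values[best]:.2f}")
```

The punishment/reward signal here is a simple 0/1 response; in practice the reward design and action set are the hard parts of applying reinforcement learning clinically.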


Beyond the three categories of ML algorithms highlighted earlier (i.e., supervised, unsupervised, and reinforcement learning), there are several overarching concepts that guide practitioners in establishing and validating ML algorithms before they are reported in research products or deployed for clinical purposes. We introduce these concepts below.



16.2.4 Selection of Algorithm Training and Validation Samples


An “objective” ML algorithm is one that is able to “generalize” its results to a novel sample to which it was not previously exposed. Therefore, the first step of a ML project entails splitting the dataset into independent “training” and “validation” sets. The “training” set is used to “train” the algorithm by identifying the best algorithm parameters, while the “validation” set is used to establish whether the final algorithm/model generalizes by making accurate and objective predictions. Consequently, it is common practice to perform this split before any model development begins.



16.2.5 Feature and Data Dimensionality Reduction


Raw data, particularly in psychiatric research domains such as neuroimaging and genomics, are often acquired in high dimensions (e.g., >100,000 voxels) and may contain measurement noise. In the context of ML, this is referred to as the “curse-of-dimensionality” or “small-n-large-p” problem: there is a significantly larger number of predictors (e.g., neuroimaging voxels) than observations (i.e., subjects) (15). This can greatly hamper a ML algorithm, which may fit noise rather than true signal, a problem known as overfitting (15). To circumvent this problem, data dimensionality reduction and feature reduction tools, such as principal component analysis (PCA) or univariate t-tests among other techniques, are employed to extract a subset of features or predictors (e.g., neuroimaging voxels) that are meaningful to the ML task at hand. This subset, rather than the original raw data, is then used to “train” the ML model. Previous research in this domain has shown that such feature reduction techniques lead to ML models with higher accuracy and better generalization ability, that is, models able to make accurate predictions on previously “unseen” observations in a validation sample.
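A minimal sketch of feature reduction in a “small-n-large-p” setting, assuming synthetic data (80 subjects, 2,000 predictors) and a univariate F-test filter as a stand-in for the t-test approach described above (for two groups the F-test is equivalent to a t-test). All sizes and effect strengths are invented for demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# "small-n-large-p": 80 subjects, 2,000 predictors, only 20 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2000))
y = rng.integers(0, 2, size=80)
X[y == 1, :20] += 1.0            # inject a group difference in 20 features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Univariate F-test keeps the strongest predictors; crucially, selection
# is fit on the training set only, never on the validation set
selector = SelectKBest(f_classif, k=50).fit(X_tr, y_tr)
clf = SVC(kernel="linear").fit(selector.transform(X_tr), y_tr)
acc = clf.score(selector.transform(X_te), y_te)
print(f"validation accuracy with 50 of 2,000 features: {acc:.2f}")
```

Fitting the selector on the full dataset before splitting would leak information from the validation sample and inflate the reported accuracy.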



16.2.6 Model Training and Parameter Optimization


Training a ML algorithm entails establishing parameters that maximize prediction accuracy and promote generalizability to a novel or previously “unseen” sample. To achieve this goal, it is common practice to use cross-validation to select “best-fit” parameters. In N-fold cross-validation (e.g., 10-fold or 5-fold), the data are randomly separated into N subgroups; the algorithm is “trained” on N − 1 subgroups and tested on the left-out group, and this is repeated so that each group is left out once, yielding estimates of model prediction error and accuracy across the N folds. Upon completion, the model parameters with the highest accuracy or lowest error are selected to establish the final model. The final accuracy on a validation sample determines the generalizability of the model. In Table 16.1, we have briefly outlined key ML techniques and their categorizations.
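The N-fold procedure can be sketched as follows, here with 5-fold cross-validation used to select the SVM regularization parameter C on synthetic data; the dataset and parameter grid are an illustrative setup, not from the chapter's studies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic classification problem standing in for real study data
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold,
# rotating so every fold is left out once; keep the best-scoring C
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=5).fit(X, y)
print("best C:", search.best_params_["C"])
print(f"mean cross-validated accuracy: {search.best_score_:.2f}")
```

The cross-validated score guides parameter choice only; as the text notes, generalizability is ultimately judged on a separate validation sample.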



16.3 Applications of Machine Learning Techniques to Neuroimaging and Clinical Data in Mood Disorders



16.3.1 Diagnostic Classification of Mood Disorders, Decoding Clinical Variables, Identification of Unique Disease Subtypes and Supporting Mechanistic Understanding


Despite recent progress, our understanding of the mechanistic pathophysiology of major mood disorders such as BD and major depressive disorder (MDD) remains limited. Early neuroimaging studies used mass-univariate statistical methods coupled with neuroimaging scan data to elucidate critical insights into brain structural and functional differences between patients with mood disorders and healthy controls. For example, through these studies, fronto-limbic structural abnormalities in BD patients were reported (44). In addition, volumetric and structural connectivity abnormalities in the anterior cingulate cortex (ACC) in patients with MDD have also been reported (45). More recently, neuroimaging studies have leveraged ML techniques to classify or distinguish individual patients with mood disorders from healthy controls. For example, in a systematic review of fifty-one research studies, Librenza-Garcia and colleagues observed that ML coupled with structural and functional neuroimaging scans can accurately differentiate BD patients from healthy controls and from other psychiatric diagnoses such as MDD (10). Another recent systematic review observed gray matter volume reductions in the bilateral insula, right superior temporal gyrus, bilateral anterior cingulate cortex, and left superior medial frontal cortex in MDD and BD patients as compared to healthy controls (44). Predictive white matter abnormalities in the genu of the corpus callosum were also observed in both MDD and BD patient groups (44). Other studies have attempted to predict or decode continuous clinical rating scales from neuroimaging scans, followed by examination of the brain regions involved in predicting such scales. For example, Mwangi and colleagues (11) reported prediction of the self-reported Beck Depression Inventory (BDI) using structural neuroimaging scans coupled with a kernel-based relevance vector regression ML algorithm in patients with MDD.
This study reported a Pearson correlation coefficient of 0.694 (p < 0.0001) between actual and predicted BDI scores. Furthermore, the medial frontal cortex, superior temporal gyrus, and parahippocampal gyrus were heavily involved in decoding the BDI scores in patients with MDD. In another study (46), BDI and Snaith-Hamilton Pleasure Scale (SHAPS) scores were accurately predicted in a cohort of fifty-eight patients with MDD using a supervised linear regression ML technique applied to functional connectivity data, with several functional networks associated with anhedonia and negative mood identified as the main contributors. Another study predicted Functioning Assessment Short Test (FAST) scores (47) in a cohort of thirty-five patients with BD type I using a supervised support vector regression ML algorithm and structural neuroimaging scan data (48). The FAST score measures functional impairment in BD and was predicted by volumetric reductions in the left superior and left rostral medial frontal cortex as well as enlargement of the right lateral ventricle. This indicates that a supervised ML algorithm together with structural neuroimaging scans can predict functional impairment in BD patients. In a similar vein, multinational studies from the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) consortium have also reported successful diagnostic classification of MDD (49) and BD (48, 50) patients versus healthy controls using neuroimaging scans from thousands of patients acquired at multiple centers around the world.


Recently, there has been a shift in psychiatric research toward identification of data-driven disease subtypes, also referred to as phenomapping, which has partly been inspired by the NIMH’s RDoC framework (6). Researchers have leveraged unsupervised ML techniques, such as multivariate data dimensionality reduction coupled with high-dimensional data clustering algorithms, to identify unique disease subtypes in BD and MDD. For instance, Wu and colleagues (37) used an unsupervised ML approach to cluster neurocognitive data from BD-I and BD-II patients into two distinct subtypes. The data-derived subtypes were subsequently validated using a linear regression Elastic Net ML algorithm coupled with fractional anisotropy (FA) and mean diffusivity (MD) measures from brain diffusion tensor imaging (DTI), with 92% and 75.9% accuracy, respectively. Abnormalities in the inferior fronto-occipital fasciculus and the forceps minor of the corpus callosum white matter tracts were found to be major contributors in separating the two data-derived subtypes of BD from healthy controls. In another study (51), a data-driven approach was used to identify transdiagnostic subtypes of mood disorders spanning multiple clinical diagnoses: a hierarchical data clustering algorithm identified unique subgroups that were subsequently validated in an independent sample. However, although the phenomapping literature shows promising results, we should remain cautiously optimistic, as attempts to replicate such disease subtypes in independent samples have in some cases not been successful (52).



16.3.2 Prediction of Treatment Response


Prediction of treatment response, such as identifying individual patients with MDD who are likely to respond to a particular antidepressant, is a well-documented problem in psychiatry (53, 54). In the past decade, a plethora of studies in mood disorders have employed ML techniques to predict individual patients’ likelihood of responding to antidepressants or mood stabilizers. For instance, Webb and colleagues (55) examined whether a ML technique could recommend individualized treatment in an eight-week trial of sertraline versus placebo with a cohort of 216 depressed individuals. This study observed that a ML technique could identify a subset of MDD patients optimally suited for sertraline based primarily on a few clinical and demographic variables. Another study using the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) dataset (56) developed a supervised gradient boosting ML algorithm to predict which patients would benefit from citalopram following a twelve-week course of treatment. The algorithm achieved an accuracy of 64.6%, with twenty-five clinical variables selected by an Elastic Net ML algorithm as the top contributors to the observed accuracy. Numerous other studies have used a similar supervised ML approach to predict patients’ likelihood of response to antidepressants in MDD (57–60), electroconvulsive therapy in MDD (61, 62), and lithium in BD (63) using structural/functional neuroimaging scans, electroencephalogram (EEG), and clinical/demographic data. Although it is not yet common practice in psychiatry, recent studies in oncology have begun to use reinforcement learning algorithms to implement automated adaptive radiation protocols in treating lung cancer (64). The reinforcement learning approach may be particularly well suited for adaptive protocols in MDD, as it mimics the current gold standard of selecting optimal antidepressants through a “trial and error” process (65).
Lastly, although there has been significant progress in optimizing treatments for patients with mood disorders using ML techniques, the majority of studies have relied on retrospective data, and the resulting ML models have not yet been translated into actual clinical practice.
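As a rough sketch of the supervised gradient boosting approach described above, trained on synthetic stand-ins for clinical/demographic predictors of treatment response; no STAR*D data are used, and the setup and accuracy are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for clinical/demographic predictors of response
# (responder = 1, non-responder = 0); invented, not real trial data
X, y = make_classification(n_samples=400, n_features=25, n_informative=6,
                           weights=[0.5, 0.5], random_state=0)

# Gradient boosting builds an ensemble of shallow trees sequentially,
# each correcting the errors of the previous ones
model = GradientBoostingClassifier(random_state=0)
acc = cross_val_score(model, X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")

# Inspect which predictors contributed most to the fitted model
importances = model.fit(X, y).feature_importances_
print("top predictors:", importances.argsort()[::-1][:5])
```

In a real response-prediction study, the feature importances would be mapped back to named clinical variables, as in the STAR*D analysis described above.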



16.3.3 Prediction of Other Clinical Outcomes Such As Suicide, Medication Side Effects and Clinical Staging


ML techniques have also been a powerful asset in assessing and predicting other clinical outcomes, such as suicidality and medication side effects, and, to some extent, recent studies have been successful at establishing disease stages. Two recent studies used large electronic medical record (EMR) datasets as input predictors with a number of supervised ML algorithms (e.g., Elastic Net, Random Forest, and LASSO) and managed to predict suicide risk among patients in a psychiatric hospital or emergency department with specificity and sensitivity greater than 0.7 (66, 67). Interestingly, Passos and colleagues (68) reported accurate identification of individual suicide attempters (accuracy = 72%, sensitivity = 72.1%, and specificity = 71.3%) in a preliminary study with a cohort of 144 patients with BD and MDD. The kernel-based relevance vector ML technique used in this study identified previous hospitalizations for depression, a history of psychosis, cocaine dependence, and comorbid posttraumatic stress disorder (PTSD) as the most relevant predictors of suicide attempt in mood disorders. This further highlights that ML techniques can not only aid in identifying psychiatric patients at risk of attempting suicide but can also point researchers to clinical factors that contribute to such events and open novel avenues for clinical intervention. Prediction of medication side effects has also shown promise as a prime application for ML techniques. For example, although lithium is a first-line treatment in BD, the associated risk of renal insufficiency reportedly discourages its use (69). A study of 5,700 patients receiving lithium applied a regression ML technique to EMR data and was able to predict renal insufficiency risk with an area under the curve (AUC) of 0.81 (69).
The authors observed that older age, female sex, history of smoking, history of hypertension, overall burden of medical comorbidity, and a diagnosis of schizophrenia or schizoaffective disorder were the major contributing factors in predicting renal insufficiency among those receiving lithium treatment. This highlights that such ML tools can support clinicians in making informed decisions and facilitate the development of strategies that reduce negative outcomes such as side effects. Lastly, we highlight the use of ML techniques in predicting and validating disease stages in mood disorders. A recent study showed that structural brain scans can not only distinguish BD patients from healthy controls but also revealed that a subgroup of patients characterized by more lifetime manic episodes and psychiatric hospitalizations had markedly greater gray and white matter density loss (70). The authors concluded that ML coupled with structural neuroimaging scans is able to stratify BD patients into clinical stages (e.g., early-stage vs. late-stage BD), in line with the recently proposed clinical staging model of BD (71–74).
